
SOME RANDOM NUMBERS

The random numbers below are a small fraction of the 10,000 generated in the early 1900s by L. H. C. Tippett, then a member of the Biometric Laboratory at University College, London. Tippett generated the numbers, which were used in statistical sampling procedures, by selecting 40,000 single digits from census reports and combining them by fours. The collection of numbers was originally handwritten; the excerpt here is reprinted, with permission, from a version published in 1959 by Cambridge University Press (Random Sampling Numbers, Tracts for Computers, edited by E. S. Pearson, Number 15).

[The table of random numbers is not reproduced here.]

particular problem. There are many things you can do with problems besides solving them. First you must define them, pose them. But then of course you can also refine them, depose them, or repose them, even dissolve them! A given problem may send you looking for analogies, and some of these may lead you astray, suggesting new and different problems, related or not to the original. Ends and means can get reversed. You had a goal, but the means you found didn't lead to it, so you found a new goal they did lead to. It's called play. Cyril Smith has argued persuasively, and with good historical evidence, that play, not utility, has long been the mother of invention (Smith 1981). Utility has been only the nursemaid. Creative mathematicians play a lot; around any really interesting problem they develop a whole cluster of analogies, of playthings. One of Stan's playthings, his "lucky numbers," got considerable attention (Gardiner, Lazarus, Metropolis, and Ulam 1956). These numbers, which are generated by a sieve quite like that for the primes, have no particular arithmetical properties; they are just lucky to survive the sieving. A number of us got involved in studying their long-run distribution, which turned out to be very close to that of the primes (Hawkins and Briggs 1957).
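The lucky-number sieve is easy to state in code. The sketch below is mine, not from the 1956 paper; it follows the usual formulation, in which the deletions go by position rather than by divisibility: strike out every second number, then let each successive surviving value n direct the removal of every nth survivor.

```python
def lucky_numbers(limit):
    """Sieve for the "lucky numbers" up to `limit`.

    Like the sieve of Eratosthenes, but positional: every 2nd number is
    struck out first, then every 3rd survivor (3 being the next survivor
    after 1), then every 7th survivor, and so on, each round directed by
    the next surviving value.
    """
    survivors = list(range(1, limit + 1, 2))  # every 2nd number already gone
    i = 1                                     # survivors[0] = 1 never sieves
    while i < len(survivors) and survivors[i] <= len(survivors):
        step = survivors[i]
        # strike out the survivors in positions step, 2*step, ... (1-based)
        survivors = [x for pos, x in enumerate(survivors, start=1)
                     if pos % step != 0]
        i += 1
    return survivors
```

Up to 100 the survivors are 1, 3, 7, 9, 13, 15, 21, 25, 31, 33, 37, 43, 49, 51, 63, 67, 69, 73, 75, 79, 87, 93, 99; like the primes, they thin out at roughly the 1/ln n rate.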

It is here I should mention an important one of Stan's contributions in the general grouping of nonstandard iterative processes. There was nothing lazy about his pursuit of a really good idea. "Pursuit" is probably the wrong word; it implies he already had the idea on the run. In the beginning it was maybe more like a roundup, a nudging together of possible example after possible example. I recall his ruminations about the Monte Carlo method in 1944, when he was already talking about it. More than fifty years ago there began to appear small compilations of random numbers. I remember the incredulity of a good physicist friend when I showed him such a listing. He knew Jahnke-Emde, of course, an old book of tables of almost every then-standard function. But random numbers? Statisticians were the wave of the future in those days. They alone used random numbers, for honest sampling procedures, even though nobody quite knew what "random" meant. But Stan foresaw that when high-speed computers should come along, they might well be used to imitate various deterministic or stochastic processes. Stan's phrase was "playing the game." The question was how to provide a computer with random numbers. I favored the built-in alpha counter, but that violated the sensibilities of mathematicians. There is indeed a grey area between chance and determinism, occupied by pseudo-random sequences. Even that early Stan was exploring it.

I think the most original part of Stan's early thinking about such matters was the idea that you could transform the equations describing a completely deterministic process into a mathematical form that also describes a stochastic one, and then you could get approximate solutions by playing the game repeatedly on a high-speed computer. But Monte Carlo is not my topic, except again as it affects my picture of Stan Ulam, teacher.

One of the topics we got on to later was a Monte Carlo approach to the theory of prime numbers. Pick a number at random from the neighborhood of some number N. The probability of picking a prime there is about 1/ln N, their approximate frequency. Why not reverse the process and produce a random sequence of numbers that mimics the primes? Try out each number N = 2, 3, 4, 5, ... against a game of chance for which the probability of "winning" is 1/ln N, and select it for the sequence only if it wins. The properties of such a sequence, or rather of many such sequences examined together, might throw light on some aspects of prime-number theory. The statistician Harald Cramér, it turned out, had already written about that.

My own next bright idea was to turn the sieve of Eratosthenes itself into a Monte Carlo device. Drop out half the numbers beyond 2, namely those that "lose" a game of chance for which the probability of losing is 1/2. Let N be the first survivor. Now drop out, of the numbers beyond N, those that lose another game of chance for which the probability of losing is 1/N. Keep repeating the process indefinitely, each time basing the sieving on the first survivor. One can play this game on a computer, which I did. But the theory of these "random primes" turned out to be not too difficult (Hawkins 1974), and it showed that the prime-number sequence could be regarded as one of an infinite family of sequences very much like it in their average properties. It supports some familiar conjectures about the primes and suggests others. The best result, I think, is that the famous and unproved Riemann hypothesis turns out to be true of "almost all" sequences generated by the random sieve. This hypothesis is a more recondite example of the kind of transformation I have talked about. It concerns the zeros of a certain function in the complex domain and, if true, implies a whole batch of propositions in number theory. Many of these can be proved independently, and none has been disproved. But some seem to be beyond the range of simple methods. That the Riemann hypothesis can be shown to be true of the random primes, and thus of almost all prime-like sequences, surely makes even more unlikely the possibility that the primes themselves should prove an exception.
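The random sieve fits in a few lines. A minimal sketch (the names and parameters are mine, not from Hawkins 1974): beyond each successive first survivor p, starting with p = 2, every remaining number independently "loses," and is dropped, with probability 1/p.

```python
import random

def random_sieve(limit, seed=None):
    """One realization of the "random primes" up to `limit`.

    Each round keeps the current first survivor p and deletes every
    remaining number independently with probability 1/p; the next
    survivor then directs the next round.
    """
    rng = random.Random(seed)
    numbers = list(range(2, limit + 1))
    survivors = []
    while numbers:
        p = numbers.pop(0)          # first survivor: keep it, sieve with it
        survivors.append(p)
        numbers = [n for n in numbers if rng.random() >= 1.0 / p]
    return survivors
```

A typical run up to 1000 yields a sequence whose counting function tracks that of the primes (there are 168 primes below 1000) to within statistical fluctuation.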

Here, finally, I should mention another component of Stan's work, one that I can also trace back to early Los Alamos days. It grew later to very substantial proportions. One beginning I recall was a discussion of a stochastic branching process that requires the "mating" of two "particles" from one generation in order to produce "offspring" for the next: sexual reproduction. Here the branching goes in both time directions, backward genealogically and forward by descent. The theory of this branching is essentially nonlinear. "Sex," Stan said, "is quadratic!" I had indeed examined one kind of nonlinear stochastic process, a chain reaction in which depletion of fuel, or of nutrient in the case of bacterial reproduction, is a factor. This led to a stochastic version of the well-known logistic curve of growth, which at first rises exponentially and then tapers off to a zero or negative slope. My work had a certain mathematical interest because it showed that the statistical fluctuations in such a nonlinear process can also change its average character; they don't "average out."

Such work as this might have stayed in abeyance except for Stan's development of other and much broader interests, namely in mathematical models of growth and reproduction. I remember approaching him with my own new-found interest in Claude Shannon's work on information theory and in the discovery of Watson and Crick. I wanted to define a measure of biological complexity, or organization, in information-theory terms, and we were immediately at loggerheads. He wanted to insist that very simple instructions could produce very complex patterns, and I that such simplicity would nevertheless limit the variety of such patterns. Each of us was defending a different meaning of "complex." I already knew of his work (or play) with computer-generated growth patterns (Ulam 1962) but hadn't realized fully the range of ideas he was bent on exploring. Once more it was that flanking move. The genetic instruction


of biological growth and reproduction is a vast and still mostly uncharted domain for investigation. But once more the "village idiot" could invent all kinds of very simple processes bordering that domain. The idea of "growing" elaborate dendritic patterns, "organisms," by the endless repetition of a few simple "genetic" instructions, applied in each cycle to the results of previous cycles, was another in the category of iterative processes that lay beyond the range of standard methods. It later became the basis for the famous "game of life." Was Stan its first inventor? I don't know. I connect this work also with Stan's important work on the nature of and approach to equilibrium in even slightly nonlinear iterative processes. In the years following he became quite deeply involved in more realistic problems of genetics, but I mostly lost touch.
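Such a simple "genetic" instruction is easy to exhibit. The rule below is only an illustrative member of the genre, not necessarily any rule Ulam himself studied: starting from one occupied lattice cell, each cycle occupies every empty cell that touches exactly one occupied cell, and dendritic arms emerge from the endless repetition.

```python
def grow(generations):
    """Iterative growth on the square lattice.

    One "genetic" instruction, applied each cycle to the results of the
    previous cycles: occupy every empty cell orthogonally adjacent to
    exactly ONE occupied cell.  (An Ulam-type rule chosen for
    illustration; the rules he studied varied.)
    """
    occupied = {(0, 0)}
    for _ in range(generations):
        candidates = {}
        for (x, y) in occupied:
            for nb in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                if nb not in occupied:
                    candidates[nb] = candidates.get(nb, 0) + 1
        occupied |= {cell for cell, count in candidates.items() if count == 1}
    return occupied
```

Two cycles already show the effect: diagonal neighbors of the first cross touch two occupied cells and stay empty, so the pattern grows arms rather than filling in.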

One of these problems, now well known and used in molecular genetics, came from Stan's deep familiarity with measure theory. Suppose a deck of cards can be shuffled only by several allowable operations. Knowing these and the end result of a shuffling, find the smallest number of allowable operations that accomplishes the given result, and call it the "distance" between the two orderings. Two decks of cards, or two nucleic acid strands, might appear very different in an item-by-item comparison yet be by shuffling history very close. Stan was a visiting professor at the University of Colorado's medical school when he worked on this, and I have a nice story from Theodore Puck. Stan got so interested in the mathematics (now not an iterative process) that he seemed to be ignoring the relevant biology. Reproached, he mended his ways. But he began his final talk on the subject with an imperative: "Ask not what mathematics can do for biology; ask rather what biology can do for mathematics!"
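For small decks the "distance" defined this way can be computed directly by breadth-first search over orderings. In the sketch below the two allowable operations are hypothetical stand-ins of my own choosing (the text does not specify a set):

```python
from collections import deque

def shuffle_distance(start, goal, operations):
    """Smallest number of allowable operations taking ordering `start`
    to ordering `goal`; each operation maps a tuple to a tuple.
    Brute-force breadth-first search, practical only for small decks."""
    start, goal = tuple(start), tuple(goal)
    frontier = deque([(start, 0)])
    seen = {start}
    while frontier:
        state, dist = frontier.popleft()
        if state == goal:
            return dist
        for op in operations:
            nxt = op(state)
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, dist + 1))
    return None  # goal unreachable with these operations

# Two sample allowable operations (illustrative, not from the text):
def cut(deck):       # move the top card to the bottom
    return deck[1:] + deck[:1]

def swap_top(deck):  # transpose the top two cards
    return deck[1:2] + deck[:1] + deck[2:]
```

Two orderings can look quite different item by item yet be one cut apart, which is the point of measuring by shuffling history rather than by direct comparison.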

In the sixties and seventies I became more and more concerned with practical and theoretical work relating to elementary-school education in mathematics and science, to "school-doctoring." Toward this new career of mine Stan was tolerant. We enjoyed good conversations but had little time for shared work. It was only last year that I was suddenly recalled to our earliest association, catching up on some work he had done in population genetics and related matters. With characteristic initial disregard for humdrum scholarship, he had reinvented and extended some of the existing theory, developed first by R. A. Fisher and Sewall Wright.

I had known generally about this work but had missed one small paper, one in which he and Jan Mycielski formulated the basic theory of stochastic pairing, the branching process involved in sexual reproduction (Mycielski and Ulam 1969). Its main focus was not, however, on the fluctuational aspect of the process but on the average distribution and evolution of mutations within a species. The paper set forth three measures of the "distance" between two individuals. I shall mention only one of these, proposed by Mycielski. It is simply the sum, over the present generation and all past generations, of symmetric differences in genealogy; that is, the number of entries present in one family tree and absent from the other, plus the number present in the other and absent from the one. Since sexual reproduction is already a stochastic process, this measure is genetically crude (for example, it ignores sibling diversity). But it is surely a plausible first (or if you wish, zeroth) approximation, a measure of purely genealogical, not yet of genetic, distance.

Stan had done (as he often had for other problems) some Monte Carlo simulations assuming a constant population size of 2N, random pairing between the N males and N females in each generation, and two offspring, one male and one female, per pairing. The simulations had told him that the genealogical distance (as defined above) between two randomly selected individuals from the same generation was, on the average, about 4N. It was shown subsequently, by Kahane and Marr, that the average distance is precisely 4(N − 1). Intrigued by all this and some explanation by Jan Mycielski, I found myself recovering some of our ancient lore, or reinventing it, and realized that we had been within an inch of this more recent thought, and then lost it, forty years ago!
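A simulation of this kind is easy to reproduce in outline. The sketch below makes the modeling assumptions explicit (constant population of N males and N females, a random permutation pairing them each generation, one son and one daughter per pair); the details of Stan's own program are not recorded here, so treat this as an illustration, summing symmetric differences over past generations as in Mycielski's definition.

```python
import random

def simulate_history(N, T, seed=None):
    """T generations of random pairing: in each generation male i is
    paired with female perm[i], and pair i produces male i and female i
    of the next generation (two offspring, one of each sex)."""
    rng = random.Random(seed)
    return [rng.sample(range(N), N) for _ in range(T)]

def ancestor_sets(perms, individual):
    """Yield the ancestor set of `individual` (a (sex, index) pair in the
    final generation), one past generation at a time, most recent first."""
    current = {individual}
    for perm in reversed(perms):
        # child with index i (of either sex) was produced by pair i
        current = {parent for (_sex, i) in current
                   for parent in (('M', i), ('F', perm[i]))}
        yield current

def distance(perms, a, b):
    """Genealogical distance: symmetric differences of the two ancestor
    sets, summed over all past generations."""
    return sum(len(sa ^ sb) for sa, sb in
               zip(ancestor_sets(perms, a), ancestor_sets(perms, b)))

def average_distance(N=20, T=15, pairs=200, seed=0):
    """Average distance between randomly chosen distinct individuals."""
    rng = random.Random(seed)
    perms = simulate_history(N, T, seed=seed)
    total = 0
    for _ in range(pairs):
        a = (rng.choice('MF'), rng.randrange(N))
        b = a
        while b == a:
            b = (rng.choice('MF'), rng.randrange(N))
        total += distance(perms, a, b)
    return total / pairs
```

For N = 20 the Kahane-Marr value is 4(N − 1) = 76, and runs of this sketch land in that neighborhood.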

Stan was an accomplished Latinist, but he deferred to my own (rather slight) knowledge of Greek. "Tell me," he once asked, "why is it that people are always saying 'Eureka!' and never (what is the Greek?) 'I lost it!' After all, it's much more likely." I got the Greek for him, something like "Ὤλεσα!" It is by chance pronounced very like the modern French "Oh là là!," which can have a related meaning, "I am undone!" At any rate it did come back, and we had indeed lost it.

Not long after Stan retired from Los Alamos and came to Boulder, we touched on the subject again. If we trace the branching of ancestors back far enough, we of course share almost all of them. All men overwhelmingly are indeed almost-brothers; they are N-fold nth cousins. We jokingly speculated about the distance back to some common ancestor of Stan, a Polish Jew, and me, an Anglo-American. Very likely less than twenty generations. All this was play, rediscovering the obvious along a pathway paved with numbers. I also remember that Stan the set-theorist poked his head out here with a "little remark." In a population of fixed finite size and infinite duration, everyone could be assigned an integer as a proper name, but the set of all sets of ancestors, of the genealogical "names," would be uncountable. We also noted more mundanely that the backward count of ancestors would fit some logistic curve, going up exponentially at first and then flattening out.

When I read the above paper, I was reminded of all this and noticed that the random pairing process was after all quite tractable. ("Noticed" is one of Stan's words; for me it represented hours of fussing about.) If, among the 2N-member population that existed n generations ago, your mother had r female ancestors and your father had s, then the probability $q_k$ that they share k female ancestors in that generation is given by

$$q_k = \frac{\binom{r}{k}\binom{N-r}{s-k}}{\binom{N}{s}}. \qquad (1)$$

The sequence of probabilities defined by Eq. 1 is the hypergeometric distribution, a standard textbook entry. The applicability of this distribution to the problem at hand can perhaps be more easily seen by recasting it in textbook terms. Suppose we have an N-element set (the ancestral female population), r members of which are white (your mother's female ancestors) and the remainder of which are non-white (say black). Suppose we choose, randomly from among that set, an s-element set (your father's female ancestors, an identification allowed by the assumption of random pairing). The problem then is to find the probability $q_k$ that exactly k elements of the chosen set are white (shared ancestors). Now the k white elements can be selected for the s-element set in $\binom{r}{k}$ ways and the (s − k) black elements in $\binom{N-r}{s-k}$ ways. Since any choice of k white elements can be combined with any choice of black elements, the k white and (s − k) black elements can be selected in a total of $\binom{r}{k}\binom{N-r}{s-k}$ ways. To obtain $q_k$, this number is then divided by $\binom{N}{s}$, the total number of ways of choosing the s-element set. (I should add that if the words "mother" and "father" are interchanged, along with r and s, the answer is the same; though the resulting formula for $q_k$ will look different, it is not different in value.) Using Eq. 1, we can now deduce the probability $P_{n+1,t}$ that, (n + 1) generations back, you yourself have t = r + s − k female ancestors. Let $P_{n,r}$ and $P_{n,s}$ be the respective probabilities that, among that generation, your mother has r female ancestors and your father has s. Since various values of r, s, and k = r + s − t can yield a particular value of t, $P_{n+1,t}$ is a sum over those variables:
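Equation 1 is simple to check numerically. A small sketch using Python's exact binomial coefficients (the function name is mine); the checks confirm that the probabilities sum to one, that the mean number of shared ancestors is rs/N, and that the mother-father interchange mentioned above leaves the value unchanged:

```python
from math import comb

def q(N, r, s, k):
    """Eq. 1: probability that the mother's r and the father's s female
    ancestors, in an N-female generation, overlap in exactly k."""
    if k < 0 or k > min(r, s) or s - k > N - r:
        return 0.0
    return comb(r, k) * comb(N - r, s - k) / comb(N, s)
```

The mean rs/N is what turns Eq. 2 into the simple recurrence for the averages discussed below.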

$$P_{n+1,t} = \sum_{\substack{r,s,k \\ r+s-k=t}} P_{n,r}\,P_{n,s}\,q_k. \qquad (2)$$

The Laplace generating function for this sequence of probabilities, call it $f_{n+1}(x)$, is therefore given by

$$f_{n+1}(x) = \sum_{t} P_{n+1,t}\,e^{-tx} = \sum_{r,s,k} P_{n,r}\,P_{n,s}\,q_k\,e^{-(r+s-k)x}.$$

Equation 2 does not lend itself to derivation of an elegant recurrence relation between $f_{n+1}(x)$ and $f_n(x)$, but it does provide such a relation between $A_{n+1}$ and $A_n$, where $A_n$ is the expected, or average, number of female ancestors n generations back. This relation is

$$A_{n+1} = 2A_n - \frac{A_n^2}{N}, \qquad (3)$$

which, with the initial value $A_0 = 1$, has the solution

$$A_n = N\left[1 - \left(1 - \frac{1}{N}\right)^{2^n}\right]. \qquad (4)$$

(Interestingly enough, the right side of Eq. 4 is also the answer to a much simpler problem: If $2^n$ objects are distributed randomly among N boxes, what is the expected number of non-empty boxes?) If we identify the term $A_n^2/N$ as the average number of shared female ancestors n generations back, then Eq. 4 defines just the logistic curve Stan and I had seen to describe the expected loss of ancestry; the difference between $2A_n$ and $A_{n+1}$ (where $A_n$ is the average number of your female ancestors n generations back) is just $A_n^2/N$ (the average number of female ancestors among that generation shared by your parents).
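Both forms, and the boxes problem, can be checked against one another in a few lines (a numerical sketch of my own; Eq. 3 is iterated from $A_0 = 1$):

```python
import random

def A_closed(N, n):
    """Eq. 4: N * (1 - (1 - 1/N) ** 2**n)."""
    return N * (1.0 - (1.0 - 1.0 / N) ** (2 ** n))

def A_iterated(N, n):
    """Eq. 3, A_{m+1} = 2*A_m - A_m**2 / N, iterated from A_0 = 1."""
    a = 1.0
    for _ in range(n):
        a = 2.0 * a - a * a / N
    return a

def nonempty_boxes(N, objects, trials=2000, seed=1):
    """Monte Carlo answer to the 'much simpler problem': drop `objects`
    objects into N boxes at random and count the non-empty boxes."""
    rng = random.Random(seed)
    return sum(len({rng.randrange(N) for _ in range(objects)})
               for _ in range(trials)) / trials
```

The agreement of the Monte Carlo box count with Eq. 4 is just the coincidence remarked on in the parenthesis above.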

Now Mycielski's definition of the expected genealogical distance between two randomly chosen individuals of the same generation can be written $\sum_n 2(A_n - A_n^2/N)$. We can evaluate this distance by using Eqs. 3 and 4:

$$\sum_{n=0}^{\infty} 2\left(A_n - \frac{A_n^2}{N}\right) = \sum_{n=0}^{\infty} 2\left(A_{n+1} - A_n\right) = 2\left(\lim_{n\to\infty} A_n - A_0\right) = 2(N-1),$$

which, when doubled to include male ancestors, is just the result obtained by Kahane and Marr.
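The telescoping evaluation can be confirmed numerically; in this sketch the per-generation terms are accumulated via Eq. 3 from $A_0 = 1$, and the sum settles at 2(N − 1), which doubles to the Kahane-Marr value.

```python
def mycielski_total(N, terms=200):
    """Sum the expected per-generation distances 2*(A_n - A_n**2/N),
    iterating Eq. 3 from A_0 = 1; the sum telescopes to 2*(N - 1)."""
    a, total = 1.0, 0.0
    for _ in range(terms):
        total += 2.0 * (a - a * a / N)
        a = 2.0 * a - a * a / N
    return total
```

Since $A_n$ approaches N at a doubly exponential rate, a couple of hundred terms is far more than enough for the sum to converge.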
