E(X) = ∫ t dF(t) = ∫ t f(t) dt.

The expected value is one of the two commonly occurring averages in probability and statistics; the other is the variance of X, denoted by σ²(X) or var(X). The variance is defined as the expected value of the square of the deviation of X from its mean:

σ²(X) = var(X) = E((X − E(X))²) = E(X²) − (E(X))².

The standard, or root-mean-square, deviation of X is defined as σ(X) = √var(X). Figures 3 and 4 illustrate two distribution functions, the binomial distribution function for the number of heads obtained in five tosses of a fair coin and a normal distribution function with a positive mean.
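These definitions are easy to check numerically. The short Python sketch below (not part of the original article) computes the mean, variance, and standard deviation of the number of heads in five tosses of a fair coin, the binomial distribution of Fig. 3:

```python
from math import comb, sqrt

# Distribution of the number of heads X in five tosses of a fair coin:
# P(X = k) = C(5, k) / 2^5  (the binomial distribution of Fig. 3).
probs = {k: comb(5, k) / 2**5 for k in range(6)}

mean = sum(k * p for k, p in probs.items())                # E(X)
var = sum(k * k * p for k, p in probs.items()) - mean**2   # E(X^2) - (E(X))^2
sigma = sqrt(var)                                          # root-mean-square deviation

print(mean, var, sigma)  # 2.5 1.25 1.118...
```

The results agree with the binomial formulas E(X) = np = 2.5 and var(X) = np(1 − p) = 1.25.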

The Laws of Large Numbers

A historically important problem in probability theory and statistics asks for estimates on how a random variable deviates from its mean, or expected, value. A simple rough estimate is, of course, its root-mean-square deviation. An estimate of a different nature was obtained by the nineteenth-century mathematician Chebyshev. This estimate, known as Chebyshev's inequality, gives an upper limit on the probability that a random variable Y deviates from its mean E(Y) by an amount equal to or greater than a (a > 0):

Chebyshev's inequality: P(|Y − E(Y)| ≥ a) ≤ var(Y)/a².
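Chebyshev's inequality is usually far from tight, as a quick simulation shows. The following sketch (an illustration, not from the article; the choice of 100 tosses and a = 10 is arbitrary) compares the empirical deviation probability with the Chebyshev bound:

```python
import random

random.seed(1)

# Y = number of heads in 100 tosses of a fair coin, so E(Y) = 50 and
# var(Y) = 100 * (1/2) * (1/2) = 25.  Check Chebyshev's bound for a = 10.
a = 10.0
bound = 25.0 / a**2          # var(Y)/a^2 = 0.25

trials = 20_000
hits = 0
for _ in range(trials):
    y = sum(random.randint(0, 1) for _ in range(100))
    if abs(y - 50.0) >= a:
        hits += 1

empirical = hits / trials    # roughly 0.057, well under the bound of 0.25
print(empirical, "<=", bound)
```

The true probability is about 0.057, a factor of four smaller than the bound; Chebyshev trades sharpness for complete generality.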

This fundamental inequality will lead us to the famous laws of large numbers, which tell us about average values for infinite sequences of random variables. We begin by returning again to the coin loaded in such a way that p is the probability of a head in a single toss. If this coin is tossed a large number of times n, shouldn't the frequency of heads, NH /n, be approximately equal to p, at least in some sense?

This question can be answered on several levels. Let Xi be the random variable describing the outcome of the ith toss. Set Xi = 1 if the ith toss is a head and Xi = 0 if it is a tail. Then NH = X1 + · · · + Xn, so the frequency of heads is the average (X1 + · · · + Xn)/n.


Fig. 3. The distribution function F(t) for the number of heads obtained in n independent tosses of a fair coin is a binomial distribution, so called because the probability of obtaining k heads in n tosses of the coin is given by a formula involving binomial coefficients, namely C(n, k)(1/2)ⁿ, where C(n, k) = n!/(k!(n − k)!). Shown here is the binomial distribution function for the number of heads obtained in five tosses of the coin. The value of F(t) equals the probability that the number of heads is equal to or less than t.


Fig. 4. So many random variables can be described, at least approximately, by the distribution function shown in (a) that it is known as the normal distribution function. Examples of such random variables include the number of heads obtained in very many tosses of a coin and, as a general experimental fact, accidental errors of observation. The value of F(t) equals the probability that the standardized variable (X − m)/σ is equal to or less than t, where m is the mean, or expected, value of the random variable X and σ is its standard deviation. (The mean here is assumed to be positive.) Shown in (b) is the normal density function f(t) = dF(t)/dt; the probability that the value of the standardized variable lies in a small interval about t is approximately f(t) times the length of the interval.


E((X1 + · · · + Xn)/n) = (1/n) Σ E(Xi) = (1/n)(np) = p.

Thus, on the simplest level our guess is right: The frequency of heads, NH/n, is approximately equal to p in the sense that the expected value of NH/n is p. But surely, even in a very long series of tosses, it would be foolish to expect NH/n to exactly equal p (and NT/n to exactly equal 1 − p). What one is looking for is a statement that holds only in the limit as the number of tosses becomes infinite.

Bernoulli proved such a theorem: As n gets larger and larger, the probability that NH/n differs from its expected value p by more than a positive amount ε tends to 0:

lim(n→∞) P(|NH/n − p| ≥ ε) = 0,

where P is the probability measure on Ωn, the space of all sequences of H's and T's of length n. No matter what positive ε is chosen, the probability that the difference between the frequency of heads and p, the probability of a head in a single trial, exceeds ε can be made arbitrarily small by tossing the coin a sufficiently large number of times.

Let us see how Bernoulli's theorem follows from Chebyshev's inequality. First, notice that var(Xi) = p(1 − p) for all i. Second, the random variables X1, . . . , Xn are independent (the outcome of the ith toss has no influence on the outcome of the jth). Now, from the fact that E(XY) = E(X)E(Y) for independent random variables, we get

var(X1 + · · · + Xn) = Σ var(Xi) = np(1 − p).

Thus, for each ε > 0, Chebyshev's inequality applied to NH/n = (X1 + · · · + Xn)/n gives

P(|NH/n − p| ≥ ε) ≤ var(X1 + · · · + Xn)/n²ε² = np(1 − p)/n²ε² = p(1 − p)/ε²n.

So, letting n tend to infinity,

lim(n→∞) P(|NH/n − p| ≥ ε) = 0.
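The weak law is easy to watch in action. This Python sketch (an illustration added here, with an arbitrary bias p = 0.3 and tolerance ε = 0.05) estimates, for several values of n, the probability that the observed frequency of heads deviates from p by at least ε:

```python
import random

random.seed(0)
p, eps = 0.3, 0.05   # illustrative: a loaded coin and a tolerance epsilon

def deviation_prob(n, trials=2000):
    """Estimate P(|N_H/n - p| >= eps) by simulating 'trials' runs of n tosses."""
    bad = 0
    for _ in range(trials):
        heads = sum(random.random() < p for _ in range(n))
        if abs(heads / n - p) >= eps:
            bad += 1
    return bad / trials

# The probability of a deviation of at least eps shrinks as n grows,
# in line with the Chebyshev bound p(1 - p)/eps^2 n.
for n in (10, 100, 1000):
    print(n, deviation_prob(n))
```

For n = 10 a deviation of 0.05 is the rule rather than the exception; by n = 1000 it has become rare, just as Bernoulli's theorem predicts.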

Notice that the measure-theoretic background of Bernoulli's theorem is trivial (at least as far as coin-tossing is concerned), since the events of interest correspond to finite sets. That is, for each n we need only estimate how many sequences of length n there are such that the number of heads differs from np by more than εn. Nevertheless, the simple argument just given can be generalized to prove the famous weak law of large numbers.

Los Alamos Science Special Issue 1987

Borel asked a deeper question: Is it true that, with probability 1, the frequency of heads in an infinite sequence of tosses converges to p? In other words, what is the probability of the event

A = {w : lim(n→∞) Nn(w)/n = p},

where Nn(w) is the number of H's among the first n letters of the infinite sequence w? Borel proved that P(A) = 1. It takes some work just to show that A is an event in the sample space Ω. Unlike the question that led to the weak law of large numbers, this question required the full apparatus of modern probability theory. An extension of Borel's result by Kolmogorov is known as the strong law of large numbers.



Strong law of large numbers: Let X1, X2, X3, . . . be independent, identically distributed random variables such that E(|X1|) < ∞. Then

P(lim(n→∞) (X1 + · · · + Xn)/n = E(X1)) = 1.

An Application of the Strong Law of Large Numbers. Let us illustrate the power of the strong law of large numbers by using it to answer the following question: What is the probability that, in an infinite sequence of tosses of a fair coin, two heads occur in succession?

We will first answer this question using only the rules governing the probabilities of independent events. In particular, we will use the axioms of countable additivity and complementarity and the rule of multiplication of probabilities. Let Ak be the event that a head occurs on the (2k − 1)th toss and on the (2k)th toss. Each Ak is an event, and P(Ak) = 1/4. Now, by the axiom of countable additivity, ∪k Ak is an event; in particular, it is the event that, for some k, heads occur on the (2k − 1)th and (2k)th tosses. By the axiom of complementarity,

P(∪k Ak) = 1 − P(∩k (Ω − Ak)).

Since the events A1, A2, A3, . . . are independent, the events Ω − A1, Ω − A2, Ω − A3, . . . are also independent, and we can apply the rule of multiplication of probabilities:

P(∩k (Ω − Ak)) = lim(n→∞) P(∩k≤n (Ω − Ak)) = lim(n→∞) (3/4)ⁿ = 0.

Finally, by the axiom of complementarity, P(∪k Ak) = 1; that is, there exists, with probability 1, some k such that the (2k − 1)th and (2k)th tosses are heads.

Now we will answer the same question by using the strong law of large numbers. Let Xk be the random variable such that

Xk = 1 if heads occur on the (2k − 1)th and (2k)th tosses, and Xk = 0 otherwise.

Then X1, X2, X3, . . . is a sequence of independent random variables. Also, they all have the same distribution: P(Xk = 1) = 1/4, P(Xk = 0) = 3/4, and E(Xk) = 1/4. Therefore, according to the strong law of large numbers,

lim(n→∞) (X1 + · · · + Xn)/n = 1/4 with probability 1.


This result is stronger than that obtained above. It guarantees, with probability 1, the existence of infinitely many k's such that heads occur on the (2k - 1)th and (2k)th tosses; further, the set of all such k's has an arithmetic density of 1/4.
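A finite simulation cannot prove a statement about infinite sequences, but it can make the density of 1/4 visible. Here is a short sketch (an illustration added here, with an arbitrary sample of 200,000 pairs of tosses):

```python
import random

random.seed(2)

# X_k = 1 if the (2k-1)th and (2k)th tosses of a fair coin are both heads.
n_pairs = 200_000
total = 0
for _ in range(n_pairs):
    first, second = random.randint(0, 1), random.randint(0, 1)
    total += (first == 1 and second == 1)

density = total / n_pairs
print(density)  # close to E(X_k) = 1/4
```

The running average (X1 + · · · + Xn)/n settles near 1/4, the arithmetic density guaranteed by the strong law.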


Borel's theorem marked the beginning of the development of modern probability theory, and Kolmogorov's extension to the strong law of large numbers greatly expanded its applicability. To quote Kac and Ulam:

"Like all great discoveries in mathematics the strong law of large numbers has been greatly generalized and extended; in the process it gave rise to new problems, and it stimulated the search for new methods. It was the first serious venture outside the circle of problems inherited from Laplace, a venture made possible only by developments in measure theory. These in turn were made possible only because of polarization of mathematical thinking along the lines of set theory."

The polarization Kac and Ulam were referring to concerns the great debate at the turn of the century about whether the infinite in mathematics should be based upon Cantor's set theory and its concomitant logical difficulties. The logical problems have been met, and today we use Cantor's theory with ease.

The Monte Carlo Method. One of Stan Ulam's great ideas, which was first developed and implemented by von Neumann and Metropolis, was the famous Monte Carlo method. It can be illustrated with Chebyshev's inequality. Suppose that we need to quickly get a rough estimate of ∫₁^∞ ((sin x)/x³) dx. Setting t = 1/x, the problem then is to estimate ∫₀¹ t sin(1/t) dt. Let y1, . . . , yn be independent random variables, each uniformly distributed on [0,1]. That is, for all i, P(a < yi < b) = b − a, where (a, b) is a subinterval of [0,1]. Now set f(t) = t sin(1/t) and for each i let Xi = f(yi). Then

E(Xi) = ∫₀¹ t sin(1/t) dt  and  var(Xi) ≤ E(Xi²) = ∫₀¹ t² sin²(1/t) dt ≤ ∫₀¹ t² dt = 1/3.

So, by Chebyshev's inequality,

P(|(1/n) Σi Xi − ∫₀¹ t sin(1/t) dt| ≥ a) ≤ var(X1)/na².

Taking a = 0.005 and n = 134,000 makes the right-hand side at most (1/3)/((134,000)(0.005)²) < 0.1, so that

P(|(1/n) Σi Xi − ∫₀¹ t sin(1/t) dt| > 0.005) < 0.1.

In other words, if we choose 134,000 numbers y1, . . . , y134,000 independently and at random from [0,1], then we are 90 percent certain that (1/134,000) Σi yi sin(1/yi) differs from the integral by no more than 0.005. So, if we can statistically sample the unit interval with numbers y1, . . . , yn, then

(1/n) Σi yi sin(1/yi) ≈ ∫₀¹ t sin(1/t) dt.

(The reader may well wonder why such a large number of sample points is required to be only 90 percent certain of the value of the integral to within only two decimal places. The answer lies in the use of Chebyshev's inequality. By using instead the stronger central limit theorem, which will be introduced below, many fewer sample points are needed to yield a similar estimate.)
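The whole scheme fits in a few lines of modern code. The sketch below (added here as an illustration; the sample sizes are arbitrary) carries out the Monte Carlo estimate of ∫₀¹ t sin(1/t) dt:

```python
import random
from math import sin

random.seed(3)

def mc_estimate(n):
    """Monte Carlo estimate of the integral of t*sin(1/t) over [0, 1]."""
    total = 0.0
    for _ in range(n):
        y = 1.0 - random.random()   # uniform on (0, 1]; avoids division by zero
        total += y * sin(1.0 / y)
    return total / n

# Successively larger samples give successively steadier estimates.
print(mc_estimate(10_000), mc_estimate(1_000_000))
```

Note the guard against y = 0: the integrand t sin(1/t) is bounded but its formula divides by t, so the sample is drawn from the half-open interval (0, 1], which has the same uniform distribution.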

The Monte Carlo method is a wonderful idea and, of course, tailor-made for computers. Although it might be regarded simply as an aspect of the more ancient statistical sampling technique, it had many exciting new aspects. Three of these are (1) a scope of application that includes large-scale processes, such as neutron chain reactions; (2) the capability of being completely implemented on a digital computer; and (3) the idea of generating random numbers and random variables. How do we mechanically produce numbers y1, . . . , yn in [0,1] such that the yi's are independent and identically distributed? The answer is we don't. Instead, so-called pseudo-random numbers are generated. Many fascinating problems surfaced with the advent of Monte Carlo. Dealing with them is one of the major accomplishments of the group of intellects gathered at Los Alamos in the forties and the fifties. (See "The Beginning of the Monte Carlo Method.")
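A classic way to generate pseudo-random numbers is the linear congruential generator. The sketch below is an illustration added here, using the well-known "minimal standard" constants of Park and Miller rather than any generator described in the article:

```python
# A minimal linear congruential generator: x_{k+1} = (a * x_k) mod m,
# with the Park-Miller "minimal standard" constants m = 2^31 - 1, a = 16807.
M, A = 2**31 - 1, 16807

def lcg(seed, n):
    """Return n pseudo-random numbers in (0, 1) from a deterministic recurrence."""
    x = seed
    out = []
    for _ in range(n):
        x = (A * x) % M
        out.append(x / M)   # scale the integer state into (0, 1)
    return out

# Same seed, same sequence: the numbers only look random,
# which is exactly what "pseudo-random" means.
print(lcg(seed=12345, n=5))
```

The determinism is the point: the sequence is perfectly reproducible, yet for many statistical purposes it behaves like independent uniform samples.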
