Dictionary of the History of Ideas: Studies of Selected Pivotal Ideas


#### PROBABILITY: OBJECTIVE THEORY

*I. THE BEGINNING*

*1.* Games and gambling are as old as human history. It seems that gambling, a specialty of the human species, was spread among virtually all human groups. The Rig Veda, one of the oldest known poems, mentions gambling; the Germans of Tacitus' times gambled heavily, so did the Romans, and so on. All through history man seems to have been attracted by uncertainty. We can still observe today that as soon as an "infallible system" of betting is found, the game will be abandoned or changed to beat the system.

While playing around with chance happenings is very old, attempts towards any systematic investigation were slow in coming. Though this may be how most disciplines develop, there appears to have been a particular resistance to the systematic investigation of chance phenomena, which by their very nature seem opposed to regularity, whereas regularity was generally considered a necessary condition for the scientific understanding of any subject.

The Greek conception of science was modelled after the ideal of Euclidean geometry, which is supposedly derived from a few immediately grasped axioms. It seems that this rationalistic conception limited philosophers and mathematicians well beyond the Middle Ages. Friedrich Schiller, in a poem of 1795, says of the "sage": *Sucht das vertraute Gesetz in des Zufalls grausenden Wundern / Sucht den ruhenden Pol in der Erscheinungen Flucht* ("Seeks the familiar law in the dreaded wonders of chance / Looks for the unmoving pole in the flux of appearances").

*2.* However, the hardened gambler, not influenced by philosophical scruples, could not fail to notice some sort of long-run regularity in the midst of apparent irregularity. The use of loaded dice confirms this.

The first "theoretical" work on games of chance is by Girolamo Cardano (Cardanus), the gambling scholar: *De ludo aleae* (written probably around 1560 but not published until 1663). Todhunter describes it as a kind of "gambler's manual." Cardano speaks of chance in terms of the frequency of an event. His mathematics was influenced by Luca Pacioli.

A contribution by the great Galileo was likewise stimulated directly by gambling. A friend—probably the duke of Ferrara—consulted Galileo on the following problem. The sums 9 and 10 can each be produced by three dice through six different combinations, namely:

9 = 1 + 2 + 6 = 1 + 3 + 5 = 1 + 4 + 4 = 2 + 2 + 5 = 2 + 3 + 4 = 3 + 3 + 3,

10 = 1 + 3 + 6 = 1 + 4 + 5 = 2 + 2 + 6 = 2 + 3 + 5 = 2 + 4 + 4 = 3 + 3 + 4,

and yet the sum 10 appears more often than the sum 9. Galileo pointed out that in the above enumeration, for the sum 9, the first, second, and fifth combinations can each appear in 6 ways, the third and fourth in 3 ways, and the last in 1 way; hence, there are altogether 25 ways out of 216, compared to 27 for the sum 10. It is interesting that the "friend" was able to detect empirically a difference of 1/108 in the frequencies.
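Galileo's counting argument can be checked by brute-force enumeration of the 216 equally likely ordered outcomes (a modern sketch, not part of the original text):

```python
from itertools import product

# Enumerate all 6^3 = 216 equally likely ordered outcomes of three dice
# and count how many produce each sum.
counts = {}
for dice in product(range(1, 7), repeat=3):
    s = sum(dice)
    counts[s] = counts.get(s, 0) + 1

# counts[9] is 25 and counts[10] is 27, as Galileo argued.
```

The surplus of 27 over 25 ordered outcomes, out of 216, is exactly the difference of 2/216 = 1/108 mentioned above.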

*3.* Of the same type is the well-known question of the Chevalier de Méré, a gambler. It was usual among gamblers to bet even money that among 4 throws of a true die the "6" would appear at least once. De Méré concluded that the same even chance should prevail for the appearance of the "double 6" in 24 throws (since 6 times 6 is 36 and 4 times 6 is 24).

*Un problème relatif aux jeux de hasard, proposé à un austère Janséniste par un homme du monde a été l'origine du calcul des probabilités* ("A problem in games of chance, proposed to an austere Jansenist by a man of the world, was the origin of the calculus of probability"), writes S. D. Poisson in his *Recherches sur la probabilité des jugements*... (Paris, 1837). The Chevalier's experiences with the second type of bet compared unfavorably with those in the first case. Putting the problem to Blaise Pascal he accused arithmetic of unreliability. Pascal writes on this subject to his friend Pierre de Fermat (29 July 1654):

*Voilà quel était son grand scandale que lui faisait dire hautement que les propositions [proportions (?)] n'étaient pas constantes et que l'arithmétique se démentait* ("This was for him a great scandal which made him say loudly that the propositions [proportions (?)] are not constant and that arithmetic is self-contradictory").

Clearly, this problem is of the same type as that of Galileo's friend. Again, the remarkable feature is the gambler's accurate observation of the frequencies. Pascal's computation might have run as follows. There are 6^4 = 1296 different combinations of six signs *a, b, c, d, e, f* in groups of four. Of these, 5^4 = 625 contain no "*a*" (no "6") and, therefore, 1296 − 625 = 671 contain at least one "*a*," and 671/1296 = 0.518 = *p*1 is the probability for the first bet. A similar computation gives for the second bet *p*2 = 0.491, indeed smaller than *p*1.
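Pascal's two figures can be reproduced with exact rational arithmetic (a modern illustration; the variable names are ours):

```python
from fractions import Fraction

# Chance of at least one "6" in 4 throws of one die: 1 - (5/6)^4
p1 = 1 - Fraction(5, 6) ** 4          # = 671/1296, about 0.518
# Chance of at least one "double 6" in 24 throws of two dice: 1 - (35/36)^24
p2 = 1 - Fraction(35, 36) ** 24       # about 0.491
```

The exact values confirm de Méré's observation: p1 is slightly above an even chance, p2 slightly below it.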

Both Fermat and Pascal, just as Galileo had before them, found it natural to base their reasoning on observed frequencies. They were interested in the answers to actual problems and created the simplest "theory" which was logically sound and explained the observations.

*4.* Particularly instructive is another problem extensively discussed in the famous correspondence between the two eminent mathematicians, the *problème des parties* ("problem of points"), which relates to the question of the just division of the stake between players if they decide to quit at a moment when neither has definitely won. Take a simple case. Two players, A and B, quit at a moment when A needs two points and B three points to win. Then, reasons Pascal, the game will certainly be decided in the course of four more "trials." He writes down explicitly the combinations which lead to the winning of A, namely *aaaa, aaab, aabb.* Here, *aaab* stands for four different arrangements, namely *aaab, aaba,*... and similarly *aabb* stands for six different arrangements. Hence, 1 + 4 + 6 = 11 arrangements out of 16 lead to the winning of A and 5 to that of B. The stake should, therefore, be divided in the ratio 11:5. (It is worthwhile mentioning that mathematicians like Roberval and d'Alembert doubted Pascal's solution.)
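Pascal's division of the stakes can be verified by enumerating the 2^4 equally likely continuations of the interrupted game (a modern sketch):

```python
from itertools import product

# A needs 2 more points, B needs 3; the game is decided in at most
# 2 + 3 - 1 = 4 further trials.  Enumerate all 2^4 = 16 continuations:
# A wins whenever at least 2 of the 4 trials go to "a".
a_wins = sum(1 for seq in product("ab", repeat=4) if seq.count("a") >= 2)
b_wins = 2 ** 4 - a_wins
# a_wins is 11 and b_wins is 5, giving Pascal's ratio 11:5.
```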

The same results were obtained in a slightly different way by Fermat. The two greatest mathematicians of their time, Pascal and Fermat, exchanged their discoveries in undisturbed harmony. In the long letter quoted above, Pascal wrote to Fermat: *Je ne doute plus maintenant que je suis dans la vérité après le rencontre admirable où je me trouve avec vous.... Je vois bien que la vérité est la même à Toulouse et à Paris* ("I do not doubt any longer that I have the truth after finding ourselves in such admirable agreement.... I see that truth is the same in Toulouse and in Paris"). In connection with such questions Pascal and Fermat studied combinations and permutations (Pascal's *Traité du triangle arithmétique,* 1664) and applied them to various problems.

*5.* We venture a few remarks regarding the ideas on probability of the great philosophers of the seventeenth century. "Probability is likeness to be true," says Locke. "The grounds of it are in short, these two following. First, the conformity of anything with our knowledge, observation, and experience. Secondly, the testimony of others" (*Essay concerning Human Understanding,* Book IV). This is the empirical viewpoint, a viewpoint suggested by the observation of gambling results as well as of deaths, births, and other social happenings. "But," continues Keynes, "in the meantime the subject had fallen in the hands of the mathematicians and an entirely new method of approach was in course of development. It had become obvious that many of the judgments of probability, which we, in fact, make do not depend upon past experience in a way which satisfied the canon laid down by the logicians of Port Royal and by Locke" (*La logique ou l'art de penser*..., by A. Arnauld, Pierre Nicole, and others, 1662, called the "Port Royal Logic"). As we have seen, in order to explain observations, the mathematicians created a theory *based on the counting of combinations.* The decisive assumption was that the observed frequency of an event (e.g., of the "9" in Galileo's problem) be proportional to the corresponding relative number of combinations (there, 25/216).

*6.* We close our description of the first steps in probability calculus with one more really great name, though his fame was not due to his contributions to our subject: Christian Huygens. Huygens heard of the new problems but at first had difficulty in obtaining reliable information about them and the methods of the two French mathematicians. Eventually, Carcavi sent him the data as well as Fermat's solution. Fermat even posed to Huygens further problems which Huygens worked out and later included as exercises in a work of his own. In this work, *De ratiociniis in ludo aleae* ("On reasoning in games of chance") of 1657, he organized all he knew about the new subject. At the end of the work he included some questions without indicating the method of solution. "It seems useful to me to leave something for my readers to think about (if I have any readers) and this will serve them both as exercises and as a way of passing the time." Jakob (James) Bernoulli gave the solutions and included them in his *Ars conjectandi.* The work of Huygens remained for half a century *the* introduction to the "Calculus of Probability."

*7.* A related type of investigation concerned mortality and annuities. John Graunt started using the registers of deaths kept in London since 1592, and particularly during the years of the great plague. He used his material to make forecasts on population trends (*Natural and Political Observations... upon the Bills of Mortality,* 1661). He may well be considered as one of the first statisticians.

John de Witt, grand pensioner of Holland, wrote on similar questions in 1671 but the precise content of his work is not known. Leibniz was supposed to have owned a copy and he was repeatedly asked by Jakob Bernoulli—but without success—to let him see it.

The year 1693 is the date of a remarkable work by the astronomer Edmond Halley which deals with life statistics. Halley noticed also the regularity of the "boys' rate" (percentage of male births) and other constancies. He constructed a mortality table, based on "Bills of Mortality" for the city of Breslau, and a table of the values of an annuity for every fifth year of age up to the seventieth.

The application of "chance" in such different domains as games of chance (which received dignity through the names of Pascal, Fermat, and Huygens) and mortality impressed the scientific world. Leibniz himself appreciated the importance of the new science (as seen in his correspondence with Jakob Bernoulli). However, he did not contribute to it and he objected to some of his correspondent's ideas.

*II. JAKOB BERNOULLI AND THE
LAW OF LARGE NUMBERS*

*1.* The theory of probability consists, on the one hand, of the consideration and formulation of problems, including techniques for solving them, and on the other hand, of general theorems. It is the latter kind which is of primary interest to the historian of thought. The intriguing aspect of some of these theorems is that starting with probabilistic assumptions we arrive at statements of practical certainty. Jakob Bernoulli was the first to derive such a theorem and it will be worthwhile to sketch the main lines of argument, using, however, modern terminology in the interest of expediency.

*2.* We consider a binary alternative (coin tossing; "ace" or "non-ace" with a die; etc.) to this day called a Bernoulli trial. If *q* is the "probability of success," *p* = 1 − *q* that of "failure," then the probability of *a* successes followed by *b* failures in *a* + *b* trials performed with the same die is *q^a p^b.* This result follows from multiplication laws of independent probabilities already found and applied by Pascal and Fermat. The use of laws of addition and multiplication of probabilities is a step beyond the mere counting of combinations. It is based on the realization that a calculus exists which parallels and reflects the observed relations between frequencies.

The above probability *q^a p^b* holds for any pattern of *a* successes and *b* failures: fssfffsf.... Lumping together all of these, writing *x* for *a* and *a* + *b* = *n,* we see that *the probability p_n(x) of x successes and n − x failures regardless of pattern* is

*p_n*(*x*) = C(*n*, *x*) *q^x p^(n−x)*,  *x* = 0, 1, 2, …, *n*,  (II.1)

where C(*n*, *x*) is the number of combinations of *n* things in groups of *x,* and the sum of all *p_n*(*x*) is 1.
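Eq. (II.1) can be sketched in a few lines of modern code (the function name `p_n` is ours):

```python
from math import comb

def p_n(n, x, q):
    """Probability of x successes in n Bernoulli trials, Eq. (II.1)."""
    p = 1 - q
    return comb(n, x) * q**x * p**(n - x)

# The p_n(x) sum to 1 over x = 0, 1, ..., n, as stated in the text.
total = sum(p_n(10, x, 1/6) for x in range(11))
```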

Often we are more interested in the relative number *z* = *x*/*n,* the *frequency* of successes. Then

*p_n*(*x*) = *p′_n*(*z*) = C(*n*, *nz*) *q^(nz) p^(n(1−z))*.

This *p′_n*(*z*)—that is, the function that gives to every abscissa *z* the ordinate *p′_n*(*z*)—has a maximum at a point *z_m,* called the *mode,* and *z_m* is equal to or very close to *q.* In the vicinity of *z_m* the *p′_n*(*z*), as a function of *n,* becomes steeper as *n* increases.

*3.* It was Bernoulli's first great idea to consider increasing values of *n* and a narrow neighborhood of *q* or, in other words, to investigate the behavior of *p′_n*(*z*) in the neighborhood of *z* = *q* as *n* increases; this he did at a time when the interest in the "very large" and the "very small" was just awakening. Secondly, he realized that we are not really interested in the value of *p′_n*(*z*) for any particular value *z* but rather in the total probability belonging to all *z*'s in an interval. This interval was to contain *q* which, as we remember, is our original success probability and at the same time (for large *n*) the mode of *p′_n*(*z*) and likewise its so-called "mean value."

Now, with ε a very small number, we call *P_n* the probability that *z* lies between *q* − ε and *q* + ε, or, what is the same, that *x* = *nz* lie between *nq* − *n*ε and *nq* + *n*ε. For this *P_n* one obtains easily the estimate

*P_n* ≥ 1 − *qp*/(*n*ε²).  (II.2)

And from this follows immediately the fundamental property of *P_n*:

*P_n* → 1 as *n* → ∞.

This result can be expressed in words:

*Let q be a given success probability in a single trial; n trials are performed with the same q and under conditions of independence. Then, no matter how small an ε is chosen, as the number n of repetitions increases indefinitely, the probability P_n that the frequency of success lie between q − ε and q + ε approaches 1.* (See *Ars conjectandi,* Basel [1713], Part IV, pp. 236-37.)
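The approach of *P_n* to 1 can be observed numerically by summing the middle terms of (II.1) for growing *n* (a modern sketch; the choice *q* = 1/2, ε = 0.05 is ours):

```python
from math import comb

def P_n(n, q, eps):
    """Probability that the success frequency x/n lies within eps of q."""
    p = 1 - q
    return sum(comb(n, x) * q**x * p**(n - x)
               for x in range(n + 1) if abs(x / n - q) < eps)

# With q = 1/2 and eps = 0.05 the values climb toward 1 as n grows.
values = [P_n(n, 0.5, 0.05) for n in (10, 100, 1000)]
```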

The above theorem expresses a property of "condensation," namely that with increasing *n* an increasing proportion of the total probability (which equals 1) is concentrated in a fixed neighborhood of the original *q.* The term "probability" as used by Bernoulli in his computations is always a *ratio* of the number of cases favorable to an occurrence to the number of all possible cases. About this great theorem, called today the "Bernoulli Theorem," Bernoulli said: "... I had considered it closely for a period of twenty years, and it is a problem the novelty of which, as well as its high utility together with its difficulty adds importance and weight to all other parts of my doctrine" (ibid.). The three other parts of the work are likewise very valuable (but perhaps less from a conceptual point of view). The second presents the doctrine of combinations. (In this part Bernoulli also introduces the polynomials which carry his name.)

*4.* It will be no surprise to the historian of thought that the admiration we pay to Bernoulli, the mathematician, is not based on his handling of the conceptual situation. In addition to the above-explained use of a quotient for a mathematical probability his views are of the most varied kind, and, obviously, he is not conscious of any possible contradiction: "Probability calculus is a general logic of the uncertain.... Probability is a degree of certainty and differs from certainty as the part from the whole.... Of two things the one which owns the greater part of certainty will be the more probable.... We denote as *ars conjectandi* the art of measuring (*metiendi*) the probability of things as precisely as possible.... We estimate the probabilities according to the number and the weight (*vis probandi*) of the reasons for the occurrence of a thing." As to this certitude of which probability is a part he explains that "the certitude of any thing can be considered *objectively* and in this sense it relates to the actual (present, past, or future) existence of the thing ... or *subjectively* with respect to ourselves and in this sense it depends on the amount of our knowledge regarding the thing," and so on. This vagueness is in contrast to the modern viewpoint in which, however, conceptual precision is bought, sometimes too easily, by completely rejecting uncongenial interpretations.

*5.* There appears in Bernoulli's work another conceptual issue which deals with the dichotomy between the so-called *direct* and *inverse* problem. The first one is the type considered above: we know the probability *q* and make "predictions" about future observations. In the *inverse* problem we tend to establish from an observed series of results the parameters of the underlying process, e.g., to establish the imperfection of a die. (The procedures directed at the inverse problem are today usually handled in mathematical statistics rather than in probability theory proper.) Bernoulli himself states that his theorem fails to give results in very important cases: in the study of games of skill, in the various problems of life-statistics, in problems connected with the weather—problems where results "depend on unknown causes which are interconnected in unknown ways."

It is a measure of Bernoulli's insight that he not only recognized the importance of the inverse problem but definitely planned (ibid., p. 226) to establish for this problem a theorem similar to the one we formulated above. This he did not achieve. It is possible that he hoped to give a proof of the inverse theorem and that death intercepted him (Bernoulli's *Ars conjectandi* was unfinished at the time of his death and was published only in 1713); or that he was discouraged by critical remarks of Leibniz regarding inference. It may also be that he did not distinguish with sufficient clarity between the two types of problems. For most of his contemporaries such a distinction did not exist at all; actually, even an appropriate terminology was lacking. We owe the first solid progress concerning the inverse problem to Thomas Bayes. (See Section IV.)

The Bernoulli theorem forms today the very simplest case of the Laws of Large Numbers (see e.g., R. von Mises [1964], Ch. IV). The names Poisson, Tchebychev, Markov, Khintchine, and von Mises should be mentioned in this connection. These theorems are also called "weak" laws of large numbers in contrast to the more recently established "strong" laws of large numbers (due to Borel, Cantelli, Hausdorff, Khintchine, and others); the strong laws are mainly of mathematical interest.

*III. ABRAHAM DE MOIVRE AND THE
CENTRAL LIMIT THEOREM*

*1.* Shortly after the death of Jakob Bernoulli, but before the publication (1713) of his posthumous work, books of two important mathematicians, P. R. Montmort (1678-1719) and A. de Moivre (1667-1754), appeared. These were Montmort's *Essai d'analyse sur les jeux de hasard* (1708 and 1713) and de Moivre's *De mensura sortis*... (1711) and the *Doctrine of Chances* (1718 and 1738). We limit ourselves to a few words on the important work of de Moivre.

De Moivre, the first of the great analytic probabilists, was, as a mathematician, superior to both Jakob Bernoulli and Montmort. In addition he had the advantage of being able to use the ideas of Bernoulli and the algebraic powers of Montmort, which he himself then developed to an even higher degree. A charming quotation, taken from the *Doctrine of Chances,* might be particularly appreciated by the secretary. "For those of my readers versed in ordinary arithmetic it would not be difficult to make themselves masters, not only of the practical rules in this book but also of more useful discoveries, if they would take the small pains of being acquainted with the bare notation of algebra, which might be done in the hundredth part of the time that is spent in learning to read shorthand."

*2.* In probability proper de Moivre did basic work on the "duration of a game," on "the gambler's ruin," and on other subjects still studied today. Of particular importance is his extension of Bernoulli's theorem, which is really much more than an extension. In Section II, 3 we called *P_n* the sum of the 2*r* + 1 middle terms of *p_n*(*x*) where *r* = *n*ε and *p_n*(*x*) is given in Eq.(II.1). In Eq.(II.2) we gave a very simple estimate of *P_n*. (Bernoulli himself had given a sharper one but it took him ten printed pages of computation, and to obtain the desired result the estimate Eq.(II.2) suffices.)

De Moivre, who had a deep admiration for Bernoulli and his theorem, conceived the very fruitful idea *of evaluating P_n directly for large values of n,* instead of estimating it by an inequality. For this purpose one needs an *approximation formula for the factorials of large numbers.* De Moivre derived such a formula, which coincides essentially with the famous *Stirling formula.* He then determined *P_n* "by the artifice of mechanical quadrature." He computed particular values of his asymptotic formula for *P_n* correct to five decimals. We shall return to these results in the section on Laplace. Under the name of the *de Moivre-Laplace formula,* the result, most important by itself, became the starting point of intensive investigations and far-reaching generalizations which led to what is called today the central limit theorem of probability calculus (Section VIII). I. Todhunter, whose work *A History of the Mathematical Theory of Probability*... (1865) ends, however, with Laplace, says regarding de Moivre: "It will not be doubted that the theory of probability owes more to him than to any other mathematician with the sole exception of Laplace." Our discussion of the work of this great mathematician is comparatively brief since his contributions were more on the mathematical than on the conceptual side. We mention, however, one more instance whose conceptual importance is obvious: de Moivre seems to have been the first to denote a probability by one single letter (like *p* or *q,* etc.) rather than as a quotient of two integers.
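The approximation formula for factorials can be illustrated numerically (a modern sketch; de Moivre obtained the form up to a constant which Stirling identified as the square root of 2π):

```python
from math import e, factorial, pi, sqrt

def stirling(n):
    """De Moivre-Stirling approximation: n! ~ sqrt(2*pi*n) * (n/e)**n."""
    return sqrt(2 * pi * n) * (n / e) ** n

# The ratio n!/stirling(n) approaches 1; the relative error shrinks
# roughly like 1/(12n).
ratios = [factorial(n) / stirling(n) for n in (5, 20, 100)]
```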

*IV. THOMAS BAYES AND
INVERSE PROBABILITY*

*1.* Bayes (1707-61) wrote two basic memoirs, both published posthumously, in 1763 and 1765, in Vols. 53 and 54 of the *Philosophical Transactions of the Royal Society of London.* The title of the first one is: "An Essay Towards Solving a Problem in the Doctrine of Chances" (1763). A facsimile of both papers (and of some other relevant material) was issued in 1940 in Washington, edited by W. E. Deming and E. C. Molina. The following is from Molina's comments: "In order to visualize the year 1763 in which the essay was published let us recall some history.... Euler, then 56 years of age, was sojourning in Berlin under the patronage of Frederick the Great, to be followed shortly by Lagrange, then 27; the Marquis de Condorcet, philosopher and mathematician who later applied Bayes's theorem to problems of testimony, was but 20 years old.... Laplace, a mere boy of 14, had still 11 years in which to prepare for his *Mémoires* of 1774, embodying his first ideas on the "probability of causes," and had but one year short of half a century to bring out the first edition of the *Théorie analytique des probabilités* (1812) wherein Bayes's theorem blossomed forth in its most general form." (See, however, the end of this section.)

*2.* We explain first the concept of *conditional probability* introduced by Bayes. Suppose that of a certain group of people 90% = *P*(*A*) own an automobile and 9% = *P*(*A,B*) own an automobile and a bicycle. We call *P*(*B*|*A*) the conditional probability of owning a bicycle for people who are known to own also a car. If *P*(*A*) ≠ 0, then

*P*(*B*|*A*) = *P*(*A,B*) / *P*(*A*)  (IV.1)

is *by definition the conditional probability of B given A.* (This will be explained further in Section VII, 9.) In our example

*P*(*B*|*A*) = (9/100) / (90/100) = 1/10;

hence, *P*(*B*|*A*) = 1/10. We may write (IV.1) as

*P*(*A,B*) = *P*(*A*) · *P*(*B*|*A*).

The *compound probability* of owning both a car and a bicycle equals the probability of owning a car times the conditional probability of owning a bicycle, given that the person owns a car. Of course, the set *AB* is a subset of the set *A.*
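Definition (IV.1) and the product rule can be sketched directly on the car-and-bicycle example (a modern illustration; the function name is ours):

```python
def conditional(p_ab, p_a):
    """P(B|A) = P(A,B) / P(A), Eq. (IV.1); requires P(A) != 0."""
    if p_a == 0:
        raise ValueError("P(A) must be nonzero")
    return p_ab / p_a

# P(A) = 0.90 own a car, P(A,B) = 0.09 own both.
p_b_given_a = conditional(0.09, 0.90)   # about 1/10
# Product rule: P(A,B) = P(A) * P(B|A)
compound = 0.90 * p_b_given_a
```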

*3.* We try now to formulate some kind of inverse to a Bernoulli problem. (The remainder of this section may not be easy for a reader not schooled in mathematical thinking. A few rather subtle distinctions will be needed; however, the following sections will again be easier.) Some game is played *n* times and *n*1 "successes" (e.g., *n*1 "aces" in *n* tossings of a die) are observed. We consider now as known the numbers *n* and *n*1 (more generally, the statistical result) and would like to *make some inference* regarding the unknown success-chance of "ace." It is quite clear that if we know nothing but *n* and *n*1 and if these numbers are small, e.g., *n* = 10, *n*1 = 7, we cannot make any inference. Denote by *w_n*(*x,n*1) the compound probability that the die has ace-probability *x* and gave *n*1 successes out of *n.* Then the conditional probability of *x,* given *n*1, which we call *q_n*(*x*|*n*1), equals by (IV.1):

*q_n*(*x*|*n*1) = *w_n*(*x,n*1) / ∫₀¹ *w_n*(*x,n*1) *dx*.  (IV.3)

Here, *x* is taken as a continuous variable, i.e., it can take any value between 0 and 1. The ∫₀¹ *w_n*(*x,n*1) *dx* is our *P*(*A*). It is to be replaced by ∑ₓ *w_n*(*x,n*1) if *x* is a discrete variable which can, e.g., take on only one of the 13 values 0, 1/12, 2/12, ..., 11/12, 1.

Let us analyze *w_n*(*x,n*1). With the notation of Section II, 1 we obtain

*p_n*(*n*1|*x*) = C(*n*, *n*1) *x^(n1)* (1 − *x*)*^(n−n1)*,

the conditional probability of *n*1, given that the success chance (e.g., the chance of ace) has the value *x.* Therefore,

*w_n*(*x,n*1) = *v*(*x*) *p_n*(*n*1|*x*).  (IV.4)

Here *v*(*x*) is the *prior* probability or prior chance, the chance—prior to the present statistical investigation—that the ace-probability has the value *x.* Substituting (IV.4) into (IV.3) we have

*q_n*(*x*|*n*1) = *v*(*x*) *p_n*(*n*1|*x*) / ∫₀¹ *v*(*x*) *p_n*(*n*1|*x*) *dx*,  (IV.5)

where, dependent on the problem, the integral in the denominator may be replaced by a sum. This is Bayes's "inversion formula." If we know *v*(*x*) and *p_n*(*n*1|*x*) we can compute *q_n*(*x*|*n*1). Clearly, we have to have some knowledge of *v*(*x*) in order to evaluate Eq.(IV.5). We note also that the problem must be such that *x* is a *random variable,* i.e., *that the assumption of many possible x's which are distributed in a probability distribution* makes sense (compare end of Section IV, 6, below).
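Bayes's inversion formula (IV.5), in its discrete form with the sum in the denominator, can be sketched on the 13-point grid mentioned above (a modern illustration; the uniform prior is assumed purely for the example):

```python
from math import comb

def posterior(n, n1, prior):
    """Eq. (IV.5) on a discrete grid; prior maps each x to v(x)."""
    like = {x: comb(n, n1) * x**n1 * (1 - x)**(n - n1) for x in prior}
    norm = sum(prior[x] * like[x] for x in prior)       # the P(A) sum
    return {x: prior[x] * like[x] / norm for x in prior}

# The 13 admissible values 0, 1/12, ..., 1, with a uniform prior v(x).
grid = [i / 12 for i in range(13)]
flat = {x: 1 / 13 for x in grid}
q10 = posterior(10, 7, flat)   # posterior after n = 10, n1 = 7
```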

*4.* In some problems *it may be justified to assume that v*(*x*) *be constant,* i.e., *that v has the same value for all x.* (This was so for the geometric problem which Bayes himself considered.) Boole spoke of this assumption as of a case of "equal distribution of ignorance." This is not an accurate denotation since often this assumption is made not out of ignorance but because it seems adequate. R. A. Fisher argued with much passion against "Bayes's principle." However, Bayes did not have any such principle. He did not start with a general formula Eq.(IV.5) and then apply a "principle" by which *v*(*x*) could be neglected. He correctly solved a particular problem. The general formula, Eq.(IV.5), is due to Laplace.

How about the *v*(*x*) in our original example? Here, for a body which behaves and looks halfway like a die, the assumption of constant *v*(*x*) makes no sense. If, e.g., we bought our dice at Woolworth's we might take *v*(*x*) as a curve which differs from 0 only in the neighborhood of *x* = 1/6. If we suppose a loaded die another *v*(*x*) may be appropriate. The trouble is, of course, that sometimes we have no way of knowing anything about *v*(*x*). Before continuing our discussion we review the facts found so far, regarding Bayes: (a) he was the first to introduce and use conditional probability; (b) he was the first to formulate correctly and solve a problem of inverse probability; (c) he did not consider the general problem Eq.(IV.5).

*5.* Regarding *v*(*x*) we may summarize as follows: (a) if we *can* make an adequate assumption for *v*(*x*) we can compute *q_n*(*x*|*n*1); (b) if we ignore *v*(*x*) and have no way to assume it and *n* is a small or moderate number we cannot make an inference; (c) Laplace has proved (Section V, 6) that *even if we do not know v*(*x*) *we can make a valid inference if n is large* (and certain mathematical assumptions for *v*(*x*) are known to hold). This is not as surprising as it may seem. Clearly, if we toss a coin 10 times and heads turns up 7 times and we know nothing else about the coin, an inference regarding the unknown *q* of this coin is unwarranted. If, however, 7,000 heads out of 10,000 turn up then, even if this is all we know, the inference that *q* > 1/2 and not very far from 0.7 is very probable. The proof of (c) is really quite a simple one (see von Mises [1964], pp. 339ff.) but we cannot give it here. We merely state here the most important property of the right-hand side of Eq.(IV.5)—writing now *q_n*(*x*) instead of *q_n*(*x*|*n*1). Independently of *v*(*x*), *q_n*(*x*) *shows the property of condensation* as *n* increases more and more, a condensation about the observed success frequency *n*1/*n* = *r.* Indeed the following theorem holds:

*If the observation of an n times repeated alternative has shown a frequency r of success, then, if n is sufficiently large, the probability for the unknown success-chance to lie between r − ε and r + ε is arbitrarily close to unity.*

This is called *Bayes's theorem,* clearly a kind of converse of Bernoulli's theorem, the observed *r* playing here the role of the theoretical *q.*
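Laplace's result (c), that for large *n* the influence of the prior *v*(*x*) vanishes, can be observed numerically (a modern sketch; the skewed prior proportional to *x* is an arbitrary choice for illustration):

```python
from math import comb

def mass_near_r(n, n1, prior, eps):
    """Posterior probability, via Eq. (IV.5) on a grid, that the
    success-chance lies within eps of the observed frequency r = n1/n."""
    post = {x: v * comb(n, n1) * x**n1 * (1 - x)**(n - n1)
            for x, v in prior.items()}
    norm = sum(post.values())
    r = n1 / n
    return sum(w for x, w in post.items() if abs(x - r) < eps) / norm

grid = [i / 100 for i in range(1, 100)]     # avoid the endpoints 0 and 1
skewed = {x: x for x in grid}               # a deliberately non-uniform v(x)
# With r = 0.7 fixed, the posterior mass near r grows with n,
# regardless of the prior: the condensation of Bayes's theorem.
small = mass_near_r(10, 7, skewed, 0.1)
large = mass_near_r(1000, 700, skewed, 0.1)
```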

*6.* We consider a closely related problem which

aroused much excitement. Suppose we are in a situa-

tion *where we have the right to assume that v*(*x*) =

constant *holds,* and we know the numbers *n* and *n*1.

By some additional considerations we can then compute the ace-probability *P* itself *as inferred from these data* (not only the probability *q*n(*x*) that *P* has a certain value *x*), and we find that *P* equals (*n*1 + 1)/(*n* + 2), and correspondingly 1 - *P* = (*n* - *n*1 + 1)/(*n* + 2). This formula for *P* is called *Laplace's rule of succession,*

and it gives well-known senseless results if applied in

an unjustified way. Keynes in his treatise (p. 82) says:

“No other formula in the alchemy of logic has exerted

more astonishing powers. It has established the exist-

ence of God from the basis of total ignorance and it

has measured precisely the probability that the sun

will rise tomorrow.” This magical formula must be

qualified. First of all, if *n* is small or moderate we may use the formula *only if we have good reason to assume a constant prior probability.* And then it is correct. A

general “Principle of Indifference” is not a “good

reason.” Such a “principle” states that in the absence of

any information one value of a variable is as probable

as another. However, no inference can be based on

ignorance. Second, if *n* and *n*1 are both large, then indeed *the influence of the a priori knowledge vanishes*

and we need no principle of indifference to justify the

formula. One can, however, still manage to get sense-

less results if the formula is applied to events that are

not random events, for which therefore, the reasoning

and the computations which lead to it are not valid.

This remark concerns, e.g., the joke—coming from

Laplace it can only be considered as a joke—about

using the formula to compute the “probability” that

the sun will rise tomorrow. The rising of the sun does

not depend on chance, and our trust in its rising to-

morrow is founded on astronomy and not on statistical

results.
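For readers who wish to experiment, a small sketch of the rule of succession itself; the function and sample inputs are ours, and the formula is trustworthy only under the qualifications just stated.

```python
from fractions import Fraction

def rule_of_succession(n, n1):
    """Laplace's inferred chance of success on the next trial:
    (n1 + 1)/(n + 2), valid only under a constant prior (or large n)."""
    return Fraction(n1 + 1, n + 2)

print(rule_of_succession(10, 7))   # 7 successes in 10 trials -> 2/3
print(rule_of_succession(0, 0))    # no data at all -> 1/2
```

Note that the inferred *P* never reaches 0 or 1, however long an unbroken run of successes — which is precisely what produces the "senseless results" when the formula is applied outside its domain.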

*7.* We finish with two important remarks. (a) The

idea of inference or inverse probability, the subject of

this section, is not limited to the type of problems

considered here. In our discussion, *p*n(*n*1|*x*) was the binomial probability (*n* over *n*1) *x*^*n*1 (1 - *x*)^(*n* - *n*1), but formulas like Eq.(IV.5) *can be used for drawing inferences on the value of an unknown parameter from v*(*x*) *and some p*n *for the most varied p*n. This is done in *the general theory of inference*

which, according to Richard von Mises and many others, finds a sound basis in the methods explained here (Mises [1964], Ch. X). The ideas have also entered “subjective” probability under the label “Bayesian”

(Lindley, 1965). Regarding the unknown *v*(*x*) we say: (i) if *n* is large the influence of *v*(*x*) vanishes in most problems; (ii) if *n* is small, and *v*(*x*) unknown, it may still be possible to make some well-founded assumption regarding *v*(*x*) using “past experience” (von Mises [1964], pp. 498ff.). If no assumption is possible then

no inference can be made. (The problem considered here was concerned with the posterior chance that the unknown “ace-probability” has a certain value *x* or falls in a certain interval. There are, however, other problems where such an approach is not called for and where—similarly as in subsection 6—we mainly want a good *estimate* of the unknown magnitude on the basis of the available data. To reach this aim many different methods exist. R. A. Fisher advanced the “maximum likelihood” method which has valuable properties. In our example, the “maximum likelihood estimate” equals *n*1/*n,* i.e., the observed frequency.)

(b) Like the Bernoulli-de Moivre-Laplace theorem

the Bayes-Laplace theorem has found various exten-

sions and generalizations. Von Mises also envisaged

wide generalizations of both types of Laws of Large

Numbers based on his theory of Statistical Functions

(von Mises [1964], Ch. XII).

*V. PIERRE SIMON, MARQUIS DE LAPLACE:
HIS DEFINITION OF PROBABILITY, LIMIT
THEOREMS, AND THEORY OF ERRORS*

*1.* It has been said that Laplace was not so much

an originator as a man who completed, generalized,

and consummated ideas conceived by others. Be this

as it may, what he left is an enormous treasure. In his

*Théorie analytique des probabilités* (1812) he used the powerful tools of the new, rapidly developing analysis. (The elements of probability calculus—addition, multiplication, division—were by that time firmly established.) Not all of his mathematical results are of equal

interest to the historian of thought.

*2.* We begin with the discussion of his well-known

*definition* of probability as the number of cases favora-

ble to an event divided by the number of all equally

likely cases. (Actually this conception had been used

before Laplace but not as a basic definition.) The

“equally likely cases” are *les cas également possibles, c'est à dire tels que nous soyons également indécis sur leur existence* (“cases such that we are equally undecided about their existence”; *Essai philosophique,* p. 4). Thus, for

Laplace, “equally likely” means “equal amount of

indecision,” just as in the notorious “principle of

indifference” (Section IV, 6). In this definition, the

feeling for the empirical side of probability, appearing

at times in the work of Jakob Bernoulli, strongly in

that of Hume and the logicians of Port Royal, seems

to have vanished. The main respect in which the

definition is insufficient is the following. The counting

of equally likely cases works for simple games of

chance (dice, coins). It also applies to important prob-

lems of biology and—surprisingly—of physics. But for

a general definition it is much too narrow as seen by

the simple examples of a biased die, of insurance prob-

abilities, and so on. Laplace himself and his followers

did not hesitate to apply the rules derived by means

of his aprioristic definition to problems like the above

and to many others where the definition failed. Also

in cases where equally likely cases can be defined,

different authors have often obtained different answers

to the same problem (this result was then called a

paradox). The reason is that the authors choose differ-

ent sets of cases as equally likely (Section VI, 8).

Laplace's definition, though not unambiguous and

not sufficiently general, fitted extensive classes of prob-

lems and drew authority from Laplace's great name,

and thus dominated probability theory for at least a

hundred years; it still underlies much of today's think-

ing about probability.
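Where the equally-likely-cases definition does apply, it amounts to simple counting. A minimal modern sketch (the two-dice example is our illustration, not Laplace's text):

```python
from fractions import Fraction
from itertools import product

# All 36 outcomes of two fair dice are taken as the equally likely cases.
cases = list(product(range(1, 7), repeat=2))

# Cases favorable to the event "the sum is 7".
favorable = [c for c in cases if sum(c) == 7]

# Laplace's definition: favorable cases over all equally likely cases.
p = Fraction(len(favorable), len(cases))
print(p)   # 1/6
```

For a biased die the 36 cases are no longer equally likely, and the count above says nothing — which is exactly the insufficiency described in the text.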

*3.* Laplace's *philosophy* of chance, as expounded in his *Essai philosophique,* is that each phenomenon in the physical world as well as in social developments is governed by forces of two kinds: permanent and

accidental. In an isolated phenomenon the effect of

the accidental forces may appear predominant. But,

in the long run, the accidental forces average out and

the permanent ones prevail. This is for Laplace a

consequence of Bernoulli's Law of Large Numbers.

However, while Bernoulli saw very clearly the limita-

tions of his theorem, Laplace applies it to everything

between heaven and earth, including the “favorable

chances tied with the eternal principles of reason,

justice and humanity” or “the natural boundaries of

a state which act as permanent causes,” and so on.

*4.* We have previously mentioned Laplace's contri-

butions to both Bernoulli's and Bayes's problems. It

was de Moivre's (1713) fruitful idea to evaluate *Pn*

(Section III, 2) directly for large *n.* There is no need

to discuss here the precise share of each of the two

mathematicians in the *De Moivre-Laplace formula.*

Todhunter calls this result “one of the most important

in the whole range of our subject.” Hence, for the sake

of those of our readers with some mathematical

schooling we put down the formula. *If a trial where p*(0) = *p, p*(1) = *q, p + q* = 1, *is repeated n times, where n is a large number, then the probability* *P*n *that the number x of successes be between*

*nq* - δ√(2*npq*) and *nq* + δ√(2*npq*) (V.1)

*or, what is the same, that the frequency z = x/n of success be between*

*q* - δ√(2*pq*/*n*) and *q* + δ√(2*pq*/*n*) (V.1′)

*equals asymptotically*

2Φ(δ) + (1/√(2π*npq*)) e^(-δ²). (V.2)

Here, the first term, for which we also write 2Φ(δ), is twice the famous *Gauss integral* Φ(δ) = (1/√π) ∫₀^δ e^(-t²) *dt,*

or, if δ is considered variable, the celebrated *normal distribution function.* For fairly large *n* the second term of Eq.(V.2) can be neglected and the first term comes even for moderate values of δ very close to unity (e.g., for δ = 3.5 it equals 1 up to five decimals). *The limits in Eq.(V.1′) can be rendered as narrow as we please by taking n sufficiently large and* *P*n *will always be larger than* 2Φ(δ).

This is the first of the famous *limit theorems of
probability calculus.* Eq.(V.2) exhibits the phenomenon

of *condensation* (Sections II and IV) about the midpoint, here the mean value, which means that *a probability arbitrarily close to 1 is contained in an arbitrarily narrow neighborhood of the mean value.* The present

result goes far beyond Bernoulli's theorem in sharpness

and precision, but conceptually it expresses the same

properties.
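The theorem can be checked by a Monte Carlo experiment (our sketch, under the reading 2Φ(δ) = erf(δ) and the interval limits as written above; sample sizes and seed are arbitrary). For finite *n* the empirical value exceeds 2Φ(δ) slightly, as the small second term of Eq.(V.2) predicts.

```python
import math
import random

def covering_probability(n, q, delta, trials=10000, seed=1):
    """Empirical chance that the number of successes of n Bernoulli(q)
    trials lies within nq ± delta * sqrt(2npq)."""
    rng = random.Random(seed)
    half_width = delta * math.sqrt(2 * n * (1 - q) * q)
    hits = 0
    for _ in range(trials):
        x = sum(rng.random() < q for _ in range(n))   # successes in n trials
        hits += abs(x - n * q) <= half_width
    return hits / trials

p_hat = covering_probability(n=1000, q=1 / 6, delta=1.0)
print(round(p_hat, 3), round(math.erf(1.0), 3))   # compare with erf(1)
```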

*5.* Thus, the distribution of the number *x* of successes

obtained by repetition of a great number of binary

alternatives is asymptotically a normal curve. As pre-

viously indicated more general theorems of this type

hold. If, as always, we denote success by 1, failure by

0, then *x* = *x*1 + *x*2 + ... + *x*n, where each *xi* is either

0 or 1. It is then suggestive to study also cases where

*x*1, *x*2,..., *x*n are not as simple as in the above problem (Section VIII, 2).

*6.* We pass to Laplace's limit theorem for Bayes's

problem. Set (Section IV, 3) *Q*n(*x*) = ∫₀^*x* *q*n(*t*|*n*1) *dt*; let *n* tend towards infinity while *n*1/*n* = *r* is kept fixed. The difference

*Qn*(*x*2) - *Qn*(*x*1) is the probability that the object of

our inference (for example, the unknown “ace”-

probability) be between *x*1 and *x*2. Laplace's limit result

looks similar to Eq.(V.1′) and Eq.(V.2). *The probability that the inferred value lies in the interval*

(*r* - *t*√(2*r*(1 - *r*)/*n*), *r* + *t*√(2*r*(1 - *r*)/*n*))

*tends to* 2Φ(*t*) as *n* → ∞. Bayes's theorem (Section IV, 5) follows as a particular case. The most remarkable feature of this Laplace result is that *it holds independently of the prior probability.* This is proved without any sort of “principle of indifference.” This mathematical result corresponds, of course, to the fact that

any prior knowledge regarding the properties of the

die becomes irrelevant if we are in possession of a large

number of results of ad hoc observations.

*7.* To appreciate what now follows we go back for

a moment to our introductory pages in Section I. We

said that the Greek ideal of science was opposed

to the construction of hypotheses on the basis of

empirical data. “The long history of science and phi-

losophy is in large measure the progressive emancipa-

tion of men's minds from the theory of self-evident

truth and from the postulate of complete certainty as

the mark of scientific insight” (Nagel, p. 3).

The end of the eighteenth and the beginning of the

nineteenth century saw the beginnings and develop-

ment of a “theory of errors” developed by the greatest

minds of the time. A long way from the ideal of abso-

lute certitude, scientists are now ready to use observa-

tions, even inaccurate ones. Most observations which

depend on measurements (in the widest sense) *are* liable

to accidental errors. “Exact” measurements exist only

as long as one is satisfied with comparatively crude

results.

*8.* Using the most precise methods available one still

obtains small variations in the results, for example, in

the repeated measurements of the distance of two fixed

points on the surface of the earth. We assume that this

distance *has* some definite “true” value. Let us call it *a*; it follows that the results *x*1, *x*2,... of several measurements of the same magnitude must be incorrect

(with the possible exception of one). We call *z*1 =

*x*1 - *a, z*2 = *x*2 - *a,*... the *errors* of measurement.

These errors are considered as *random deviations*

which oscillate around 0. Therefore, there ought to

exist a *law of error,* that is a probability *w*(*z*) of a certain

error *z.*

It is a fascinating mathematical result that, by means

of the so-called “theory of elementary errors” we ob-

tain at once the form of *w*(*z*). This theory, due to Gauss,

assumes that each observation is subject to a large

number of sources of error. Their sum results in the

observed error *z.* *It follows then at once from the generalization of the de Moivre-Laplace result* (Section V, 5, Section VIII, 3) *that the probability of any resulting error z follows a normal or Gaussian law* w(*z*) = (*h*/√π) e^(-*h*²*z*²). This *h,* the so-called *measure of precision,* is not determined by this theory. The larger *h* is, the more concentrated is this curve around *z* = 0.
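The mechanism of elementary errors can be imitated numerically. In this sketch (our construction, with arbitrary numbers) each measurement error is the sum of 100 small independent uniform disturbances, and the resulting errors behave like a normal sample:

```python
import random
import statistics

rng = random.Random(42)

def measurement_error(k=100, scale=0.01):
    """One observed error: the sum of k small 'elementary errors'."""
    return sum(rng.uniform(-scale, scale) for _ in range(k))

errors = [measurement_error() for _ in range(20000)]
mu = statistics.fmean(errors)
sigma = statistics.pstdev(errors)

# Under a normal law about 68.3% of the errors fall within one sigma.
inside = sum(abs(e - mu) < sigma for e in errors) / len(errors)
print(round(mu, 4), round(sigma, 4), round(inside, 3))
```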

*9.* The problem remains to determine *the most
probable value of x.* The famous

*method of least squares*

was advanced as a manipulative procedure by

Legendre (1806) and by Gauss (1809). Various attempts

have been made to justify this method by means of

the theory of probability, and here the priority regard-

ing the basic ideas belongs to Laplace. His method was

adopted later (1821-23) by Gauss. The last steps to-

wards today's foundation of the least squares method

are again due to Gauss.
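In its simplest setting, repeated measurements *x*1,..., *x*n of one magnitude, the method amounts to minimizing the sum of squared residuals, which the arithmetic mean does. A sketch with invented numbers, not a reconstruction of Gauss's or Laplace's own derivations:

```python
def least_squares_estimate(xs):
    """The value a minimizing sum((x - a)^2): the arithmetic mean."""
    return sum(xs) / len(xs)

def sum_of_squares(xs, a):
    return sum((x - a) ** 2 for x in xs)

xs = [10.02, 9.98, 10.05, 9.97, 10.01]   # hypothetical measurements
a_hat = least_squares_estimate(xs)
print(round(a_hat, 3))                                               # 10.006
print(sum_of_squares(xs, a_hat) < sum_of_squares(xs, a_hat + 0.01))  # True
```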

*10.* Any evaluation of Laplace's contribution to the

history of probabilistic thought must mention his deep

interest in the applications. He realized the applica-

bility of probability theory in the most diverse fields

of man's thinking and acting. (Modern physics and

modern biology, replete with probabilistic ideas, did

not exist in Laplace's time.) In his *Mécanique céleste*

Laplace advanced probabilistic theories to explain

astronomical facts. Like Gauss he applied the theory

of errors to astronomical and geodetic operations. He

made various applications of his limit theorems. Of

course, he studied the usual problems of human statis-

tics, insurances, deaths, marriages. He considered

questions concerned with legal matters (which later

formed the main subjects of Poisson's great work). As

soon as Laplace discovered a new method, a new

theorem, he investigated its applicability. This close

connection between theory and meaningful observa-

tional problems—which, in turn, originated new theo-

retical questions—is an unusually attractive feature of

this great mind.

*VI. A TIME OF TRANSITION*

*1.* The influence of the work of Laplace may be

considered under three aspects: (a) his analytical

achievements which deepened and generalized the

results of his predecessors and opened up new avenues;

(b) his definition of probability which seemed to pro-

vide a firm basis for the whole subject; (c) in line with

the rationalistic spirit of the eighteenth century, a wide

field of applications seemed to have been brought

within the domain of reason. Speaking of probability, *Notre raison cesserait d'être esclave de nos impressions* (“Our reason would cease to be the slave of our impressions”).

*2.* Of the contributions of the great S. D. Poisson

laid down in his *Recherches sur la probabilité des
jugements*... (1837), we mention first a generalization

of James Bernoulli's theorem (Section II). Considered

again is a sequence of binary alternatives—in terms

of repeatedly throwing a die for “ace” or “not-ace”—

Poisson abandoned the condition that all throws must be carried out with the same or identical dice; he allowed *a different die* to be used for each throw. If *q*(*n*) denotes *the arithmetical mean* of the first *n* ace-probabilities *q*1, *q*2,..., *q*n, then a theorem like Bernoulli's holds where now *q*(*n*) takes the place of the previously fixed *q.* Poisson denotes this result as the

Law of Large Numbers. A severe critic like J. M.

Keynes called it “a highly ingenious theorem which

extends widely the applicability of Bernoulli's result.”

To Keynes's regret the condition of independence still

remains. It was removed by Markov (Section VIII, 7).
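A Monte Carlo sketch of Poisson's generalization (our own construction; the range of dice is an arbitrary choice): every throw uses a different “die,” and the observed ace frequency still tracks the mean of the individual ace-probabilities.

```python
import random

rng = random.Random(7)
n = 200_000

# A different "die" for each throw: throw i has its own ace-probability q_i.
qs = [rng.uniform(0.05, 0.30) for _ in range(n)]
aces = sum(rng.random() < q for q in qs)

freq = aces / n       # observed ace frequency
qbar = sum(qs) / n    # Poisson's arithmetical mean q(n)
print(round(freq, 4), round(qbar, 4))   # the two values nearly coincide
```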

*3.* Ever since the time of Bernoulli one could ob-

serve the duality between the empirical aspect of

probability (i.e., frequencies) and a mathematical the-

ory, an algebra, that reflected the relations among the

frequencies. Poisson made an important step by stating

this correspondence explicitly. In the Introduction to

his work he says: “In many different fields we observe

empirical phenomena which appear to obey a certain

general law.... This law states that the ratios of

numbers derived from the observation of very many

similar events remain practically constant provided

that the events are governed partly by constant factors

and partly by variable factors whose variations are

irregular and do not cause a systematic change in a

definite direction. Characteristic values of these pro-

portions correspond to the various kinds of events. The

empirical ratios approach these characteristic values

more and more closely the greater the number of

observations.” Poisson called this law again the Law

of Large Numbers. We shall, however, show in detail

in Section VII that this “Law” and the Bernoulli-

Poisson theorem, explained above, are really two

different statements. The sentences quoted above from

Poisson's Introduction together with a great number

of examples make it clear that here Poisson has in mind

a generalization of empirical results. The “ratios” to

which he refers are the frequencies of certain events

in a long series of observations. And the “characteristic

values of the proportions” are the chances of the

events. We shall see that this is essentially the “postu-

late” which von Mises was to introduce as the

empirical basis of frequency theory (Sections VII, 2-4).

*4.* Poisson distinguished between “subjective” and

“objective” probability, calling the latter “chance,” the

former “probability” (a distinction going back to

Aristotle). “An event has by its very nature a *chance,*

small or large, known or unknown, and it has a *proba-
bility* with respect to our knowledge regarding the

event.” We see that we are relinquishing Laplace's

definition in more than one direction.

*5.* Ideas expressed in M. A. A. Cournot's beautifully

written book, *Exposition de la théorie des chances et
des probabilités* (Paris, 1843) are, in several respects

similar to those of Poisson. For Cournot probability

theory deals with certain frequency quotients which

would take on completely determined fixed values if

we could repeat the observations towards infinity. Like

Poisson he discerned a subjective and objective aspect

of probability. “Chance is objective and independent

of the mind which conceives it, and independent of

our restricted knowledge.” Subjective probability may

be estimated according to “the imperfect state of our

knowledge.”

*6.* Almost from the beginning, certainly from the

time of the Bernoullis, it was hoped that probability

would serve as a basis for dealing with problems con-

nected with the “*Sciences Morales.*” Laplace studied

judicial procedures, the credibility of witnesses, the

probability of judgments. And we know that Poisson

was particularly concerned with these questions.

Cournot made legalistic applications *aux documents
statistiques publiés en France par l'Administration de
la Justice.* A very important role in these domains of

thought is to be attributed to the Belgian astronomer

L. A. J. Quételet who visited Paris in 1823 and was

introduced to the mathematicians of *La grande école française,* to Laplace, and, in particular, to Poisson. Between 1823 and 1873 Quételet studied statistical problems. His *Physique sociale* of 1869 contains the construction of the “average man” (*homme moyen*).

Keynes judged that Quételet “has a fair claim to be

regarded as the parent of modern statistical methods.”

*7.* It is beyond the scope of this article to delve

into statistics. Nevertheless, since Laplace, Poisson,

Cournot, and Quételet have been mentioned with re-

spect to such applications, we have to add the great

name of W. Lexis whose *Theorie der Massenerschei-
nungen in der menschlichen Gesellschaft* (“Theory of

Mass Phenomena in Society”) appeared in 1877. He

was perhaps the first one to attempt an investigation

whether, and to what extent, general series of observa-

tions can be compared with the results of games of

chance and to propose criteria regarding these ques-

tions. In other words, he inaugurated “theoretical sta-

tistics.” His work is of great value with respect to

methods and results.

*8.* We return to probability proper. The great prestige of the classical theory could not silence the objections to the concept of equally likely events and actually to the “principle of insufficient reason” (or briefly “indifference principle”) on which this concept rests (Section IV, 6). The principle

enters the classical theory in two ways: (a) in Laplace's

definition (Section V, 2) and (b) in the so-called Bayes

principle (Section IV, 4). However, distrust of the

indifference principle kept mounting. It is so easy to

disprove it. We add one particularly striking counter-

example where the results are expressed by continuous

variables.

A glass contains a mixture of wine and water and

we know that the ratio *x* = water/wine lies between

1 and 2 (at least as much water as wine and at most

twice as much water). The Indifference Principle tells

us to assume that to equal parts of the interval (1, 2)

correspond equal probabilities. Hence, the probability

of *x* to lie between 1 and 1.5 is the same as that to

lie between 1.5 and 2. Now let us consider the same

problem in a different way, namely, by using the ratio

*y* = wine/water. On the data, *y* lies between 1/2 and

1, hence by the Indifference Principle, there corre-

sponds to the interval (1/2, 3/4) the same probability as

to (3/4, 1). But if *y* = 3/4, then *x* = 4/3 = 1.333... while

before, the midpoint was at *x* = 1.5. The two results

clearly contradict each other.
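The contradiction can also be seen numerically; a sketch with an arbitrary seed and sample size. The event *x* < 1.5 receives probability 1/2 when indifference is applied to *x*, but 2/3 when applied to *y*, since *x* < 1.5 is the same event as *y* > 2/3.

```python
import random

rng = random.Random(0)
N = 100_000

# Indifference applied to x = water/wine, uniform on (1, 2).
p_x = sum(rng.uniform(1, 2) < 1.5 for _ in range(N)) / N

# Indifference applied to y = wine/water, uniform on (1/2, 1);
# the event x < 1.5 is exactly the event y > 2/3.
p_y = sum(rng.uniform(0.5, 1) > 2 / 3 for _ in range(N)) / N

print(round(p_x, 3), round(p_y, 3))   # about 0.5 versus about 0.667
```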

Alongside the admiration for the impressive structure Laplace had erected—supposedly on the basis of his definition—the question arose how the mathematicians

managed to derive from abstractions results relevant

to experience. Today we know that the valid objections

against Laplace's equally likely cases do not invalidate

the foundations of probability which are not based on

equally likely cases; we also understand better the

relation between foundations and applications.

*9.* One way to a satisfactory foundation was to

abandon the obviously unsatisfactory Laplacean de-

finition and to build a theory based on the empirical

aspect of probability, i.e., on frequencies. Careful

observations led again and again to the assumption that

the “chances” were approached more and more by the

empirical ratios of the frequencies. This conception—

which was definitely favored by Cournot—was fol-

lowed by more or less outspoken statements of

R. L. Ellis, and with the work of J. Venn an explicit

frequency conception of probability emerged. This

theory had a strong influence on C. S. Peirce. In respect

to probability Peirce was “more a philosopher than

a mathematician.” The theory of probability is “the

science of logic quantitatively treated.” In contrast to

today's conceptions (Section VII, 5) the first task of

probability is for him to compute (or approximate) a

probability by the frequencies in a long sequence of

observations; this is “inductive inference.” The problem considered almost exclusively in this article, the

“direct” problem, is his “probable inference.” He

strongly refutes Laplace's definition, and subjective

probability is to be excluded likewise. He has then—

understandably—great difficulty to justify or to deduce

a meaning for the probability of a single event (see

Section IV of Peirce's “Doctrine of Chances”). The

concept of probability as a frequency in Poisson,

Cournot, Ellis, Venn, and Peirce (see also Section VII,

6) appears clearly in von Mises' so-called “first postu-

late” (Section VII, 4). These ideas will be discussed

in the context of the next section.

*VII. FREQUENCY THEORY OF PROBABILITY.
RICHARD VON MISES*

*1.* As stated at the end of Section VI, the tendency

developed of using frequency objective as the basis

of probability theory. L. Ellis, J. Venn, C. S. Peirce,

K. Pearson, et al. embarked on such an empirical

definition of probability (Section VI, 9 and 3). In this

direction, but beyond them in conceptual clarity and

completeness, went Richard von Mises who published

in 1919 an article “Grundlagen der Wahrscheinlich-

keitsrechnung” (*Mathematische Zeitschrift,* 5 [1919],

52-99). Probability theory is considered as a scientific

theory in mathematical form like mechanics or

thermodynamics. Its subjects are *mass phenomena* or

*repeatable events,* as they appear in games of chance,

in insurance problems, in heredity theory, and in the

ever growing domain of applications in physics.

*2.* We remember the conception of Poisson given

in Section VI, 3. Poisson maintains that in many differ-

ent fields of experience a certain *stabilization of rela-
tive frequencies* can be observed as the number of

observations—of the same kind—increases more and

more. He considered this “Law of Large Numbers,”

as he called it, the basis of probability theory. Follow-

ing von Mises, we reserve “Law of Large Numbers”

for the Bernoulli-Poisson theorem (Sections II, and VI,

2), while the above empirical law might be denoted as

Poisson's law.

*3.* The essential feature of the probability concept

built on Poisson's Law is the following. For certain

types of events the outcome of a single observation

is (either in principle or practically) not available, or

not of interest. It may, however, be possible to consider

the single case as embedded in an ensemble of similar

cases and to obtain for this mass phenomenon mean-

ingful global statements. This coincides so far with

Venn's notion. The classical examples are, of course,

the games of chance. If we toss a die once we cannot

predict what the result will be. But if we toss it 10,000

times, we observe the emergence of an increasing con-

stancy of the six frequencies.

A similar situation appears in social problems

(observed under carefully specified conditions) such as

deaths, births, marriages, suicides, etc.; in the “random

motion” of the molecules of a gas; or in the inheritance

of Mendelian characters.

In each of these examples we are concerned with

events whose outcome may differ in one or more re-

spects: color of a certain species of flowers; shape of

the seed; number on the upper face of a die; death

or survival between age 40 and 41 within a precisely

defined group of men; components of the velocity of

a gas molecule under precise conditions, and so on.

For the mass phenomenon, the large group of flowers,

the tosses with the die, the molecules, we use provi-

sionally the term *collective* (see complete definition in

subsection 7, below), and we call *labels,* or simply

results, the mutually exclusive and exhaustive proper-

ties under observation. In Mendel's experiment of the

color of the flower of peas, the labels are the three

colors red, white, pink. If a die is tossed until the 6

appears for the first time with the number of this toss

as result, the labels are the positive integers. If the

components of a velocity vector are observed the

collective is three-dimensional.

*4.* Von Mises assumed like Poisson that to the various

kinds of repetitive events characteristic values corre-

spond which characterize them in respect to the fre-

quency of each label. Take the die experiment: putting

a die into a dice box; shaking the cup; tossing the die.

The labels are, for example, the six numbers 1, 2,...,

6 and it is assumed that there is a characteristic value

corresponding to the frequency of the event “6.” This

value is a *physical constant* of the event (it need, of

course, not be 1/6) and it is measured approximately

by the frequency of “6” in a long sequence of such

tosses and is approached more and more the longer

the sequence of observations. We call it *the probability of “6”* (Poisson says “chance”) *within the considered collective.* If the die is tossed 1,000 times within an hour we may notice that the frequency of “6” will

no longer change in the first decimal, and if the experi-

ment is continued for ten hours, three decimals, say,

will remain constant and the fourth will change only

slightly. To get rid of the clumsiness of this statement von Mises used the concept of *limit.* If in *n* tosses the “6” has turned up *n*6 times we consider the limit value

lim(*n* → ∞) *n*6/*n* (VII.1)

as the probability of “6” in this collective. Similarly, a probability exists for the other labels. The definition (VII.1), which essentially coincides with Poisson's, Ellis' and Venn's assumptions, is often denoted as *von Mises' first postulate.* It is of the same type as one which defines “velocity” as lim(Δ*t* → 0) Δ*s*/Δ*t*, where Δ*s*/Δ*t* is the ratio of

the displacement of a particle to the time used for it.
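The stabilization that motivates Eq.(VII.1) can be imitated with a simulated (possibly biased) die; in this sketch of ours the value 0.21 plays the role of the unknown physical constant and is an arbitrary choice.

```python
import random

rng = random.Random(123)
P6 = 0.21            # the die's "physical constant" for the label "6"

sixes = 0
freqs = {}
for n in range(1, 100_001):
    sixes += rng.random() < P6
    if n in (100, 10_000, 100_000):
        freqs[n] = sixes / n

print(freqs)   # the frequencies settle near 0.21 as n grows
```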

*5.* Objections of the type that one cannot make

infinitely many tosses are beside the point. We consider

frequency as an approximate measure of the physical

constant probability, just as we measure temperature

by the extension of the mercury, or density by Δ*m*/Δ*v*

as Δ*v* the volume of the body decreases more and more

(containing always the point at which the density is

measured). It is true that we cannot make infinitely

many tosses. But neither do we have procedures to

construct and measure an infinitely small volume and

actually we cannot measure any physical magnitude

with absolute accuracy. Likewise, an infinitely long,

infinitely thin straight line does not “exist” in our real

world; its home is the boundless emptiness of Euclidean

space. Nevertheless, theories based on such abstract

concepts are fundamental in the study of spatial rela-

tions.

We mention a related viewpoint: as in rational

theories of other areas of knowledge it is not the task

of probability theory to ascertain by a frequency ex-

periment the probability of every conceivable event

to which the concept applies, just as the direct meas-

urement of lengths and angles is not the task of geome-

try. Given probabilities serve as the *initial data* from

which we derive new probabilities by means of the

rules of the calculus of probability. Note also that we

do not imply that in scientific theories probabilities

are necessarily *introduced* by Eq.(VII.1). The famous

probabilities 1/4, 1/2, 1/4 of the simplest case of Mendel's

theory *follow from his theory of heredity* and are then

verified (approximately) by frequency experiments. In

a similar way, other *theories,* notably in physics, *provide
theoretical probability distributions* which are then

verified either directly, or indirectly through their

consequences.

*6.* We have mentioned before that von Mises' con-

ception of a long sequence of observations of the same

kind, and even definition Eq.(VII.1), are not absolutely

new. Similar ideas had been proposed by Ellis, Venn,

and Peirce. Theories of Fechner and of Bruns are

related to the above ideas and so is G. Helm's *Proba-
bility Theory as the Theory of the Concept of Collectives*

(1902). These works did not lead to a complete theory

of probability since they failed to incorporate some

property of a “collective” which would characterize

randomness. To have attempted this is the original and

characteristic feature of von Mises' theory.

*7.* If in the throwing of a coin we denote “heads” by 1 and “tails” by 0, the sequence of 0's and 1's will be a “random sequence.” It will exhibit an *irregular* appearance like 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1,... and not look like a regular sequence such as 0, 1, 0, 1, 0, 1,.... Attempting to characterize a random sequence von Mises was led to the concept of a *place selection.*

From an infinite sequence ω: *x*1, *x*2,... of labels an infinite subsequence ω′: *x*′1, *x*′2,... is selected by means of a rule which determines univocally for every *x*v of ω whether or not it appears in ω′. The rule may depend on the subscript *v* of *x* and on the values *x*1, *x*2,..., *x*v-1 *of terms which precede x*v *but it must not depend on x*v *itself or on subsequent terms.* We call a sequence ω *insensitive* to a specific place selection *s* if the frequency limits of the labels, which by Eq.(VII.1) exist in ω, exist again in ω′ and are the same as in ω. The

simplest place selections are the *arithmetical* ones where the decision whether or not *x*v is selected depends only on *v*. “Select *x*v if *v* is even.” “Select *x*v if *v* is not prime,” etc. Another important type of selection is to use some of the *x*'s preceding *x*v. “Select *x*v if each preceding term equals 0.” “Select *x*v if *v* is even and the three immediately preceding terms are each equal to 1.” It is clear that such place selections are “gambling systems” and with this terminology von Mises' *second postulate* states that *for a random sequence no gambling system exists.* Sequences satisfying both postulates are called *collectives* or simply *random sequences.*
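The insensitivity of a random sequence to such place selections lends itself to a small numerical sketch. The following Python fragment is the editor's own illustration, not von Mises' formalism: an ordinary pseudorandom 0/1 sequence merely stands in for a collective, and two of the place selections described above are applied to it.

```python
import random

random.seed(1)
omega = [random.randint(0, 1) for _ in range(200_000)]  # stand-in "collective"

def freq_of_ones(seq):
    return sum(seq) / len(seq)

# Arithmetical place selection: keep x_v when the subscript v is even.
even_sub = [x for v, x in enumerate(omega, start=1) if v % 2 == 0]

# Selection by preceding labels: keep x_v when the three previous terms are all 1.
after_111 = [omega[v] for v in range(3, len(omega))
             if omega[v - 3] == omega[v - 2] == omega[v - 1] == 1]

# For a random sequence both "gambling systems" leave the frequency near 1/2.
print(round(freq_of_ones(omega), 2))
print(round(freq_of_ones(even_sub), 2))
print(round(freq_of_ones(after_111), 2))
```

All three frequencies come out close to 1/2: neither selection rule changes the odds, which is exactly the content of the second postulate.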

*8.* Von Mises' original formulation (1919, p. 57; see

above, Section VII, 1) seems to imply that he had in

mind insensitivity to all place selections. It can, how-

ever, easily be seen that an unqualified use of the term

“all” or of an equivalent term, leads to contradiction,

a set-theoretical difficulty not noticed by von Mises.

Formulating the second postulate more precisely as

insensitivity to *countably many* place selections the

mathematician A. Wald has shown in 1937 that *the
postulate of randomness in this form together with the
postulate of the existence of frequency limits are con-
sistent.* (If “countably many” is specified in an adequate

sense of mathematical logic we may even say: if one

can explicitly indicate *one single* place selection which

alters the frequency limit of 0, say, then ω is not a

random sequence.) Wald proved actually much more,

namely, that collectives are, so to speak, the rule. A particular result: *almost all* (in a mathematical sense) *infinite sequences of* 0'*s and* 1'*s have the frequency limit* 1/2 *and exhibit the type of irregularity described by the second postulate* (*von Mises* [1964], Appendix One; or *von Mises* [1957], p. 92).

*9.* The concept of sequences which satisfy the two

postulates is only the starting point of the theory. In

his 1931 textbook von Mises has shown that from this

starting point by means of precisely defined *operations*

a comprehensive system of probability theory can be

built. First, the definition yields a reasonable *addition theorem.* Consider the probability *P* that *within one and the same collective* a result belonging to either of two disjoint sets *A* or *B* is to occur. The corresponding frequency is, in an immediately understandable notation, (*n*A + *n*B)/*n* = *n*A/*n* + *n*B/*n*, and by Eq.(VII.1), letting *n* → ∞, *P* = *P*(*A*) + *P*(*B*).

Previous theories that did not use some concept like

frequency “within one and the same collective” could

not be counted on to provide a correct addition

theorem. Indeed the probability of arbitrary “mutually

exclusive” events can have any sum, even greater than

1. We also understand better now the definition of

“conditional probability” introduced in Eq.(IV.1). The

proportion of people who, owning an automobile, also own a bicycle clearly equals *n*AB/*n*A, and if *n* is the size of the population under consideration then *n*AB/*n*A = (*n*AB/*n*) : (*n*A/*n*), and if we take the limits as *n* → ∞, Eq.(IV.1) follows. By means of these and other

“operations” new random sequences are derived from

given ones.
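The frequency identity behind the definition of conditional probability can be checked on a toy population. The sketch below is the editor's own illustration; the ownership rates 0.6 and 0.5 are invented for the example.

```python
import random

random.seed(7)
n = 100_000
# Hypothetical population: car and bicycle ownership drawn independently,
# with invented rates 0.6 and 0.5.
people = [(random.random() < 0.6, random.random() < 0.5) for _ in range(n)]

n_A  = sum(1 for car, bike in people if car)           # nA:  car owners
n_AB = sum(1 for car, bike in people if car and bike)  # nAB: own both

lhs = n_AB / n_A               # proportion of car owners who also own a bicycle
rhs = (n_AB / n) / (n_A / n)   # ratio of the two frequencies in the population
print(abs(lhs - rhs) < 1e-12)  # True: the identity nAB/nA = (nAB/n):(nA/n)
```

The identity holds for every finite *n* by plain arithmetic; Eq.(IV.1) is its limit as *n* grows.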

It is obvious that random sequences are generated

as the results of repeated independent trials. However,

the theory of the collective is by no means limited to

problems of independence. In von Mises (1964, pp.

184-223), under the heading “some problems of non-

independent events,” an outline of a theory of

“arbitrarily linked” (= compatible but dependent)

events is given, followed by applications to Mendelian

heredity where the important concept of a “linkage”

distribution of genes is introduced, and by an introduc-

tion to the theory of “Markov chains,” where the

successive games depend on *n* conditional proba-

bility-distributions, so-called transition probabilities.

All these problems can be considered within the

framework of von Mises' theory. The key to the under-

standing of this apparent contradiction is, in my opin-

ion, the working with more-dimensional collectives;

*p*(*x, y, z*) may well be the probability in a three-

dimensional collective without its being necessarily

equal to *p*1(*x*)*p*2(*y*)*p*3(*z*). If we denote a triple *x, y, z*

by ξ, say, then the sequence ω in the randomness definition of subsection 7, above, is a sequence ξ1, ξ2,... *of such triples* and the ξv,... occurring in a place selection are selected by a rule *for the triples,* while the three components of the triples can be arbitrarily linked with each other.

Owing to the initially built-in relations between

basic concepts and observations the theoretical struc-

ture conserves its relation to the real world.

We also note that it is very easy to show that in

cases where Laplace's equally likely cases exist (games

of chance, but also certain problems of biology and

of physics), the von Mises definition reduces to that of

Laplace.

*10.* We finish by discussing Bernoulli's theorem

(Section II) in terms of Laplace's and of von Mises'

definition. Set, for simplicity, *p*(0) = *p*(1) = 1/2. We

have from Eq.(II.1) that *p*n(*x*) = (*n* choose *x*)(1/2)^*n* and the theorem states that *with increasing n the proportion of those sequences of length n for which the frequency of* 0'*s, n*0/*n*, *deviates from* 1/2 *by less than ε, approaches unity.* This formulation corresponds to Laplace's definition. Let us consider it more closely. Take

ε = 0.1; then the just-described interval is (0.4, 0.6) and we denote, as in Section II, by *P*n the probability that the frequency of 0's out of *n* results (0's and 1's) be between 0.4 and 0.6. Now compute, for example, *P*n for *n* = 10. We find easily *P*10 = 672/1024 = 0.656. That means in Laplace's sense that of the 2^10 = 1,024 possible combinations of two items in groups of ten, 672 have the above property (namely, that for them *n*0/*n* is between 0.4 and 0.6). Likewise we obtain *P*1000 ≈ 1.000 and with the classical definition this means that most of the 2^1000 combinations of two items in groups of 1,000 have the above property. But since

the days of Bernoulli the result for

*P*1000 has been

interpreted in a different way, saying: “*If n is large, the event under consideration* (here 0.4 ≤ *n*0/*n* ≤ 0.6) *will occur almost always.*” This *is an unjustified transition from the combinatorial result*—which Laplace's theory gives—*to one about occurrence.* The statement about occurrence can be justified only by defining “a coin of probability *p* for heads” *in a way which establishes from the beginning a connection between p and the frequency of the occurrence of heads; and one must then adhere to this connection whenever the term probability occurs.* In von Mises' theory the fact that *P*n → 1 means, of course: *if groups of n trials are observed very often then the frequency of those groups which show an n*0/*n very close to p tends towards unity.* This is the generally accepted meaning of the law of large numbers and it results only in a frequency theory.

We recognize now also the difference between

Poisson's law and the Law of Large Numbers. The

latter states much more, namely that *the “stabilization”
which according to Poisson's law appears ultimately,
happens in every group of n trials if n is large.* The

reason for this difference is as follows: in von Mises'

theory the law of large numbers follows from Poisson's law *plus randomness,* and in the classical theory it follows from Laplace's definition *plus the multiplication law.* In both instances it states more than Poisson's law.

To summarize: (a) if we use Laplace's definition,

Bernoulli's theorem becomes a statement on binomial

coefficients and says nothing about reality; (b) if we

start out with a frequency definition of probability

(equivalent to Poisson's law) and assume in addition

either an adequate multiplication law or randomness,

then Bernoulli's theorem *follows mathematically* and

it has precisely the desired meaning; (c) Bernoulli's

theorem goes beyond Poisson's law; (d) often

Bernoulli's theorem has been used as a “bridge” be-

tween Laplace's definition and frequency statements.

This is not possible, because, as stated in (b) above,

we need a frequency definition in order to derive

Bernoulli's theorem with the correct meaning.
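The combinatorial content of Bernoulli's theorem under (a) is easy to verify directly. The following sketch recomputes the fair-coin count discussed above (the function name is the editor's own):

```python
from math import comb

def P(n, lo=0.4, hi=0.6):
    """Laplacean probability that the frequency of 0's in n fair-coin
    tosses lies between lo and hi: a pure count of favorable sequences."""
    favorable = sum(comb(n, k) for k in range(n + 1) if lo <= k / n <= hi)
    return favorable / 2 ** n

print(P(10))             # 0.65625, i.e. 672 of the 1,024 sequences of length 10
print(P(1000) > 0.999)   # True: for n = 1000 nearly all sequences qualify
```

Note that the computation is nothing but counting binomial coefficients; whether those counts say anything about *occurrence* is exactly the question raised under (d).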

*11.* It would lead us much too far if we went beyond

a mere mentioning of the influential and important

modern statisticians R. A. Fisher, J. Neyman, E. Pear-

son, and others. Their interest is not so much in formulations (both the frequency definition and the classical viewpoint are used) as in problems of statistical inference

(see the important work of H. Cramér; and von Mises

[1964], Ch. X).

R. Carnap has advanced the concept of a *logical*

probability which means “degree of confirmation” and

which is similar to Keynes's “degree of rational belief.”

He assigns such probabilities also to nonrepeatable

events, and in his opinion it is this logical probability

which most probabilists have in mind. However,

Carnap accepts also the “statistical” or frequency

definition and he speaks of it as “probability2” while

the logical one is “probability1.” Considerations of

space limit us to only mentioning his theory as well

as Reichenbach's idea (similar to Carnap's) of using a

probability calculus to rationalize induction. We agree

with von Mises in the belief that induction, the transi-

tion from observations to theories of a general nature,

cannot be mathematized. Such a transition is not a

logical conclusion but a creative invention regarding

the way to describe groups of observed facts, an inven-

tion which, one hopes, will stand up in the face of

future observations and new ideas. It may, however,

be altered at any time if there are good reasons of an

empirical or conceptual nature.

*VIII. PROBABILITY AS A BRANCH OF
PURE MATHEMATICS*

*1.* The beginning of the twentieth century saw a

splendid development of the mathematics of proba-

bility. A few examples follow which are interesting

from a conceptual point of view.

At the end of Section III and in Section V, 4 we encountered an instance of the so-called Central Limit Theorem.

Eq.(II.1) we denoted by

*pn*(

*x*) the probability to obtain

in

*n*identical Bernoulli trials

*x*1's and

*n - x*0's, or

equivalently,

*to obtain in these n trials the sum x.*

Denote by

*Qn*(

*x*) the probability to obtain in the

*n*trials

a sum less than or equal to

*x;*then the de Moivre result

is that

*the distribution Qn*(

*x*)

*tends asymptotically to-*

wards a normal distribution.

wards a normal distribution.

*2.* Generalizations of this result might at first go

in two directions. (a) The single game need not be a

simple alternative, and (b) the *n* games need not be

identical. (We do not mention here other generaliza-

tions.) Mathematically: denote by *Vv*(*x*v) the probability

to obtain in the *v*th game a result less than or equal

to *xv*, *v* = 1, 2, ..., *n* (this definition holds for a

“discrete” and a “continuous” distribution—regarding

these concepts remember Section IV, 3). One asks for the probability *Q*n(*x*) that *x*1 + *x*2 + ... + *x*n *be less than or equal to x;* in particular as *n* → ∞. The first general and rigorous theorem of this kind was due to A. Liapounov in his “Nouvelle forme du théorème sur la limite de probabilité” (*Mémoires de l'Académie des Sciences, St. Pétersbourg,* 12 [1901]), who allowed *n* different distributions *V*v(*x*v) which satisfy a mild and easily verifiable restriction. If this “Liapounov condition” holds, *Q*n(*x*) *is asymptotically normal* just as in the original de Moivre-Laplace case. In 1922 J. W. Lindeberg gave necessary and sufficient conditions for convergence of *Q*n(*x*) towards a normal distribution.
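A rough numerical sketch of this situation, with the editor's own choice of non-identical distributions: the *v*-th game is uniform on [0, *v*], a family for which the Liapounov ratio tends to zero as *n* grows.

```python
import math
import random

random.seed(42)

N_GAMES = 50

MEAN = sum(v / 2 for v in range(1, N_GAMES + 1))       # sum of the means v/2
VAR  = sum(v * v / 12 for v in range(1, N_GAMES + 1))  # sum of variances v^2/12

def standardized_sum():
    # n non-identical games: the v-th result is uniform on [0, v].
    s = sum(random.uniform(0, v) for v in range(1, N_GAMES + 1))
    return (s - MEAN) / math.sqrt(VAR)

samples = [standardized_sum() for _ in range(20_000)]
# The empirical distribution of the standardized sum is close to normal:
p1 = sum(1 for z in samples if z <= 1.0) / len(samples)
print(round(p1, 2))   # near Phi(1) = 0.8413...
```

Even though no two games share a distribution, the standardized sum already behaves very nearly normally at *n* = 50.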

*3.* Obviously, this general proposition gives a firm

base to the theory of elementary errors (Section V, 8)

and thus to an important aspect of error-theory. Gauss

applied error theory mainly to geodetic and astronom-

ical measurements. The theory applies, however, to

instances which have nothing to do with “errors of

observations” but rather with fluctuations, with varia-

tions among results, as, for example, in the measure-

ment of the heights of a large number of individuals.

(Many examples may be found in C. V. Charlier,

*Mathematische Statistik*..., Lund, 1920.) Apart from

its various probabilistic applications the Central Limit

Theorem is obviously a remarkable theorem of analysis.

*4.* We turn to considerations which lie in a very

different direction. We remember that in the derivation

of Bernoulli's theorem we used the fundamental con-

cept of probabilistic (or “stochastic”) *independence.*

Independence plays a central role in probability the-

ory. It corresponds to the daily experience that we

may, for example, assume that trials performed in

distant parts of the world do not influence each other.

In the example of independent Bernoulli trials it means

mathematically that the probability of obtaining in *n*

such trials *x* heads and *n - x* tails in a given order

equals *q*^x *p*^(n-x).

In 1909, É. Borel, the French mathematician, gave

a purely mathematical illustration of independence.

Consider an ordinary decimal fraction, e.g., 0.246.

There exist 1,000 such numbers with three digits, as

0.000, 0.001,..., 0.999. The Laplacean probability

of the particular number 0.246 equals therefore 1/1000,

or (Π denoting “probability”): Π(*d*1 = 2, *d*2 = 4, *d*3 = 6) = 1/1000, where *d*i means the *i*th decimal digit. Now obviously: Π(*d*1 = 2) = 1/10, Π(*d*2 = 4) = 1/10, Π(*d*3 = 6) = 1/10, etc. Hence, Π(*d*1 = 2, *d*2 = 4, *d*3 = 6) = Π(*d*1 = 2) · Π(*d*2 = 4) · Π(*d*3 = 6) and we may then say with Borel that “*the decimal digits are mutually independent.*” The meaning of *t* = 0.246 is *t* = *x*1(*t*)/10 + *x*2(*t*)/100 + *x*3(*t*)/1000 + ..., where *x*1(*t*) = 2, *x*2(*t*) = 4, *x*3(*t*) = 6.

Now we define analogously the *binary expansion* of a number *t* between 0 and 1, namely *t* = ε1(*t*)/2 + ε2(*t*)/4 + ε3(*t*)/8 + ... where the ε*i*(*t*) are 0 or 1. For example, 1/3 = 0.0101010.... *The binary digits are mutually independent;* this is proved just as before for the decimal digits.
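Borel's product rule for the decimal digits can be verified by plain enumeration of the 1,000 equally likely three-digit decimals; the sketch below (the editor's illustration) does exactly that count.

```python
from fractions import Fraction

# All 1,000 equally likely three-digit decimals 0.000, 0.001, ..., 0.999.
cases = [(d1, d2, d3) for d1 in range(10) for d2 in range(10) for d3 in range(10)]

def prob(event):
    return Fraction(sum(1 for c in cases if event(c)), len(cases))

p1 = prob(lambda c: c[0] == 2)          # Pi(d1 = 2)
p2 = prob(lambda c: c[1] == 4)          # Pi(d2 = 4)
p3 = prob(lambda c: c[2] == 6)          # Pi(d3 = 6)
joint = prob(lambda c: c == (2, 4, 6))  # Pi(d1 = 2, d2 = 4, d3 = 6)

print(p1, p2, p3)             # 1/10 1/10 1/10
print(joint == p1 * p2 * p3)  # True: the digits are mutually independent
```

The joint probability 1/1000 factors exactly into the three marginal probabilities 1/10, which is Borel's observation.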

Now a sequence 0.ε1ε2ε3... is, on the one hand,

the binary expansion of a number *t* (between 0 and

1) and, on the other hand, a model of the usual game

of tossing heads or tails, if, as always, 0 means tails,

and 1 means heads. Hence this game becomes now *a
mathematical object* to which a calculus can be applied

without getting involved with coins, events, dice and

trials. The existence of such a model was apt to calm

the uneasiness felt by mathematicians and, at the same

time, to stimulate the interest in probability.

*5.* In 1919 von Mises' “Grundlagen” (Section VII,

1) appeared, followed by his books of 1928 and 1931.

His critical evaluation of Laplace's foundations (Sec-

tion V, 2) his distinction between mathematical results

and statements about reality, his introduction of some

basic mathematical concepts (label space, distribution

function, principle of randomness, to mention only a

few) brought about a new interest in the foundations

and, at the same time, pointed a way to an improved

understanding of the applications whose number and

importance kept increasing.

*6.* A few comments on the most important modern

applications of probability which, in turn, strengthened

the mathematics of probability, may seem in order.

We have seen in our consideration of the theory of

errors that, in the world of macro-mechanics, physical

measurements have only a limited accuracy. It was the

aim reached by Laplace and by Gauss to link error

theory to probability theory. A more essential connec-

tion between probability and a physical theory

emerged when statistical mechanics (Clausius, Maxwell, Boltzmann) gave a probabilistic interpretation of thermodynamical magnitudes; in particular, entropy was given in probabilistic terms, and for the first time a major law of nature

was formulated as a statistical proposition. Striking

success of statistical arguments in the explanation of

physical phenomena appeared in the statistical interpretation of Brownian motion (Einstein, Smoluchowski).

However, the great time of probability in physics is

linked to quantum theory (started by Max Planck,

1900). There, discontinuity is essential (in contrast to

continuity—determinism—differential equations, the

domain of classical physics). In the new microphysics,

differential equations connect probability magnitudes.

Probability permeates the whole world of micro-

physics.

Another important field of application of probability

is genetics. The beginning of our century saw the

reawakening of Mendel's almost forgotten probability

theory of genetics which keeps growing in breadth as

well as in depth.

*7.* We return to probability as a piece of mathe-

matics proper. Early, in Russia, P. L. Chebychev

(1821-94) carried on brilliantly the work of Laplace.

His student, A. A. Markov investigated various aspects

of nonindependent events. In particular, the “Markov

chains,” which play a great role in mathematics as well

as in physics, are still vigorously studied today. The

great time of mathematical probability continued in

Russia and re-emerged in France, and other countries.

Paul Lévy initiated the theory of so-called “stable”

distributions. De Finetti introduced the concept of

“infinitely divisible” distributions, a theory forcefully

developed by P. Lévy, A. N. Kolmogorov, A.

Khintchine, and others. These are but a few examples.

Probability became very attractive to mathematicians,

who felt more and more at home in a subject whose

structure seemed to fit into real analysis, in particular,

measure theory (subsection 8, below).

It became also apparent that methods which proba-

bility had developed lead to results in purely mathe-

matical fields. In M. Kac's book *Statistical Inde-
pendence in Probability, Analysis and Number Theory*

(New York, 1959) chapter headings like “Primes play

a game of chance” or “The Normal Law in number

theory” exhibit connections by their very titles. “Prob-

ability theory,” comments M. Kac (in an article in

*The Mathematical Sciences. A Collection of Essays,*

Cambridge [1969], p. 232), “occupies a unique position

among mathematical disciplines because it has not yet

grown sufficiently old to have severed its natural ties

with problems outside of mathematics proper, while

at the same time it has achieved such maturity of

techniques and concepts it begins to influence other

branches of mathematics.” (This is certainly true for

probability, but it is less certain that it applies

*only*

to probability.)

*8.* The impressive mathematical accomplishments of

probability, along with its growing importance in

scientific thought, led to the realization that a purely

mathematical foundation of sufficient generality, and,

if possible, in axiomatic form, was desirable. Vari-

ous attempts in this direction culminated in A. N.

Kolmogorov's *Grundbegriffe der Wahrscheinlichkeits-
rechnung* (Berlin, 1933). Kolmogorov's aim was to

conceive the basic concepts of probability as ordinary

notions of modern mathematics. The basic analogy is

between “probability” of an “event” and “measure”

of a “set,” where set and measure are taken as general

and abstract concepts.

*Measure* is a generalization of the simple concepts

of “length,” “area,” etc. It applies to point sets which

may be much more general than an interval or the

inside of a square. The generalization from “length”

to “Jordan content” to “Lebesgue measure” is such that

to more and more complicated sets of points a measure

is assigned. In a parallel way the “Cauchy integral”

has been generalized to the “Riemann integral” and

to the “Lebesgue integral.”

*9.* In what precedes, our label space *S* has been a

finite or countable set of points in an interval. For

Kolmogorov, the label space is a general set *S* of “ele-

ments” and *T,* a “field” consisting of subsets of *S,* is

the field of “elementary events” which contains also

*S* and the empty set ∅. To each set *A* of *S* is associated

a nonnegative (*n.n.*) number *P*(*A*) between 0 and 1,

called the *measure* or the *probability* of *A* and *P*(*S*) = 1,

P(∅) = 0. Suppose now first that *T* contains only finitely

many sets. If a subset *A* of *T* is the sum of *n* mutually

exclusive sets *Ai* of *T,* i.e., A = A1 + A2 + ... + An

then it is assumed that *P*(A) = *P*(A1) + *P*(A2) + ... +

*P*(An) and *P* is then called an *additive set function* over

*T.* The above axioms define a *finite probability field.*

An example: *S* is any finite collection of points, e.g.,

1, 2, 3,..., 99, 100. To each of these integers corre-

sponds a nonnegative number *pi* between 0 and 1 such

that the sum of all these *pi* equals 1. *T* consists of *all*

subsets of this *S* and to a set *A* of *T* consisting of the

numbers *i*1, *i*2,..., *ir* the *P*(*A*) is the sum of the *r*

probabilities of these points. This apparently thin

framework is already rather general since *S, T,* and

*P* underlie only the few mentioned formal restrictions.
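A minimal sketch of such a finite probability field, using the example just given (the equal weights 1/100 are one illustrative choice; any nonnegative weights summing to 1 would do):

```python
from fractions import Fraction

# S = {1, ..., 100}; T = all subsets of S; each point i carries weight p_i.
S = set(range(1, 101))
p = {i: Fraction(1, 100) for i in S}

def P(A):
    """Probability of the event A: the sum of the weights of its points."""
    return sum(p[i] for i in A)

A1, A2 = {1, 2, 3}, {10, 20}       # two disjoint events
print(P(S) == 1)                   # True: P(S) = 1
print(P(A1 | A2) == P(A1) + P(A2)) # True: additivity over disjoint sets
```

The few formal restrictions — P(S) = 1 and additivity over disjoint sets — are all that the axioms demand of *S, T,* and *P.*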

*10.* Kolmogorov passes to *infinite probability fields,*

where *T* may contain infinitely many sets. If now a

subset *A* of *T* is a sum of countably many disjoint sets

*Ai* of *T,* i.e., A = A1 + A2 + ...,, then it is assumed

that *P*(A) = *P*(A1) + *P*(A2) + ... and *P* is called a *completely additive* or σ-*additive* set function. A σ-*field* is defined in mathematics by the property that *all countable sums of sets A*i *of the field belong likewise to it.* It seems desirable to Kolmogorov to demand that the σ-additive set functions of probability calculus be defined on σ-fields. The simplest example

of such an infinite probability field is obtained by taking for *S* a countable collection of points, for example, the positive integers, and assigning to each a n.n. number *p*i such that ∑*p*i = 1. For *T* one takes *all* subsets of *S* and for a set *A* of *T* as its probability *P*(*A*) the sum of the *p*i of the points which form *A.* Another most important example is obtained by choosing a n.n. function *f*(*x*), called *probability density,* defined in an interval (*a, b*) [or even in (-∞, +∞)] such that ∫_a^b *f*(*x*) *dx* = 1. *T* is an appropriate collection of sets *A* in (*a, b*), for example, the so-called Borel sets, and *P*(*A*) = ∫_A *f*(*x*) *dx*. The integrals in these definitions and computations are Lebesgue integrals.
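For the density example, a sketch with an invented density *f*(*x*) = 2*x* on (0, 1) — not from the text — and a midpoint-rule approximation to the integral defining *P*(*A*):

```python
# Invented density f(x) = 2x on (0, 1); its integral over (0, 1) is 1.
def f(x):
    return 2 * x

def P(a, b, steps=100_000):
    """Midpoint-rule approximation of the integral of f over the interval (a, b)."""
    h = (b - a) / steps
    return sum(f(a + (i + 0.5) * h) for i in range(steps)) * h

print(round(P(0.0, 1.0), 6))   # 1.0  -- total probability
print(round(P(0.2, 0.5), 6))   # 0.21 -- exactly 0.5**2 - 0.2**2
```

For intervals the numerical integral agrees with the elementary antiderivative *x*^2; the Lebesgue machinery is needed only when *A* is a more complicated Borel set.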

Such probability fields may now be defined also in

the plane and in three-dimensional, or *n*-dimensional

space.

The next generalization concerns *infinitely-dimen-
sional* spaces where one needs a countable number of

coordinates for the definition of each elementary event.

The above indications give an idea of the variety

and generality of Kolmogorov's probability fields. His

axiomatization answered the need for a foundation

adapted to the mathematical aspect of probability. The

loftiness of the structure provides ample room to fill

it with various contents.

*11.* These foundations are not in competition with

those of von Mises. Kolmogorov *axiomatizes the math-
ematical principles of probability calculus,* von Mises

*characterizes probability as an idealized frequency in a random sequence.* Ideally, they should complement each other. However, the integration of the two aspects

is far from trivial (Section IX).

One must also remain conscious of the fact that from

formal definitions and assumptions which the above

axioms offer, only formal conclusions follow, and this

holds no matter how we choose the *S, T,* and *P* of

subsections 9 and 10. In measure theories of probability

the relation to frequency and to randomness is often

introduced as a more or less vague afterthought which

neglects specific difficulties. On the other hand, a

definition like Mises' cannot replace the fixing of the

axiomatic framework and the measure-theoretical

stringency. We shall return to these points of view and

problems in our last section.

*IX. SOME RECENT DEVELOPMENTS*

*1.* In Section VII, 4-8 we introduced and explained

the concept of probability as an *idealized frequency.*

In Section VIII, 8-10 we indicated an *axiomatic set*-

*theoretical framework* of probability theory. We have

seen in this article that these two aspects—frequency

and abstract-mathematical theory—were present from

the seventeenth century on. However, this duality was

not considered disturbing. We have only to think of

Laplace: his aprioristic probability definition, his

mathematics of probability and his work on appli-

cations (for both of which his definition was often not

a sufficient basis) coexisted peacefully for more than

a hundred years although in some respects not consist-

ent with one another. It is only in this century that

the Laplacean framework was found wanting. The

erosion started from both ends: the scientists using

probability and statistics found Laplace's concept

insufficient, and the development of mathematics

greatly outstripped Laplacean rigor. Clarity about prob-

ability as a branch of mathematics, on the one hand,

and of its relation to physical phenomena, on the other

hand, was reached only in the twentieth century. These

two aspects are rightly associated with the names of

Kolmogorov and von Mises.

*2.* It would be a mistake to think that either von

Mises or Kolmogorov negated or were not conscious

of the problems arising from this duality. It might be

more adequate to say that each man considered the

questions connected with the other aspect as somehow

of second order and not in need of strong intellectual

effort on his part. We illustrate this point by examples.

We remember that von Mises' collective is defined

by two postulates: (α) existence of frequency limits,

(β) insensitivity to place selections. His work intro-

duces a wealth of clarifying concepts, also of a purely

mathematical nature, which are used today by most

probabilists. In places, however, mathematical preci-

sion was lacking; we mention two instances.

As the first one we recall the difficulty reported and

discussed in Section VII, 8. The second concerns a gap

that has hardly been referred to by the critics of von

Mises' system, namely that his collective, in its original

form, applied only to the *discrete* label space, a space

consisting of a finite or countable number of points.

A *continuous* label space contains as subsets a wide

variety of *sets of points.* In most, if not in all of his

publications, von Mises does not bother about the

adaption of his theory to general point sets, but con-

siders this an obvious matter once the concept of

collective has been explained. (He spoke, for example,

of “all practically arising sets.”) We shall return to this

matter in subsections 4 and 5 below.

Kolmogorov's set-theoretical foundations were

accepted gladly by the majority of probabilists as the

definitive solution of the problem of foundations of

probability. With respect to the interpretation of his

abstract probability concept Kolmogorov points to frequency theory. However, within the framework of Kolmogorov's

theory this interpretation meets serious difficulties.

Kolmogorov's theory is built on Lebesgue's measure

theory. Now it can be shown that *a frequency inter-
pretation of probability* (whose desirability Kolmogorov

emphasizes) *is mathematically incompatible with the use of Lebesgue's theory.* One cannot have it both ways: Lebesgue-Kolmogorov generality *is not consistent with a frequency interpretation.*

*3.* Von Mises' label space was too unsophisticated.

Kolmogorov's mathematics is too general to admit

always a frequency interpretation (and no other inter-

pretation is known) of his probability. Analysis of these

shortcomings should lead to a more unified theory. The

following is a report on some attempts in this direction.

As stated in Section VII, 8, Wald has proved—under

certain conditions—the consistency of the concept of

collective. Being a student both of von Mises and of the set theoretician K. Menger, Wald in the course of

this work could not fail to discern those fields of sets

to which a probability with frequency meaning can

be assigned. Before Wald, E. Tornier, the mathema-

tician, presented an axiomatic structure, different from

both von Mises' and Kolmogorov's, and compatible

with frequency interpretation. H. Geiringer, much

influenced by Tornier and Wald, took a fairly elemen-

tary starting point where concepts like “decidable” and

“verifiable” play a role. (The following paper by

Geiringer is easily accessible and contains all the

quotations, on pp. 6 and 15, of Wald's and Tornier's

works, which are all in German: H. Geiringer, “Proba-

bility Theory of Verifiable Events,” *Archive for Ra-
tional Mechanics and Analysis,* 34 [1969], 3-69.)

*4.* (a) Our eternal die (true or biased) is tossed and

we ask for the probability that in *n* = 100 tosses “ace”

will turn up at least 20 times. The event *A* under

consideration is “at least 20 aces in 100 tosses.” (The

problem is of the type of that of Monsieur de Méré,

discussed in Section I, 3.) The single “trial” consists

of at least 20 and at most 100 tosses. If in such a trial

“ace” turns up at least 20 times we say that the “event”

(or the set) *A* has emerged; otherwise non-*A = A*′

resulted. Clearly, after *each* trial we know with cer-

tainty whether the result is *A* or *A*′. Hence, repeating

the trial *n* times we obtain *nA/n,* the frequency of *A,*

which we take as an approximation to *P*(*A*). Problems

like (a) are strictly *decidable.*
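As a modern illustration (not part of the original text), the frequency approximation to *P*(*A*) for a decidable event like *A* can be sketched in Python; the fair-die bias `p_ace` and the trial counts here are assumptions chosen for the sketch:

```python
import random

random.seed(0)

def trial(p_ace=1/6, n_tosses=100, threshold=20):
    """One decidable trial: does 'ace' turn up at least 20 times in 100 tosses?"""
    aces = 0
    for _ in range(n_tosses):
        if random.random() < p_ace:
            aces += 1
        if aces >= threshold:      # decision may be reached before toss 100
            return True
    return False                   # decided at toss 100 at the latest

# Repeating the trial n times gives nA/n, the frequency of A,
# as an approximation to P(A); for a fair die this comes out near 0.22.
n = 100_000
freq_A = sum(trial() for _ in range(n)) / n
```

Each trial terminates after at most 100 tosses with a definite verdict, which is exactly what makes the frequency `freq_A` a legitimate approximation to *P*(*A*).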

(b) Next remember the elementary concept of a

rational number: it is the quotient of two integers; and

we know that the decimal form of a rational (between

0 and 1, say) is either finite like 0.7, or periodic like

0.333... = 1/3, or 0.142857142857... = 1/7. Call *R*

the set of rationals between 0 and 1 and *R*′ that of

irrationals in this interval. We want a frequency ap-

proximation to *P*(*R*), the “probability of *R.*”

Imagine an urn containing, in equal proportions, lots

with the ten digits 0, 1, 2,..., 9. We draw numbers

out of the urn, note each number before replacing it

and originate in this way a longer and longer decimal

number. The single “trial” consists of as many draws

as needed to decide whether this decimal is rational

or not (belongs to *R* or to *R*′). It is, however, *impossible*

to reach this decision by a finite number of draws—and

we cannot make infinitely many. If after *n* = 10,000

draws a “period” has not emerged it may still emerge

later; if some period seems to have emerged the next

draw could destroy it. *Not one single trial leads to a
decision.* The problem is *undecidable.*

*5.* In Lebesgue's theory, *R* has a measure (equal to

0). But assigning this measure ∣R∣ to *R* as its probability means renouncing any frequency interpretation

of this “probability.” A probability should be “verifi-

able,” i.e., an approximation by means of a frequency

should be in principle possible. But any attempt to

verify ∣R∣ fails. The conclusion (Tornier, Geiringer)

is that to the set *R* (and to *R*′) *no probability can be
assigned* in a frequency theory. This is not a quibble

about words but a genuine and important distinction.

If somebody wants to call ∣R∣ a probability then we

need a new designation like “genuine probability” for

sets like those in (a).

It is easy to characterize mathematically sets like *R*

which have measure but not a verifiable probability.

However, such a description would not be of much

help to the nonmathematician.

(c) *There is a third class of sets which are more
general than* (a) *but admit verifiable probabilities.* It

is this class of sets which, in von Mises' theory, should

have been added to class (a). Again we have to forego

a mathematical characterization.

*6.* Von Mises dealt exclusively with sets of the

strictly decidable type (a). This does not imply, however,

that a von Mises probability can never be ascribed to

a continuous manifold. Consider, e.g., an interval or

the area of a circle. An area *as a whole is verifiable.*

Imagine a man shooting at a target. By assigning

numbers to concentric parts of the target, beginning

with “1” for the bull's eye including its circular bound-

ary, and ending with the space outside the last ring,

we can characterize each shot by a number, and we

have a problem similar to that of tossing dice.

Similarly, on a straight line a label space may consist,

for example, of the interval between 0 and 10. We

can then speak of the *probability* of the interval (2.5,

3.7) or any other interval in (0, 10). These are problems

way (Section VIII, 10). We ought to understand that

the total interval (0, 1), say, *has* a probability, but

certain *point sets* in (0, 1), like *R* or *R*′, are of type

(b) and have no probability, although they have

Lebesgue measure. The distinction which we sketched

here very superficially (subsections 4 and 5, above),

shows in what direction von Mises' theory should be

extended beyond its original field and up to certain

limits. But these same bounds should also restrain the

generality of measure theories of probability insofar

as these are to admit frequency interpretation.
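The verifiability of an interval probability, as opposed to that of the point set *R*, can also be shown numerically (a modern sketch; the uniform distribution on the label space (0, 10) is an assumption made for the illustration):

```python
import random

random.seed(1)

# Label space: the interval (0, 10). Each trial draws one point, and the
# question 'did it fall in (2.5, 3.7)?' is decided immediately -- so the
# probability of the interval is verifiable by a frequency.
n = 200_000
hits = sum(1 for _ in range(n) if 2.5 < random.uniform(0, 10) < 3.7)
freq = hits / n        # approaches (3.7 - 2.5) / 10 = 0.12

# Membership in R (rationality of the drawn point) is NOT decided by any
# finite observation, so no analogous frequency for P(R) can be formed.
```

The interval as a whole is verifiable in a single observation; the point set *R* inside it is not, which is precisely the distinction drawn in subsections 4 and 5.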

*7.* Reviewing the development we can no longer

feel that the measure-theoretical axiomatics of proba-

bility has solved all riddles. It has fulfilled its purpose

to establish probability calculus as a regular branch

of mathematics but it does not help our understanding

of randomness, of degree of certainty, of the Monte

Carlo method, etc.

It thus seems remarkable, but understandable, that in

1963 Kolmogorov himself again took up the concept

of randomness. He salutes the frequency concept of

probability, “the unavoidable nature of which has been

established by von Mises.” He then states that for many

years he was of the opinion that infinite random se-

quences (as used by von Mises) are “not close enough

to reality” while “finite random sequences cannot

admit mathematization.” He has, however, now found

a formalization of finite random sequences and pre-

sents it in this paper. The results are interesting but, of

necessity, rather meager.

Further investigations on random sequences by

R. J. Solomonoff, P. Martin-Löf, Kolmogorov, G. J.

Chaitin, D. W. Loveland, and, particularly, C. P.

Schnorr are in progress. These investigations (which

are also of interest to other branches of mathematics)

use the concepts and tools of mathematical logic. The

new random sequences point of necessity back to von

Mises' original ideas, and some of these investigations

successfully study the links between the various concepts.

*8.* In this section we have sketched two aspects of

recent development. The first one concerned attempts

to work out the mathematical consequences of the

postulate (or assumption) that a frequency inter-

pretation of probability is possible. This postulate,

basic in von Mises' theory, had been considered by

Kolmogorov as rather obvious and not in need of par-

ticular study. Our second and last subject gave a few

indications regarding the analysis of randomness in

terms of mathematical logic. The problems and results

considered here in our last section seem to point to-

wards a new synthesis of the basic problems of proba-

bility theory.

*BIBLIOGRAPHY*

Jakob (James) Bernoulli, *Ars conjectandi* (Basel, 1713; Brussels, 1968). R. Carnap, *Logical Foundations of Probability* (Chicago, 1950). H. Cramér, *Mathematical Methods of Statistics* (Princeton, 1946). F. N. David, *Games, Gods, and Gambling* (New York, 1962). R. L. Ellis, *On the Foundations of the Theory of Probability* (Cambridge, 1843). J. M. Keynes, *A Treatise on Probability* (London, 1921). A. N. Kolmogorov, *Grundbegriffe der Wahrscheinlichkeitsrechnung* (Berlin, 1933). D. V. Lindley, *Introduction to Probability and Statistics from a Bayesian Viewpoint* (Cambridge, 1965). R. von Mises, *Wahrscheinlichkeit, Statistik und Wahrheit* (Vienna, 1928), trans. as *Probability, Statistics and Truth,* 3rd ed. (New York, 1959); idem, *Wahrscheinlichkeitsrechnung und ihre Anwendung in der Statistik und theoretischen Physik* (Vienna, 1931), trans. as *Mathematical Theory of Probability and Statistics,* ed. and supplemented by Hilda Geiringer (New York, 1964). E. Nagel, “Principles of the Theory of Probability,” *International Encyclopedia of Unified Science* (Chicago, 1939), I, 6. C. S. Peirce, “The Doctrine of Chances,” *Popular Science Monthly,* 12 (1878), 604-15; idem, “A Theory of Probable Inference,” reprinted in *Collected Papers* (Boston, 1883), II, 433-77. H. Reichenbach, *The Theory of Probability* (Istanbul, 1934; 2nd ed. Los Angeles, 1949). I. Todhunter, *A History of the Mathematical Theory of Probability, From the Time of Pascal to that of Laplace* (Cambridge, 1865; reprint New York, 1931). J. Venn, *The Logic of Chance* (London, 1866). E. T. Whittaker and G. Robinson, *The Calculus of Observations* (London, 1932).

HILDA GEIRINGER

[See also Certainty; Chance; Determinism; Game Theory; Primitivism; Progress in the Modern Era; Pythagorean...; Rationality; Utopia.]
