
PROBABILITY: OBJECTIVE THEORY

I. THE BEGINNING

1. Games and gambling are as old as human history.
It seems that gambling, a specialty of the human spe-
cies, was spread among virtually all human groups. The
Rig Veda, one of the oldest known poems, mentions
gambling; the Germans of Tacitus' times gambled
heavily, so did the Romans, and so on. All through
history man seems to have been attracted by uncer-
tainty. We can still observe today that as soon as an
“infallible system” of betting is found, the game is
abandoned or its rules are changed to defeat the system.

While playing around with chance happenings is
very old, attempts towards any systematic investigation
were slow in coming. Though this may be how most
disciplines develop, there appears to have been a par-
ticular resistance to the systematic investigation of
chance phenomena, which by their very nature seem
opposed to regularity, whereas regularity was generally
considered a necessary condition for the scientific un-
derstanding of any subject.

The Greek conception of science was modelled after
the ideal of Euclidean geometry which is supposedly
derived from a few immediately grasped axioms. It
seems that this rationalistic conception limited philos-
ophers and mathematicians well beyond the Middle
Ages. Friedrich Schiller, in a poem of 1795 says of the
“sage”: Sucht das vertraute Gesetz in des Zufalls
grausenden Wundern/Sucht den ruhenden Pol in der
Erscheinungen Flucht
(“Seeks the familiar law in the
dreaded wonders of chance/Looks for the unmoving
pole in the flux of appearances”).

2. However, the hardened gambler, not influenced
by philosophical scruples, could not fail to notice some
sort of long-run regularity in the midst of apparent
irregularity. The use of loaded dice confirms this.

The first “theoretical” work on games of chance
is by Girolamo Cardano (Cardanus), the gambling
scholar: De ludo aleae (written probably around 1560
but not published until 1663). Todhunter describes it
as a kind of “gambler's manual.” Cardano speaks of
chance in terms of the frequency of an event. His
mathematics was influenced by Luca Pacioli.

A contribution by the great Galileo was likewise
stimulated directly by gambling. A friend—probably
the duke of Ferrara—consulted Galileo on the follow-
ing problem. The sums 9 and 10 can each be produced
by three dice through six different combinations,
namely:
9 = 1 + 2 + 6 = 1 + 3 + 5 = 1 + 4 + 4 = 2 + 2 + 5
  = 2 + 3 + 4 = 3 + 3 + 3,
10 = 1 + 3 + 6 = 1 + 4 + 5 = 2 + 2 + 6 = 2 + 3 + 5
  = 2 + 4 + 4 = 3 + 3 + 4,
and yet the sum 10 appears more often than the sum
9. Galileo pointed out that in the above enumeration,
for the sum 9, the first, second, and fifth combination
can each appear in 6 ways, the third and fourth in
3 ways, and the last in 1 way; hence, there are alto-
gether 25 ways out of 216 compared to 27 for the sum
10. It is interesting that the “friend” was able to detect
empirically a difference of 1/108 in the frequencies.
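
Galileo's count is easy to check by brute force. The
following sketch (a modern illustration, not part of the
historical material) enumerates the 6^3 = 216 ordered
outcomes of three dice:

```python
from itertools import product

# Count, for every possible sum of three dice, the number of
# ordered outcomes producing it; Galileo's 25 vs. 27 falls out.
counts = {}
for dice in product(range(1, 7), repeat=3):
    counts[sum(dice)] = counts.get(sum(dice), 0) + 1

print(counts[9], counts[10])            # 25 27
print((counts[10] - counts[9]) / 216)   # 1/108, the observed difference
```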

3. Of the same type is the well-known question
posed to Pascal by a Chevalier de Méré, an inveterate
gambler. It was usual among gamblers to bet even
money that among 4 throws of a true die the “6” would
appear at least once. De Méré concluded that the same
even chance should prevail for the appearance of the
“double 6” in 24 throws (since 6 times 6 is 36 and
4 times 6 is 24). Un problème relatif aux jeux de hasard,
proposé à un austère Janséniste par un homme du
monde a été l'origine du calcul des probabilités
(“A
problem in games of chance, proposed to an austere
Jansenist by a man of the world was the origin of
the calculus of probability”), writes S. D. Poisson in
his Recherches sur la probabilité des jugements...
(Paris, 1837). The Chevalier's experiences with the
second type of bet compared unfavorably with those
in the first case. Putting the problem to Blaise Pascal
he accused arithmetic of unreliability. Pascal writes
on this subject to his friend Pierre de Fermat (29 July
1654): Voilà quel était son grand scandale que lui
faisait dire hautement que les propositions [proportions
(?)] n'étaient pas constantes et que l'arithmétique se
démentait
(“This was for him a great scandal which
made him say loudly that the propositions [propor-
tions (?)] are not constant and that arithmetic is self-
contradictory”).

Clearly, this problem is of the same type as that of
Galileo's friend. Again, the remarkable feature is the
gambler's accurate observation of the frequencies.
Pascal's computation might have run as follows. There
are 6^4 = 1296 different combinations of six signs a, b,
c, d, e, f in groups of four. Of these, 5^4 = 625 contain
no “a” (no “6”) and, therefore, 1296 - 625 = 671
contain at least one “a,” and 671/1296 = 0.518 = p_1 is
the probability for the first bet. A similar computation
gives for the second bet p_2 = 0.491, indeed smaller
than p_1.
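
Pascal's two values can be recomputed in a few lines;
the sketch below is a modern rendering of the arithmetic
just described, not Pascal's own computation:

```python
# Chance of at least one "6" in 4 throws of one die, and of
# at least one "double 6" in 24 throws of two dice.
p1 = 1 - (5 / 6) ** 4       # = 671/1296
p2 = 1 - (35 / 36) ** 24

print(round(p1, 3), round(p2, 3))   # 0.518 0.491 -> p1 > 1/2 > p2
```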

Both Fermat and Pascal, just as had previously
Galileo, found it natural to base their reasoning on
observed frequencies. They were interested in the an-
swers to actual problems and created the simplest
“theory” which was logically sound and explained the
observations.

4. Particularly instructive is another problem exten-
sively discussed in the famous correspondence between
the two eminent mathematicians, the problème des
parties
(“problem of points”), which relates to the
question of the just division of the stake between
players if they decide to quit at a moment when neither
has definitely won. Take a simple case. Two players,
A and B, quit at a moment when A needs two points
and B three points to win. Then, reasons Pascal, the
game will certainly be decided in the course of four
more “trials.” He writes down explicitly the combina-
tions which lead to the winning of A, namely aaaa,
aaab, aabb.
Here, aaab stands for four different ar-
rangements, namely aaab, aaba,... and similarly aabb
stands for six different arrangements. Hence, 1 + 4 +
6 = 11 arrangements out of 16 lead to the winning
of A and 5 to that of B. The stake should, therefore,
be divided in the ratio 11:5. (It is worthwhile men-
tioning that mathematicians like Roberval and
d'Alembert doubted Pascal's solution.)
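
The 11:5 division is easy to verify by enumerating the
2^4 = 16 equally likely continuations (a modern sketch;
a marks a point for A, b a point for B, as in the text):

```python
from itertools import product

# A needs 2 more points, B needs 3; four further trials decide.
# Count the sequences in which A collects at least 2 points.
a_wins = sum(1 for seq in product("ab", repeat=4)
             if seq.count("a") >= 2)
print(a_wins, 16 - a_wins)   # 11 5 -> the stake is divided 11:5
```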

The same results were obtained in a slightly different
way by Fermat. The two greatest mathematicians of
their time, Pascal and Fermat, exchanged their dis-
coveries in undisturbed harmony. In the long letter
quoted above, Pascal wrote to Fermat: Je ne doute plus
maintenant que je suis dans la vérité après le rencontre
admirable où je me trouve avec vous.... Je vois bien
que la vérité est la même à Toulouse et à Paris
(“I do
not doubt any longer that I have the truth after finding
ourselves in such admirable agreement.... I see that
truth is the same in Toulouse and in Paris”). In connec-
tion with such questions Pascal and Fermat studied
combinations and permutations (Pascal's Traité du tri-
angle arithmétique,
1664) and applied them to various
problems.

5. We venture a few remarks regarding the ideas
on probability of the great philosophers of the seven-
teenth century. “Probability is likeness to be true,” says
Locke. “The grounds of it are in short, these two
following. First, the conformity of anything with our
knowledge, observation, and experience. Secondly, the
testimony of others” (Essay concerning Human Under-
standing,
Book IV). This is the empirical viewpoint,
a viewpoint suggested by the observation of gambling
results as well as of deaths, births, and other social
happenings. “But,” writes Keynes in his Treatise on Probability, “in the meantime
the subject had fallen in the hands of the mathe-
maticians and an entirely new method of approach was
in course of development. It had become obvious that
many of the judgments of probability, which we, in
fact, make do not depend upon past experience in a
way which satisfied the canon laid down by the logi-
cians of Port Royal and by Locke” (“La logique ou
l'art de penser
...,” by A. Arnauld, Pierre Nicole, and
others, 1662, called the “Port Royal Logic”). As we
have seen, in order to explain observations, the mathe-
maticians created a theory based on the counting of
combinations.
The decisive assumption was that the
observed frequency of an event (e.g., of the “9” in
Galileo's problem) be proportional to the corre-
sponding relative number of combinations (there,
25/216).

6. We close our description of the first steps in
probability calculus with one more really great name,
though his fame was not due to his contributions to
our subject: Christian Huygens. Huygens heard through
friends about the problem of points but he had diffi-
culty in obtaining reliable information about the prob-
lem and the methods of the two French mathe-
maticians. Eventually, Carcavi sent him the data as
well as Fermat's solution. Fermat even posed to
Huygens further problems which Huygens worked out
and later included as exercises in a work of his own.
In this work, De ratiociniis in ludo aleae (“On reasoning
in games of chance”) of 1657, he organized all he knew
about the new subject. At the end of the work he
included some questions without indicating the method
of solution. “It seems useful to me to leave something
for my readers to think about (if I have any readers)
and this will serve them both as exercises and as a way
of passing the time.” Jakob (James) Bernoulli gave the
solutions and included them in his Ars conjectandi. The
work of Huygens remained for half a century the
introduction to the “Calculus of Probability.”

7. A related type of investigation concerned mor-
tality and annuities. John Graunt started using the
registers of deaths kept in London since 1592, and
particularly during the years of the great plague. He
used his material to make forecasts on population
trends (Natural and Political Observations... upon
the Bills of Mortality,
1661). He may well be considered
as one of the first statisticians.

John de Witt, grand pensioner of Holland, wrote
on similar questions in 1671 but the precise content
of his work is not known. Leibniz was supposed to
have owned a copy and he was repeatedly asked
by Jakob Bernoulli—but without success—to let him
see it.

The year 1693 is the date of a remarkable work by
the astronomer Edmond Halley which deals with life
statistics. Halley noticed also the regularity of the
“boys' rate” (percentage of male births) and other
constancies. He constructed a mortality table, based
on “Bills of Mortality” for the city of Breslau, and a
table of the values of an annuity for every fifth year
of age up to the seventieth.

The application of “chance” in such different do-
mains as games of chance (which received dignity
through the names of Pascal, Fermat, and Huygens)
and mortality impressed the scientific world. Leibniz
himself appreciated the importance of the new science
(as seen in his correspondence with Jakob Bernoulli).
However, he did not contribute to it and he objected
to some of his correspondent's ideas.

II. JAKOB BERNOULLI AND THE
LAW OF LARGE NUMBERS

1. The theory of probability consists, on the one
hand, of the consideration and formulation of problems,
including techniques for solving them, and on the other
hand, of general theorems. It is the latter kind which
is of primary interest to the historian of thought. The
intriguing aspect of some of these theorems is that
starting with probabilistic assumptions we arrive at
statements of practical certainty. Jakob Bernoulli was
the first to derive such a theorem and it will be worth-
while to sketch the main lines of argument, using,
however, modern terminology in the interest of
expediency.

2. We consider a binary alternative (coin tossing;
“ace” or “non-ace” with a die; etc.) to this day called
a Bernoulli trial. If q is the “probability of success,”
p = 1 - q that of “failure,” then the probability of
a successes followed by b failures in a + b trials per-
formed with the same die is q^a p^b. This result follows
from multiplication laws of independent probabilities
already found and applied by Pascal and Fermat. The
use of laws of addition and multiplication of proba-
bilities is a step beyond the mere counting of combina-
tions. It is based on the realization that a calculus exists
which parallels and reflects the observed relations be-
tween frequencies.

The above probability q^a p^b holds for any pattern of
a successes and b failures: fssfffsf.... Lumping to-
gether all of these, writing x for a and a + b = n, we
see that the probability p_n(x) of x successes and n - x
failures regardless of pattern is

p_n(x) = \binom{n}{x} q^x p^{n-x},   x = 0, 1, 2, ..., n,   (II.1)

where \binom{n}{x} is the number of combinations of n things
in groups of x, and the sum of all p_n(x) is 1.
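
In modern terms Eq.(II.1) is the binomial distribution.
A minimal sketch (illustrative only; the numbers n = 10,
q = 1/6 are an arbitrary example):

```python
from math import comb

def p_n(x: int, n: int, q: float) -> float:
    """Probability of x successes in n Bernoulli trials, Eq. (II.1)."""
    return comb(n, x) * q**x * (1 - q)**(n - x)

n, q = 10, 1/6   # e.g., ten throws for "ace"
assert abs(sum(p_n(x, n, q) for x in range(n + 1)) - 1) < 1e-12
print(p_n(2, n, q))   # chance of exactly two aces in ten throws
```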

Often we are more interested in the relative number
z = x/n, the frequency of successes. Then

p_n(x) = p'_n(z) = \binom{n}{nz} q^{nz} p^{n(1-z)}.

This p'_n(z)—that is, the function that gives to every
abscissa z the ordinate p'_n(z)—has a maximum at a point
z_m, called the mode, and z_m is equal to or very close
to q. In the vicinity of z_m the p'_n(z), as function of n,
becomes steeper as n increases.

3. It was Bernoulli's first great idea to consider
increasing values of n and a narrow neighborhood of
q or, in other words, to investigate the behavior of
p′n(z) in the neighborhood of z = q as n increases; this
he did at a time when the interest in the “very large”
and the “very small” was just awakening. Secondly,
he realized that we are not really interested in the
value of p'_n(z) for any particular value z but rather in
the total probability belonging to all z's in an interval.
This interval was to contain q which, as we remember,
is our original success probability and at the same time
the mode of p'_n(z) (for large n) and likewise its so-called
“mean value.”

Now, with ε a very small number, we call P_n the
probability that z lies between q - ε and q + ε, or,
what is the same, that x = nz lies between nq - nε and
nq + nε. For this P_n one obtains easily the estimate

P_n ≥ 1 - pq/(nε^2).   (II.2)

And from this follows immediately the fundamental
property of P_n:

P_n → 1 as n → ∞.
This result can be expressed in words:

Let q be a given success probability in a single trial:
n trials are performed with the same q and under
conditions of independence. Then, no matter how small
an ε is chosen, as the number n of repetitions increases
indefinitely, the probability P_n that the frequency of
success lie between q - ε and q + ε, approaches 1.
(See
Ars conjectandi, Basel [1713], Part IV, pp. 236-37.)
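
The condensation the theorem asserts can be watched
numerically. In the sketch below (illustrative; the log-space
arithmetic merely avoids floating-point underflow for large
n), P_n is computed for q = 1/6 and ε = 0.01:

```python
from math import lgamma, log, exp

def log_pmf(x: int, n: int, q: float) -> float:
    """Logarithm of Eq. (II.1), kept in log-space to avoid underflow."""
    return (lgamma(n + 1) - lgamma(x + 1) - lgamma(n - x + 1)
            + x * log(q) + (n - x) * log(1 - q))

def P_n(n: int, q: float, eps: float) -> float:
    """Probability that the success frequency x/n lies in (q-eps, q+eps)."""
    return sum(exp(log_pmf(x, n, q))
               for x in range(n + 1) if abs(x / n - q) < eps)

for n in (100, 1000, 10000, 100000):
    print(n, round(P_n(n, 1/6, 0.01), 4))   # climbs toward 1, as Bernoulli proved
```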

The above theorem expresses a property of “con-
densation,” namely that with increasing n an increasing
proportion of the total probability (which equals 1) is
concentrated in a fixed neighborhood of the original
q. The term “probability” as used by Bernoulli in his
computations is always a ratio of the number of cases
favorable to an occurrence to the number of all possible
cases. About this great theorem, called today the
“Bernoulli Theorem,” Bernoulli said: “... I had con-
sidered it closely for a period of twenty years, and it
is a problem the novelty of which, as well as its high
utility together with its difficulty adds importance and
weight to all other parts of my doctrine” (ibid.). The
three other parts of the work are likewise very valuable
(but perhaps less from a conceptual point of view).
The second presents the doctrine of combinations. (In
this part Bernoulli also introduces the polynomials
which carry his name.)

4. It will be no surprise to the historian of thought
that the admiration we pay to Bernoulli, the mathe-
matician, is not based on his handling of the conceptual
situation. In addition to the above-explained use of a
quotient for a mathematical probability, his views are
of the most varied kind, and, obviously, he is not con-
scious of any possible contradiction: “Probability cal-
culus is a general logic of the uncertain.... Probability
is a degree of certainty and differs from certainty as
the part from the whole.... Of two things the one
which owns the greater part of certainty will be the
more probable.... We denote as ars conjectandi the
art of measuring (metiendi) the probability of things
as precisely as possible.... We estimate the proba
bilities according to the number and the weight (vis
probandi
) of the reasons for the occurrence of a thing.”
As to this certitude of which probability is a part he
explains that “the certitude of any thing can be con-
sidered objectively and in this sense it relates to the
actual (present, past, or future) existence of the thing
... or subjectively with respect to ourselves and in
this sense it depends on the amount of our knowl-
edge regarding the thing,” and so on. This vague-
ness is in contrast to the modern viewpoint in which,
however, conceptual precision is bought, sometimes
too easily, by completely rejecting uncongenial inter-
pretations.

5. There appears in Bernoulli's work another con-
ceptual issue which deals with the dichotomy between
the so-called direct and inverse problem. The first one
is the type considered above: we know the probability
q and make “predictions” about future observations.
In the inverse problem we tend to establish from an
observed series of results the parameters of the under-
lying process, e.g., to establish the imperfection of a
die. (The procedures directed at the inverse problem
are today usually handled in mathematical statistics
rather than in probability theory proper.) Bernoulli
himself states that his theorem fails to give results in
very important cases: in the study of games of skill,
in the various problems of life-statistics, in problems
connected with the weather—problems where results
“depend on unknown causes which are interconnected
in unknown ways.”

It is a measure of Bernoulli's insight that he not only
recognized the importance of the inverse problem but
definitely planned (ibid., p. 226) to establish for this
problem a theorem similar to the one we formulated
above. This he did not achieve. It is possible that he
hoped to give a proof of the inverse theorem and that
death intercepted him (Bernoulli's Ars conjectandi was
unfinished at the time of his death and was published
only in 1713); or that he was discouraged by critical
remarks of Leibniz regarding inference. It may also
be that he did not distinguish with sufficient clarity
between the two types of problems. For most of his
contemporaries such a distinction did not exist at all;
actually, even an appropriate terminology was lacking.
We owe the first solid progress concerning the inverse
problem to Thomas Bayes. (See Section IV.)

The Bernoulli theorem forms today the very simplest
case of the Laws of Large Numbers (see e.g., R. von
Mises [1964], Ch. IV). The names Poisson, Tchebychev,
Markov, Khintchine, and von Mises should be men-
tioned in this connection. These theorems are also
called “weak” laws of large numbers in contrast to the
more recently established “strong” laws of large num-
bers (due to Borel, Cantelli, Hausdorff, Khintchine,
Kolmogorov) and their generalizations. The “strong”
laws are mainly of mathematical interest.

III. ABRAHAM DE MOIVRE AND THE
CENTRAL LIMIT THEOREM

1. Shortly after the death of Jakob Bernoulli, but
before the publication (1713) of his posthumous work,
books of two important mathematicians, P. R. Mont-
mort (1678-1719) and A. de Moivre (1667-1754),
appeared. These were Montmort's Essai d'analyse sur
les jeux de hasard
(1708 and 1713) and de Moivre's
De mensura sortis... (1711) and the Doctrine of
Chances
(1718 and 1738). We limit ourselves to a few
words on the important work of de Moivre.

De Moivre, the first of the great analytic probabilists,
was, as a mathematician, superior to both Jakob
Bernoulli and Montmort. In addition he had the ad-
vantage of being able to use the ideas of Bernoulli and
the algebraic powers of Montmort, which he himself
then developed to an even higher degree. A charming
quotation, taken from the Doctrine of Chances, might
be particularly appreciated by the reader. “For
those of my readers versed in ordinary arithmetic it
would not be difficult to make themselves masters, not
only of the practical rules in this book but also of more
useful discoveries, if they would take the small pains
of being acquainted with the bare notation of algebra,
which might be done in the hundredth part of the time
that is spent in learning to read shorthand.”

2. In probability proper de Moivre did basic work
on the “duration of a game,” on “the gambler's ruin,”
and on other subjects still studied today. Of particular
importance is his extension of Bernoulli's theorem
which is really much more than an extension. In Sec-
tion II, 3 we called Pn the sum of the 2r + 1 middle
terms of pn(x) where r = nε and pn(x) is given in
Eq.(II.1). In Eq.(II.2) we gave a very simple esti-
mate of Pn. (Bernoulli himself had given a sharper
one but it took him ten printed pages of computa-
tion, and to obtain the desired result the estimate
Eq.(II.2) suffices.)

De Moivre, who had a deep admiration for Bernoulli
and his theorem, conceived the very fruitful idea of
evaluating P_n directly for large values of n,
instead of
estimating it by an inequality. For this purpose one
needs an approximation formula for the factorials of
large numbers.
De Moivre derived such a formula,
which coincides essentially with the famous Stirling
formula.
He then determined Pn “by the artifice of
mechanical quadrature.” He computed particular
values of his asymptotic formula for P_n correct to five
decimals. We shall return to these results in the section
on Laplace. Under the name of the de Moivre-Laplace
formula,
the result, most important by itself, became
the starting point of intensive investigations and far-
reaching generalizations which led to what is called
today the central limit theorem of probability calculus
(Section VIII). I. Todhunter, whose work A History of
the Mathematical Theory of Probability
... (1865)
ends, however, with Laplace, says regarding de Moivre:
“It will not be doubted that the theory of probability
owes more to him than to any other mathematician
with the sole exception of Laplace.” Our discussion
of the work of this great mathematician is compara-
tively brief since his contributions were more on the
mathematical than on the conceptual side. We men-
tion, however, one more instance whose conceptual
importance is obvious: de Moivre seems to have been
the first to denote a probability by one single letter
(like p or q, etc.) rather than as a quotient of two
integers.

IV. THOMAS BAYES AND
INVERSE PROBABILITY

1. Bayes (1707-61) wrote two basic memoirs, both
published posthumously, in 1763 and 1765, in Vols. 53
and 54 of the Philosophical Transactions of the Royal
Society of London.
The title of the first one is: “An
Essay Towards Solving a Problem in the Doctrine of
Chances” (1763). A facsimile of both papers (and of
some other relevant material) was issued in 1940 in
Washington, edited by W. E. Deming and E. C. Molina.
The following is from Molina's comments: “In order
to visualize the year 1763 in which the essay was
published let us recall some history.... Euler, then
56 years of age, was sojourning in Berlin under the
patronage of Frederick the Great, to be followed
shortly by Lagrange, then 27; the Marquis de Con-
dorcet, philosopher and mathematician who later ap-
plied Bayes's theorem to problems of testimony, was
but 20 years old.... Laplace, a mere boy of 14, had
still 11 years in which to prepare for his Mémoires of
1774, embodying his first ideas on the “probability of
causes,” and had but one year short of half a century
to bring out the first edition of the Théorie analytique
des probabilités
(1812) wherein Bayes's theorem
blossomed forth in its most general form.” (See, how-
ever, the end of this section.)

2. We explain first the concept of conditional prob-
ability
introduced by Bayes. Suppose that of a certain
group of people 90% = P(A) own an automobile and
9% = P(A,B) own an automobile and a bicycle. We
call P(B|A) the conditional probability of owning a
bicycle for people who are known to own also a car.
If P(A) ≠ 0, then

P(B|A) = P(A,B) / P(A)   (IV.1)
is by definition the conditional probability of B given
A.
(This will be explained further in Section VII,9.)
In our example

P(B|A) = (9/100) / (90/100) = 1/10;

hence, P(B|A) = 1/10. We may write (IV.1) as

P(A,B) = P(A) · P(B|A).   (IV.2)
The compound probability of owning both a car and
a bicycle equals the probability of owning a car times
the conditional probability of owning a bicycle, given
that the person owns a car. Of course, the set AB is
a subset of the set A.
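
In code, definition (IV.1) is a one-liner; the small check
below (illustrative) reuses the car-and-bicycle numbers of
the example:

```python
def conditional(p_ab: float, p_a: float) -> float:
    """P(B|A) = P(A,B) / P(A), defined only when P(A) > 0 -- Eq. (IV.1)."""
    if p_a <= 0:
        raise ValueError("P(A) must be positive")
    return p_ab / p_a

print(conditional(0.09, 0.90))   # 0.1: one car owner in ten also owns a bicycle
```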

3. We try now to formulate some kind of inverse
to a Bernoulli problem. (The remainder of this section
may not be easy for a reader not schooled in mathe-
matical thinking. A few rather subtle distinctions will
be needed; however, the following sections will again
be easier.) Some game is played n times and n_1
“successes” (e.g., n_1 “aces” in n tossings of a die) are
observed. We consider now as known the numbers n
and n_1 (more generally, the statistical result) and would
like to make some inference regarding the unknown
success-chance of “ace.” It is quite clear that if we
know nothing but n and n_1 and if these numbers are
small, e.g., n = 10, n_1 = 7, we cannot make any
inference. Denote by w_n(x,n_1) the compound proba-
bility that the die has ace-probability x and gave n_1
successes out of n. Then the conditional probability
of x, given n_1, which we call q_n(x|n_1), equals by (IV.1):

q_n(x|n_1) = w_n(x,n_1) / \int_0^1 w_n(x,n_1) dx.   (IV.3)

Here, x is taken as a continuous variable, i.e., it can
take any value between 0 and 1. The \int_0^1 w_n(x,n_1) dx
is our P(A). It is to be replaced by \sum_x w_n(x,n_1) if x is
a discrete variable which can, e.g., take on only one
of the 13 values 0, 1/12, 2/12,..., 11/12, 1.

Let us analyze w_n(x,n_1). With the notation of Sec. II, 1
we obtain

p_n(n_1|x) = \binom{n}{n_1} x^{n_1} (1 - x)^{n-n_1},

the conditional probability of n_1, given that the success
chance (e.g., the chance of ace) has the value x. Therefore,

w_n(x,n_1) = v(x) p_n(n_1|x).   (IV.4)

Here v(x) is the prior probability or prior chance, the
chance—prior to the present statistical investiga-
tion—that the ace-probability has the value x. Sub-
stituting (IV.4) into (IV.3) we have

q_n(x|n_1) = v(x) p_n(n_1|x) / \int_0^1 v(x) p_n(n_1|x) dx,   (IV.5)
where, dependent on the problem, the integral in the
denominator may be replaced by a sum. This is Bayes's
“inversion formula.” If we know v(x) and p_n(n_1|x) we
can compute q_n(x|n_1). Clearly, we have to have some
knowledge of v(x) in order to evaluate Eq.(IV.5). We
note also that the problem must be such that x is a
random variable, i.e., that the assumption of many
possible x's which are distributed in a probability
distribution
makes sense (compare end of Section IV,
6, below).
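
With a discrete prior the inversion formula (IV.5) becomes
a short computation. The sketch below is an illustration;
the 13-point grid follows the text, while the uniform prior
is an assumption of this example only:

```python
from math import comb

def posterior(grid, prior, n, n1):
    """Bayes's inversion formula (IV.5), discrete form: the posterior
    over the possible success chances x after n1 successes in n trials."""
    lik = [comb(n, n1) * x**n1 * (1 - x)**(n - n1) for x in grid]
    w = [v * l for v, l in zip(prior, lik)]   # w_n(x, n1) = v(x) p_n(n1|x)
    total = sum(w)
    return [wi / total for wi in w]

grid = [i / 12 for i in range(13)]    # the 13 values 0, 1/12, ..., 1
prior = [1 / 13] * 13                 # uniform prior: an assumption, not a principle
post = posterior(grid, prior, n=10, n1=7)
best = max(range(13), key=lambda i: post[i])
print(grid[best], round(post[best], 3))   # posterior mode near the frequency 0.7
```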

4. In some problems it may be justified to assume
that v
(x) be constant, i.e., that v has the same value
for all x.
(This was so for the geometric problem which
Bayes himself considered.) Boole spoke of this assump-
tion as a case of “equal distribution of ignorance.”
This is not an accurate denotation since often this
assumption is made not out of ignorance but because
it seems adequate. R. A. Fisher argued with much
passion against “Bayes's principle.” However, Bayes
did not have any such principle. He did not start with
a general formula Eq.(IV.5) and then apply a “princi-
ple” by which v(x) could be neglected. He correctly
solved a particular problem. The general formula,
Eq.(IV.5), is due to Laplace.

How about the v(x) in our original example? Here,
for a body which behaves and looks halfway like a die,
the assumption of constant v(x) makes no sense. If, e.g.,
we bought our dice at Woolworth's we might take v(x)
as a curve which differs from 0 only in the neigh-
borhood of x = 1/6. If we suppose a loaded die another
v(x) may be appropriate. The trouble is, of course, that
sometimes we have no way of knowing anything about
v(x). Before continuing our discussion we review the
facts found so far, regarding Bayes: (a) he was the first
to introduce and use conditional probability; (b) he was
the first to formulate correctly and solve a problem
of inverse probability; (c) he did not consider the gen-
eral problem Eq.(IV.5).

5. Regarding v(x) we may summarize as follows: (a)
if we can make an adequate assumption for v(x) we
can compute q_n(x|n_1); (b) if we ignore v(x) and have
no way to assume it and n is a small or moderate
number we cannot make an inference; (c) Laplace has
proved (Section V, 6) that even if we do not know v(x)
we can make a valid inference if n is large (and certain
mathematical assumptions for v(x) are known to hold).
This is not as surprising as it may seem. Clearly, if
we toss a coin 10 times and heads turns up 7 times
and we know nothing else about the coin, an inference
on the head-chance q of this coin is unwarranted. If,
however, 7,000 heads out of 10,000 turn up, then, even
if this is all we know, the inference that q > 1/2 and
not very far from 0.7 is very probable. The proof of
(c) is really quite a simple one (see von Mises [1964],
pp. 339ff.) but we cannot give it here. We merely state
here the most important property of the right-hand
side of Eq.(IV.5)—writing now q_n(x) instead of q_n(x|n_1).
Independently of v(x), q_n(x) shows the property of con-
densation: as n increases more and more, a conden-
sation about the observed success frequency n_1/n = r.
Indeed the following theorem holds:

If the observation of an n times repeated alternative
has shown a frequency r of success, then, if n is suffi-
ciently large, the probability for the unknown success-
chance to lie between r - ϵ and r + ϵ is arbitrarily
close to unity.

This is called Bayes's theorem, clearly a kind of
converse of Bernoulli's theorem, the observed r playing
here the role of the theoretical q.

6. We consider a closely related problem which
aroused much excitement. Suppose we are in a situa-
tion where we have the right to assume that v(x) =
constant holds, and we know the numbers n and n_1.
By some additional considerations we can then com-
pute the ace-probability P itself as inferred from these
data
(not only the probability q_n(x) that P has a certain
value x), and we find that P equals (n_1 + 1)/(n + 2),
and correspondingly 1 - P = (n - n_1 + 1)/(n + 2).
This formula for P is called Laplace's rule of succession,
and it gives well-known senseless results if applied in
an unjustified way. Keynes in his treatise (p. 82) says:
“No other formula in the alchemy of logic has exerted
more astonishing powers. It has established the exist-
ence of God from the basis of total ignorance and it
has measured precisely the probability that the sun
will rise tomorrow.” This magical formula must be
qualified. First of all, if n is small or moderate we may
use the formula only if we have good reason to assume
a constant prior probability.
And then it is correct. A
general “Principle of Indifference” is not a “good
reason.” Such a “principle” states that in the absence of
any information one value of a variable is as probable
as another. However, no inference can be based on
ignorance. Second, if n and n_1 are both large, then
indeed the influence of the a priori knowledge vanishes
and we need no principle of indifference to justify the
formula. One can, however, still manage to get sense-
less results if the formula is applied to events that are
not random events, for which therefore, the reasoning
and the computations which lead to it are not valid.
This remark concerns, e.g., the joke—coming from
Laplace it can only be considered as a joke—about
using the formula to compute the “probability” that
the sun will rise tomorrow. The rising of the sun does
not depend on chance, and our trust in its rising to-
morrow is founded on astronomy and not on statistical
results.
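
A minimal sketch of the rule (illustrative; remember that
the constant prior must itself be justified) shows how the
data swamp the prior as n grows:

```python
def succession(n1: int, n: int) -> float:
    """Laplace's rule of succession, P = (n1 + 1) / (n + 2),
    valid only under a justified constant prior probability."""
    return (n1 + 1) / (n + 2)

print(succession(7, 10))          # 0.667 -- the prior still matters at n = 10
print(succession(7000, 10000))    # 0.69996 -- practically the frequency 0.7
```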

7. We finish with two important remarks. (a) The
idea of inference or inverse probability, the subject of
this section, is not limited to the type of problems
considered here. In our discussion, p_n(n_1|x) was
\binom{n}{n_1} x^{n_1} (1 - x)^{n-n_1}, but formulas like Eq.(IV.5) can be used
for drawing inferences on the value of an unknown
parameter from v(x) and some p_n, for the most varied
p_n. This is done in the general theory of inference
which, according to Richard von Mises and many
others, finds a sound basis in the methods explained here
(Mises [1964], Ch. X.). The ideas have also entered
“subjective” probability under the label “Bayesian”
(Lindley, 1965). Regarding the unknown v(x) we say:
(i) if n is large the influence of v(x) vanishes in most
problems; (ii) if n is small, and v(x) unknown it may
still be possible to make some well-founded assumption
regarding v(x) using “past experience” (von Mises
[1964], pp. 498ff.). If no assumption is possible then
no inference can be made. (The problem considered
here was concerned with the posterior chance that the
unknown “ace-probability” has a certain value x or falls
in a certain interval. There are, however, other prob-
lems where such an approach is not called for and
where—similarly as in subsection 6—we mainly want
a good estimate of the unknown magnitude on the basis
of the available data. To reach this aim many different
methods exist. R. A. Fisher advanced the “maximum
likelihood” method which has valuable properties. In
our example, the “maximum likelihood estimate”
equals n_1/n, i.e., the observed frequency.)

(b) Like the Bernoulli-de Moivre-Laplace theorem
the Bayes-Laplace theorem has found various exten-
sions and generalizations. Von Mises also envisaged
wide generalizations of both types of Laws of Large
Numbers based on his theory of Statistical Functions
(von Mises [1964], Ch. XII).

V. PIERRE SIMON, MARQUIS DE LAPLACE:
HIS DEFINITION OF PROBABILITY, LIMIT
THEOREMS, AND THEORY OF ERRORS

1. It has been said that Laplace was not so much
an originator as a man who completed, generalized,
and consummated ideas conceived by others. Be this
as it may, what he left is an enormous treasure. In his
Théorie analytique des probabilités (1812) he used the
powerful tools of the new rapidly developing analysis
to build a comprehensive system of probability theory.
(The elements of probability calculus—addition, mul-
tiplication, division—were by that time firmly estab-
lished.) Not all of his mathematical results are of equal
interest to the historian of thought.

2. We begin with the discussion of his well-known
definition of probability as the number of cases favora-
ble to an event divided by the number of all equally
likely cases. (Actually this conception had been used
before Laplace but not as a basic definition.) The
“equally likely cases” are les cas également possibles,
c'est à dire tels que nous soyons également indécis sur
leur existence (“the equally possible cases, that is, such
that we are equally undecided about their existence”;
Essai philosophique, p. 4). Thus, for
Laplace, “equally likely” means “equal amount of
indecision,” just as in the notorious “principle of
indifference” (Section IV, 6). In this definition, the
feeling for the empirical side of probability, appearing
at times in the work of Jakob Bernoulli, strongly in
that of Hume and the logicians of Port Royal, seems
to have vanished. The main respect in which the
definition is insufficient is the following. The counting
of equally likely cases works for simple games of
chance (dice, coins). It also applies to important prob-
lems of biology and—surprisingly—of physics. But for
a general definition it is much too narrow as seen by
the simple examples of a biased die, of insurance prob-
abilities, and so on. Laplace himself and his followers
did not hesitate to apply the rules derived by means
of his aprioristic definition to problems like the above
and to many others where the definition failed. Also
in cases where equally likely cases can be defined,
different authors have often obtained different answers
to the same problem (this result was then called a
paradox). The reason is that the authors choose differ-
ent sets of cases as equally likely (Section VI, 8).

Laplace's definition, though not unambiguous and
not sufficiently general, fitted extensive classes of prob-
lems and drew authority from Laplace's great name,
and thus dominated probability theory for at least a
hundred years; it still underlies much of today's think-
ing about probability.

3. Laplace's philosophy of chance, as expounded in his
Essai philosophique, is that each phenomenon in the
physical world as well as in social developments is
governed by forces of two kinds: permanent and
accidental. In an isolated phenomenon the effect of
the accidental forces may appear predominant. But,
in the long run, the accidental forces average out and
the permanent ones prevail. This is for Laplace a
consequence of Bernoulli's Law of Large Numbers.
However, while Bernoulli saw very clearly the limita-
tions of his theorem, Laplace applies it to everything
between heaven and earth, including the “favorable
chances tied with the eternal principles of reason,
justice and humanity” or “the natural boundaries of
a state which act as permanent causes,” and so on.

4. We have previously mentioned Laplace's contri-
butions to both Bernoulli's and Bayes's problems. It
was de Moivre's (1713) fruitful idea to evaluate Pn
(Section III, 2) directly for large n. There is no need
to discuss here the precise share of each of the two
mathematicians in the De Moivre-Laplace formula.
Todhunter calls this result “one of the most important
in the whole range of our subject.” Hence, for the sake
of those of our readers with some mathematical
schooling we put down the formula. If a trial where
p(0) = p, p(1) = q, p + q = 1, is repeated n times,
where n is a large number, then the probability P_n that
the number x of successes be between

nq - δ\sqrt{2npq} and nq + δ\sqrt{2npq}   (V.1)

or, what is the same, that the frequency z = x/n of
success be between

q - δ\sqrt{2pq/n} and q + δ\sqrt{2pq/n}   (V.1′)

equals asymptotically

P_n = (2/\sqrt{π}) \int_0^δ e^{-t^2} dt + R_n,   (V.2)

where R_n is a correction term of order 1/\sqrt{npq}. Here,
the first term, for which we also write 2Φ(δ), is twice the
famous Gauss integral

Φ(δ) = (1/\sqrt{π}) \int_0^δ e^{-t^2} dt,

or, if δ is considered variable, the celebrated normal
distribution function. For fairly large n the second term
of Eq.(V.2) can be neglected and the first term comes
even for moderate values of δ very close to unity (e.g.,
for δ = 3.5 it equals 1 up to five decimals). The limits
in Eq.(V.1′) can be rendered as narrow as we please
by taking n sufficiently large and P_n will always be
larger than 2Φ(δ).

This is the first of the famous limit theorems of
probability calculus.
Eq.(V.2) exhibits the phenomenon
of condensation (Sections II and IV) about the mid-
point, here the mean value, which means that a proba-
bility arbitrarily close to 1 is contained in an arbitrarily
narrow neighborhood of the mean value.
The present
result goes far beyond Bernoulli's theorem in sharpness
and precision, but conceptually it expresses the same
properties.
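
A numerical check of the approximation (a sketch; it uses
the error-integral convention adopted above, under which
2Φ(δ) = erf(δ), and compares the exact binomial sum with
the first term of Eq.(V.2)):

```python
from math import erf, sqrt, lgamma, log, exp

def exact_Pn(n: int, q: float, delta: float) -> float:
    """Exact probability that x lies within nq +/- delta*sqrt(2npq)."""
    p = 1 - q
    half = delta * sqrt(2 * n * p * q)
    return sum(exp(lgamma(n + 1) - lgamma(x + 1) - lgamma(n - x + 1)
                   + x * log(q) + (n - x) * log(p))
               for x in range(n + 1) if abs(x - n * q) <= half)

n, q, delta = 2000, 1/6, 1.5
print(round(exact_Pn(n, q, delta), 4), round(erf(delta), 4))  # nearly equal
```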

5. Thus, the distribution of the number x of successes
obtained by repetition of a great number of binary
alternatives is asymptotically a normal curve. As pre-
viously indicated more general theorems of this type
hold. If, as always, we denote success by 1, failure by
0, then x = x_1 + x_2 + ... + x_n, where each x_i is either
0 or 1. It is then suggestive to study also cases where
the distributions of the x_1, x_2,..., x_n are not as simple
as in the above problem (Section VIII, 2).

6. We pass to Laplace's limit theorem for Bayes's
problem. Set (Section IV, 3)

Q_n(x) = \int_0^x q_n(t|n_1) dt;

let n tend towards infinity while n_1/n = r is kept fixed.
The difference Q_n(x_2) - Q_n(x_1) is the probability that
the object of our inference (for example, the unknown
“ace”-probability) be between x_1 and x_2. Laplace's limit
result looks similar to Eq.(V.1′) and Eq.(V.2). The
probability that the inferred value lies in the interval

(r - t\sqrt{2r(1-r)/n}, r + t\sqrt{2r(1-r)/n})

tends to 2Φ(t) as n → ∞. Bayes's theorem (Section IV,
5) follows as a particular case. The most remarkable
feature of this Laplace result is that it holds inde-
pendently of the prior probability.
This is proved with-
out any sort of “principle of indifference.” This mathe-
matical result corresponds, of course, to the fact that
any prior knowledge regarding the properties of the
die becomes irrelevant if we are in possession of a large
number of results of ad hoc observations.

7. To appreciate what now follows we go back for
a moment to our introductory pages in Section I. We
said that the Greek ideal of science was opposed
to the construction of hypotheses on the basis of
empirical data. “The long history of science and phi-
losophy is in large measure the progressive emancipa-
tion of men's minds from the theory of self-evident
truth and from the postulate of complete certainty as
the mark of scientific insight” (Nagel, p. 3).

The end of the eighteenth and the beginning of the
nineteenth century saw the beginnings and develop-
ment of a “theory of errors” developed by the greatest
minds of the time. A long way from the ideal of abso-
lute certitude, scientists are now ready to use observa-
tions, even inaccurate ones. Most observations which
depend on measurements (in the widest sense) are liable
to accidental errors. “Exact” measurements exist only
as long as one is satisfied with comparatively crude
results.

8. Using the most precise methods available one still
obtains small variations in the results, for example, in
the repeated measurements of the distance of two fixed
points on the surface of the earth. We assume that this
distance has some definite “true” value. Let us call
it a; it follows that the results x_1, x_2,... of several
measurements of the same magnitude must be incorrect
(with the possible exception of one). We call z_1 =
x_1 - a, z_2 = x_2 - a,... the errors of measurement.
These errors are considered as random deviations
which oscillate around 0. Therefore, there ought to
exist a law of error, that is a probability w(z) of a certain
error z.

It is a fascinating mathematical result that, by means
of the so-called “theory of elementary errors” we ob-
tain at once the form of w(z). This theory, due to Gauss,
assumes that each observation is subject to a large
number of sources of error. Their sum results in the
observed error z. It follows then at once from the
generalization of the de Moivre-Laplace result (Section
V, 5, Section VIII, 3) that the probability of any result-
ing error z follows a normal or Gaussian law

w(z) = (h/\sqrt{π}) e^{-h^2 z^2}.

This h, the so-called measure of precision, is not
determined by this theory. The larger h is, the more
concentrated is this curve around z = 0.
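
The hypothesis of elementary errors is easy to imitate by
simulation (a sketch with arbitrary numbers): sums of many
small independent errors arrange themselves along a bell-
shaped curve.

```python
import random

random.seed(1)

def observed_error(k: int = 100, spread: float = 0.01) -> float:
    """One measurement error as the sum of k small elementary errors."""
    return sum(random.uniform(-spread, spread) for _ in range(k))

errors = [observed_error() for _ in range(10000)]

# Crude check of the bell shape: the share of errors within about one
# standard deviation (here sigma is near 0.058) should be roughly 0.68.
inside = sum(1 for z in errors if abs(z) < 0.058) / len(errors)
print(round(inside, 2))
```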

9. The problem remains to determine the most
probable value of x.
The famous method of least squares
was advanced as a manipulative procedure by
Legendre (1806) and by Gauss (1809). Various attempts
have been made to justify this method by means of
the theory of probability, and here the priority regard-
ing the basic ideas belongs to Laplace. His method was
adopted later (1821-23) by Gauss. The last steps to-
wards today's foundation of the least squares method
are again due to Gauss.
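
In the simplest case, repeated measurement of a single
magnitude, least squares reduces to the arithmetic mean;
a sketch with made-up measurements:

```python
# The value a* minimizing the sum of (x_i - a)^2 is the arithmetic mean:
# setting the derivative to zero gives sum of (x_i - a) = 0.
x = [10.02, 9.98, 10.01, 9.97, 10.03]   # hypothetical repeated measurements
a_star = sum(x) / len(x)

print(a_star)                                    # least-squares estimate
print(round(sum(xi - a_star for xi in x), 12))   # residuals sum to 0
```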

10. Any evaluation of Laplace's contribution to the
history of probabilistic thought must mention his deep
interest in the applications. He realized the applica-
bility of probability theory in the most diverse fields
of man's thinking and acting. (Modern physics and
modern biology, replete with probabilistic ideas, did
not exist in Laplace's time.) In his Mécanique céleste
Laplace advanced probabilistic theories to explain
astronomical facts. Like Gauss he applied the theory
of errors to astronomical and geodetic operations. He
made various applications of his limit theorems. Of
course, he studied the usual problems of human statis-
tics, insurances, deaths, marriages. He considered
questions concerned with legal matters (which later
formed the main subjects of Poisson's great work). As
soon as Laplace discovered a new method, a new
theorem, he investigated its applicability. This close
connection between theory and meaningful observa-
tional problems—which, in turn, originated new theo-
retical questions—is an unusually attractive feature of
this great mind.

VI. A TIME OF TRANSITION

1. The influence of the work of Laplace may be
considered under three aspects: (a) his analytical
achievements which deepened and generalized the
results of his predecessors and opened up new avenues;
(b) his definition of probability which seemed to pro-
vide a firm basis for the whole subject; (c) in line with
the rationalistic spirit of the eighteenth century, a wide
field of applications seemed to have been brought
within the domain of reason. Speaking of probability,
Condorcet wrote: Notre raison cesserait d'être esclave
de nos impressions
(“Our reason would cease to be the
slave of our impressions”).

2. Of the contributions of the great S. D. Poisson
laid down in his Recherches sur la probabilité des
jugements
... (1837), we mention first a generalization
of James Bernoulli's theorem (Section II). Considered
again is a sequence of binary alternatives—in terms
of repeatedly throwing a die for “ace” or “not-ace”—
Poisson abandoned the condition that all throws must
be carried out with the same or identical dice; he
allowed a different die to be used for each throw. If
q(n) denotes the arithmetical mean of the first n ace-
probabilities q_1, q_2,..., q_n, then a theorem like
Bernoulli's holds where now q(n) takes the place of the
previously fixed q. Poisson denotes this result as the
Law of Large Numbers. A severe critic like J. M.
Keynes called it “a highly ingenious theorem which
extends widely the applicability of Bernoulli's result.”
To Keynes's regret the condition of independence still
remains. It was removed by Markov (Section VIII, 7).
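
Poisson's extension is as easy to simulate as Bernoulli's
case (a sketch; the differing “ace”-probabilities below are
arbitrary):

```python
import random

random.seed(2)

n = 100000
qs = [random.uniform(0.05, 0.30) for _ in range(n)]   # a different die each throw
hits = sum(1 for q in qs if random.random() < q)

q_bar = sum(qs) / n   # the arithmetical mean q(n) of the single chances
print(round(hits / n, 3), round(q_bar, 3))   # the frequency hugs q(n)
```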

3. Ever since the time of Bernoulli one could ob-
serve the duality between the empirical aspect of
probability (i.e., frequencies) and a mathematical the-
ory, an algebra, that reflected the relations among the
frequencies. Poisson made an important step by stating
this correspondence explicitly. In the Introduction to
his work he says: “In many different fields we observe
empirical phenomena which appear to obey a certain
general law.... This law states that the ratios of
numbers derived from the observation of very many
similar events remain practically constant provided
that the events are governed partly by constant factors
and partly by variable factors whose variations are
irregular and do not cause a systematic change in a
definite direction. Characteristic values of these pro-
portions correspond to the various kinds of events. The
empirical ratios approach these characteristic values
more and more closely the greater the number of
observations.” Poisson called this law again the Law
of Large Numbers. We shall, however, show in detail
in Section VII that this “Law” and the Bernoulli-
Poisson theorem, explained above, are really two
different statements. The sentences quoted above from
Poisson's Introduction together with a great number
of examples make it clear that here Poisson has in mind
a generalization of empirical results. The “ratios” to
which he refers are the frequencies of certain events
in a long series of observations. And the “characteristic
values of the proportions” are the chances of the
events. We shall see that this is essentially the “postu-
late” which von Mises was to introduce as the
empirical basis of frequency theory (Sections VII, 2-4).

4. Poisson distinguished between “subjective” and
“objective” probability, calling the latter “chance,” the
former “probability” (a distinction going back to
Aristotle). “An event has by its very nature a chance,
small or large, known or unknown, and it has a proba-
bility
with respect to our knowledge regarding the
event.” We see that we are relinquishing Laplace's
definition in more than one direction.

5. Ideas expressed in M. A. A. Cournot's beautifully
written book, Exposition de la théorie des chances et
des probabilités
(Paris, 1843) are, in several respects
similar to those of Poisson. For Cournot probability
theory deals with certain frequency quotients which
would take on completely determined fixed values if
we could repeat the observations indefinitely. Like
Poisson he discerned a subjective and objective aspect
of probability. “Chance is objective and independent
of the mind which conceives it, and independent of
our restricted knowledge.” Subjective probability may
be estimated according to “the imperfect state of our
knowledge.”

6. Almost from the beginning, certainly from the
time of the Bernoullis, it was hoped that probability
would serve as a basis for dealing with problems con-
nected with the “Sciences Morales.” Laplace studied
judicial procedures, the credibility of witnesses, the
probability of judgments. And we know that Poisson
was particularly concerned with these questions.
Cournot made legalistic applications aux documents
statistiques publiés en France par l'Administration de
la Justice.
A very important role in these domains of
thought is to be attributed to the Belgian astronomer
L. A. J. Quételet who visited Paris in 1823 and was
introduced to the mathematicians of La grande école
française,
to Laplace, and, in particular, to Poisson.
Between 1823 and 1873 Quételet studied statistical
problems. His Physique sociale of 1869 contains the
construction of the “average man” (homme moyen).
Keynes judged that Quételet “has a fair claim to be
regarded as the parent of modern statistical methods.”

7. It is beyond the scope of this article to delve
into statistics. Nevertheless, since Laplace, Poisson,
Cournot, and Quételet have been mentioned with re-
spect to such applications, we have to add the great
name of W. Lexis whose Theorie der Massenerschei-
nungen in der menschlichen Gesellschaft
(“Theory of
Mass Phenomena in Society”) appeared in 1877. He
was perhaps the first one to investigate whether,
and to what extent, general series of observa-
tions can be compared with the results of games of
chance, and to propose criteria regarding these ques-
tions. In other words, he inaugurated “theoretical sta-
tistics.” His work is of great value with respect to
methods and results.

8. We return to probability proper. The great pres-
tige of Laplace gave support to his concept of equally
likely events and actually to the “principle of insuffi-
cient reason” (or briefly “indifference principle”) on
which this concept rests (Section IV, 6). The principle
enters the classical theory in two ways: (a) in Laplace's
definition (Section V, 2) and (b) in the so-called Bayes
principle (Section IV, 4). However, distrust of the
indifference principle kept mounting. It is so easy to
disprove it. We add one particularly striking counter-
example where the results are expressed by continuous
variables.

A glass contains a mixture of wine and water and
we know that the ratio x = water/wine lies between
1 and 2 (at least as much water as wine and at most
twice as much water). The Indifference Principle tells
us to assume that to equal parts of the interval (1, 2)
correspond equal probabilities. Hence, the probability
of x to lie between 1 and 1.5 is the same as that to
lie between 1.5 and 2. Now let us consider the same
problem in a different way, namely, by using the ratio
y = wine/water. On the data, y lies between 1/2 and
1, hence by the Indifference Principle, there corre-
sponds to the interval (1/2, 3/4) the same probability as
to (3/4, 1). But if y = 3/4, then x = 4/3 = 1.333... while
before, the midpoint was at x = 1.5. The two results
clearly contradict each other.
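
The contradiction can be exhibited numerically (a sketch):
a uniform distribution over x and a uniform distribution
over y = 1/x assign different probabilities to the same
event.

```python
import random

random.seed(3)
N = 100000

# Event: ratio water/wine x < 1.5. Indifference applied to x on (1, 2):
px = sum(1 for _ in range(N) if random.uniform(1.0, 2.0) < 1.5) / N

# Indifference applied to y = 1/x on (1/2, 1): x < 1.5 means y > 2/3.
py = sum(1 for _ in range(N) if random.uniform(0.5, 1.0) > 2 / 3) / N

print(round(px, 2), round(py, 2))   # about 0.50 vs. about 0.67
```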

For all the admiration of the impressive structure
Laplace had erected—supposedly on the basis of his
definition—the question arose how the mathematicians
managed to derive from abstractions results relevant
to experience. Today we know that the valid objections
against Laplace's equally likely cases do not invalidate
the foundations of probability which are not based on
equally likely cases; we also understand better the
relation between foundations and applications.

9. One way to a satisfactory foundation was to
abandon the obviously unsatisfactory Laplacean de-
finition and to build a theory based on the empirical
aspect of probability, i.e., on frequencies. Careful
observations led again and again to the assumption that
the “chances” were approached more and more by the
empirical ratios of the frequencies. This conception—
which was definitely favored by Cournot—was fol-
lowed by more or less outspoken statements of
R. L. Ellis, and with the work of J. Venn an explicit
frequency conception of probability emerged. This
theory had a strong influence on C. S. Peirce. In respect
to probability Peirce was “more a philosopher than
a mathematician.” The theory of probability is “the
science of logic quantitatively treated.” In contrast to
today's conceptions (Section VII, 5) the first task of
probability is for him to compute (or approximate) a
probability by the frequencies in a long sequence of
observations; this is “inductive inference.” The prob-
lem considered almost exclusively in this article, the
“direct” problem, is his “probable inference.” He
strongly refutes Laplace's definition, and subjective
probability is to be excluded likewise. He has then—
understandably—great difficulty to justify or to deduce
a meaning for the probability of a single event (see
Section IV of Peirce's “Doctrine of Chances”). The
concept of probability as a frequency in Poisson,
Cournot, Ellis, Venn, and Peirce (see also Section VII,
6) appears clearly in von Mises' so-called “first postu-
late” (Section VII, 4). These ideas will be discussed
in the context of the next section.

VII. FREQUENCY THEORY OF PROBABILITY.
RICHARD VON MISES

1. As stated at the end of Section VI, the tendency
developed of using objective frequency as the basis
of probability theory. R. L. Ellis, J. Venn, C. S. Peirce,
K. Pearson, et al. embarked on such an empirical
definition of probability (Section VI, 9 and 3). In this
direction, but beyond them in conceptual clarity and
completeness, went Richard von Mises who published
in 1919 an article “Grundlagen der Wahrscheinlich-
keitsrechnung” (Mathematische Zeitschrift, 5 [1919],
52-99). Probability theory is considered as a scientific
theory in mathematical form like mechanics or
thermodynamics. Its subjects are mass phenomena or
repeatable events, as they appear in games of chance,
in insurance problems, in heredity theory, and in the
ever growing domain of applications in physics.

2. We remember the conception of Poisson given
in Section VI, 3. Poisson maintains that in many differ-
ent fields of experience a certain stabilization of rela-
tive frequencies
can be observed as the number of
observations—of the same kind—increases more and
more. He considered this “Law of Large Numbers,”
as he called it, the basis of probability theory. Follow-
ing von Mises, we reserve “Law of Large Numbers”
for the Bernoulli-Poisson theorem (Sections II, and VI,
2), while the above empirical law might be denoted as
Poisson's law.

3. The essential feature of the probability concept
built on Poisson's Law is the following. For certain
types of events the outcome of a single observation
is (either in principle or practically) not available, or
not of interest. It may, however, be possible to consider
the single case as embedded in an ensemble of similar
cases and to obtain for this mass phenomenon mean-
ingful global statements. This coincides so far with
Venn's notion. The classical examples are, of course,
the games of chance. If we toss a die once we cannot
predict what the result will be. But if we toss it 10,000
times, we observe the emergence of an increasing con-
stancy of the six frequencies.

A similar situation appears in social problems
(observed under carefully specified conditions) such as
deaths, births, marriages, suicides, etc.; in the “random
motion” of the molecules of a gas; or in the inheritance
of Mendelian characters.

In each of these examples we are concerned with
events whose outcome may differ in one or more re-
spects: color of a certain species of flowers; shape of
the seed; number on the upper face of a die; death
or survival between age 40 and 41 within a precisely
defined group of men; components of the velocity of
a gas molecule under precise conditions, and so on.
For the mass phenomenon, the large group of flowers,
the tosses with the die, the molecules, we use provi-
sionally the term collective (see complete definition in
subsection 7, below), and we call labels, or simply
results, the mutually exclusive and exhaustive proper-
ties under observation. In Mendel's experiment on the
color of pea flowers, the labels are the three colors
red, white, and pink. If a die is tossed until the 6
appears for the first time with the number of this toss
as result, the labels are the positive integers. If the
components of a velocity vector are observed the
collective is three-dimensional.

4. Von Mises assumed, like Poisson, that to the various
kinds of repetitive events there correspond characteristic
values which characterize them in respect to the fre-
quency of each label. Take the die experiment: putting
a die into a dice box; shaking the box; tossing the die.
The labels are, for example, the six numbers 1, 2,...,
6 and it is assumed that there is a characteristic value
corresponding to the frequency of the event “6.” This
value is a physical constant of the event (it need, of
course, not be 1/6) and is measured approximately
by the frequency of "6" in a long sequence of such
tosses, which approaches it more and more closely the
longer the sequence of observations. We call it the probability
of “6”
(Poisson says “chance”) within the considered
collective.
If the die is tossed 1,000 times within an
hour we may notice that the frequency of “6” will
no longer change in the first decimal, and if the experi-
ment is continued for ten hours, three decimals, say,
will remain constant and the fourth will change only
slightly. To get rid of the clumsiness of this statement
von Mises used the concept of limit. If in n tosses the
"6" has turned up n6 times we consider

    lim (n → ∞) n6/n = p("6")          (VII.1)

as the probability of "6" in this collective. Similarly,
a probability exists for the other labels. The definition
(VII.1), which essentially coincides with Poisson's, Ellis'
and Venn's assumptions, is often denoted as von Mises'
first postulate.
It is of the same type as one which defines
"velocity" as v = lim (Δt → 0) Δs/Δt, where Δs/Δt is the ratio of
the displacement of a particle to the time used for it.
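
To see definition (VII.1) at work one may imitate the die
experiment on a computer. The following sketch, written in
Python, is our illustration only; the pseudo-random generator,
the seed, and the numbers of tosses are arbitrary choices, not
part of the original exposition:

import random

def frequency_of_six(n_tosses, seed=0):
    """Toss a fair die n_tosses times; return the relative frequency n6/n of '6'."""
    rng = random.Random(seed)
    n6 = sum(1 for _ in range(n_tosses) if rng.randint(1, 6) == 6)
    return n6 / n_tosses

# The longer the sequence of tosses, the closer n6/n settles
# near the characteristic value 1/6 of this (unbiased) die.
for n in (100, 10_000, 1_000_000):
    print(n, frequency_of_six(n))

With a loaded die the same frequencies would settle on some
other constant; the stabilization, not the value 1/6, is the point.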

5. Objections of the type that one cannot make
infinitely many tosses are beside the point. We consider
frequency as an approximate measure of the physical
constant probability, just as we measure temperature
by the extension of the mercury, or density by Δm/Δv
as Δv, the volume of the body, decreases more and more
(always containing the point at which the density is
measured). It is true that we cannot make infinitely
many tosses. But neither do we have procedures to
construct and measure an infinitely small volume and
actually we cannot measure any physical magnitude
with absolute accuracy. Likewise, an infinitely long,
infinitely thin straight line does not “exist” in our real
world; its home is the boundless emptiness of Euclidean
space. Nevertheless, theories based on such abstract
concepts are fundamental in the study of spatial rela-
tions.

We mention a related viewpoint: as in rational
theories of other areas of knowledge it is not the task
of probability theory to ascertain by a frequency ex-
periment the probability of every conceivable event
to which the concept applies, just as the direct meas-
urement of lengths and angles is not the task of geome-
try. Given probabilities serve as the initial data from
which we derive new probabilities by means of the
rules of the calculus of probability. Note also that we
do not imply that in scientific theories probabilities
are necessarily introduced by Eq.(VII.1). The famous
probabilities 1/4, 1/2, 1/4 of the simplest case of Mendel's
theory follow from his theory of heredity and are then
verified (approximately) by frequency experiments. In
a similar way, other theories, notably in physics, provide
theoretical probability distributions
which are then
verified either directly, or indirectly through their
consequences.

6. We have mentioned before that von Mises' con-
ception of a long sequence of observations of the same
kind, and even definition Eq.(VII.1), are not absolutely
new. Similar ideas had been proposed by Ellis, Venn,
and Peirce. Theories of Fechner and of Bruns are
related to the above ideas and so is G. Helm's Proba-
bility Theory as the Theory of the Concept of Collectives

(1902). These works did not lead to a complete theory
of probability since they failed to incorporate some
property of a “collective” which would characterize
randomness. To have attempted this is the original and
characteristic feature of von Mises' theory.

7. If in the throwing of a coin we denote “heads”
by 1 and “tails” by 0 the sequence of 0's and 1's
generated by the repeated throwing of the coin will
be a “random sequence.” It will exhibit an irregular
appearance like 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1,...
and not look like a regular sequence as 0, 1, 0, 1, 0,
1,.... Attempting to characterize a random sequence
von Mises was led to the concept of a place selection.
From an infinite sequence ω: x1, x2, ... of labels an
infinite subsequence ω′: x′1, x′2, ... is selected by means
of a rule which determines univocally for every xv of
ω whether or not it appears in ω′. The rule may depend
on the subscript v of xv and on the values x1, x2, ...,
xv-1 of terms which precede xv, but it must not depend
on xv itself or on subsequent terms. We call a sequence
ω insensitive to a specific place selection s if the fre-
quency limits of the labels which by Eq.(VII.1) exist
in ω, exist again in ω′ and are the same as in ω. The
simplest place selections are the arithmetical ones
where the decision whether or not xv is selected
depends only on v. “Select xv if v is even.” “Select
xv if v is not prime,” etc. Another important type of
selection is to use some of the x's preceding xv. “Select
xv if each preceding term equals 0.” “Select xv if v
is even and three immediately preceding terms are
each equal to 1.” It is clear that such place selections
are “gambling systems” and with this terminology von
Mises' second postulate states that for a random se-
quence no gambling system exists.
Sequences satisfying
both postulates are called collectives or simply random
sequences.
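
In programming terms a place selection is a rule that may
inspect the index v and the terms already seen, but never the
current term. The sketch below is our illustration (the sequence
length and the two selection rules are arbitrary choices); it
applies two such "gambling systems" to a pseudo-random 0-1
sequence:

import random

rng = random.Random(1)
omega = [rng.randint(0, 1) for _ in range(200_000)]  # a long 0-1 sequence

def freq_of_one(seq):
    return sum(seq) / len(seq)

# Arithmetical selection: select x_v if v is even (v counted from 1).
even_places = [x for v, x in enumerate(omega, start=1) if v % 2 == 0]

# Selection by preceding terms: select x_v if the term before it was 1.
# The rule inspects only x_(v-1), never x_v itself or later terms.
after_a_one = [omega[v] for v in range(1, len(omega)) if omega[v - 1] == 1]

print(freq_of_one(omega), freq_of_one(even_places), freq_of_one(after_a_one))
# All three frequencies lie near 1/2: neither system alters the limit.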

8. Von Mises' original formulation (1919, p. 57; see
above, Section VII, 1) seems to imply that he had in
mind insensitivity to all place selections. It can, how-
ever, easily be seen that an unqualified use of the term
“all” or of an equivalent term, leads to contradiction,
a set-theoretical difficulty not noticed by von Mises.
Formulating the second postulate more precisely as
insensitivity to countably many place selections, the
mathematician A. Wald showed in 1937 that the
postulate of randomness in this form and the
postulate of the existence of frequency limits are con-
sistent with one another.
(If “countably many” is specified in an adequate
sense of mathematical logic we may even say: if one
can explicitly indicate one single place selection which
alters the frequency limit of 0, say, then ω is not a
random sequence.) Wald actually proved much more,
namely, that collectives are, so to speak, the rule. A
particular result: almost all (in a mathematical sense)
infinite sequences of 0's and 1's have the frequency
limit
1/2 and exhibit the type of irregularity described
by the second postulate
(von Mises [1964], Appendix
One; or von Mises [1957], p. 92).

9. The concept of sequences which satisfy the two
postulates is only the starting point of the theory. In
his 1931 textbook von Mises showed that from this
starting point by means of precisely defined operations
a comprehensive system of probability theory can be
built. First, the definition yields a reasonable addition
theorem.
Consider the probability P that within one
and the same collective
a result belonging to either
of two disjoint sets A or B is to occur. The corre-
sponding frequency is, in an immediately under-
standable notation, (nA + nB)/n = nA/n + nB/n and
by Eq.(VII.1), in the limit, P = P(A) + P(B).
Previous theories that did not use some concept like
frequency “within one and the same collective” could
not be counted on to provide a correct addition
theorem. Indeed the probabilities of arbitrary "mutually
exclusive" events can have any sum, even greater than
1. We also understand better now the definition of
"conditional probability" introduced in Eq.(IV.1). The
proportion of people who, owning an automobile, also
own a bicycle clearly equals nAB/nA, and if n is the
size of the population under consideration then
nAB/nA = (nAB/n)/(nA/n), and if we take the limits as
n → ∞, Eq.(IV.1) follows. By means of these and other
“operations” new random sequences are derived from
given ones.
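
As a small numerical illustration of this ratio of frequencies—
with counts invented purely for the example—consider a
hypothetical population classified by automobile and bicycle
ownership:

# Hypothetical counts within one and the same collective.
n = 1000          # size of the population
n_A = 400         # own an automobile (event A)
n_AB = 120        # own an automobile and a bicycle (event AB)

# The proportion of automobile owners who also own a bicycle:
# n_AB/n_A = (n_AB/n) / (n_A/n), which tends to Eq.(IV.1) as n grows.
lhs = n_AB / n_A
rhs = (n_AB / n) / (n_A / n)
print(lhs, rhs, abs(lhs - rhs) < 1e-12)   # 0.3 0.3 True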

It is obvious that random sequences are generated
as the results of repeated independent trials. However,
the theory of the collective is by no means limited to
problems of independence. In von Mises (1964, pp.
184-223), under the heading “some problems of non-
independent events,” an outline of a theory of
“arbitrarily linked” (= compatible but dependent)
events is given, followed by applications to Mendelian
heredity where the important concept of a “linkage”
distribution of genes is introduced, and by an introduc-
tion to the theory of “Markov chains,” where the
successive games depend on n conditional proba-
bility-distributions, so-called transition probabilities.
All these problems can be considered within the
framework of von Mises' theory. The key to the under-
standing of this apparent contradiction is, in my opin-
ion, the working with more-dimensional collectives;
p(x, y, z) may well be the probability in a three-
dimensional collective without its being necessarily
equal to p1(x)p2(y)p3(z). If we denote a triple x, y, z
by a single label, then the sequence ω in the randomness definition
of subsection 7, above, is a sequence of such triples,
and the terms occurring in a place selection
are selected by a rule for the triples, while the three
components of the triples can be arbitrarily linked with
each other.

Owing to the initially built-in relations between
basic concepts and observations the theoretical struc-
ture conserves its relation to the real world.

We also note that it is very easy to show that in
cases where Laplace's equally likely cases exist (games
of chance, but also certain problems of biology and
of physics), the von Mises definition reduces to that of
Laplace.

10. We finish by discussing Bernoulli's theorem
(Section II) in terms of Laplace's and of von Mises'
definition. Set, for simplicity, p(0) = p(1) = 1/2. We
have from Eq.(II.1) that pn(x) = C(n, x)(1/2)^n and the
theorem states that with increasing n the proportion
of those sequences of length n for which the frequency
of
0's, n0/n, deviates from 1/2 by less than ε, approaches
unity.
This formulation corresponds to Laplace's
definition. Let us consider it more closely. Take
ε = 0.1; then the just-described interval is (0.4, 0.6) and
we denote, as in Section II, by Pn the probability that
the frequency of 0's out of n results (0's and 1's) be
between 0.4 and 0.6. Now compute, for example, Pn
for n = 10. We find easily P10 = 672/1024 = 0.656. That
means in Laplace's sense that of the 2^10 = 1,024 possi-
ble combinations of two items in groups of ten, 672
have the above property (namely, that for them n0/n
is between 0.4 and 0.6). Likewise we obtain
P1000 = 1.000 and with the classical definition this
means that most of the 2^1000 combinations of two items
in groups of 1,000 have the above property. But since
the days of Bernoulli the result for P1000 has been
interpreted in a different way, saying: "If n is large,
the event under consideration (here 0.4 ≤ n0/n ≤ 0.6)
will occur almost always." This is an unjustified transi-
tion from the combinatorial result—which Laplace's
theory gives—to one about occurrence. The statement
about occurrence can be justified only by defining “a
coin of probability p for heads” in a way which estab-
lishes from the beginning a connection between p and
the frequency of the occurrence of heads; and one must
then adhere to this connection whenever the term prob-
ability occurs.
In von Mises' theory the fact that Pn → 1
means, of course: if groups of n trials are observed very
often then the frequency of those groups which show
an n0/n very close to p tends towards unity. This is
the generally accepted meaning of the law of large
numbers, and it can be obtained only in a frequency theory.
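
The numbers just quoted are easy to reproduce; the following
sketch is a plain combinatorial computation, committed to no
particular foundational view:

from math import comb

def P(n, eps=0.1):
    """Proportion of the 2**n sequences of 0's and 1's of length n
    whose frequency of 0's lies within eps of 1/2 (endpoints included)."""
    lo, hi = n * (0.5 - eps), n * (0.5 + eps)
    favorable = sum(comb(n, k) for k in range(n + 1) if lo <= k <= hi)
    return favorable / 2**n

print(P(10))     # 672/1024 = 0.65625
print(P(1000))   # indistinguishable from 1 to many decimal places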

We recognize now also the difference between
Poisson's law and the Law of Large Numbers. The
latter states much more, namely that the “stabilization”
which according to Poisson's law appears ultimately,
happens in every group of n trials if n is large.
The
reason for this difference is as follows: in von Mises'
theory the law of large numbers follows from Poisson's
law plus randomness, and in the classical theory it
follows from Laplace's definition plus the multi
plication law. In both instances it states more than
Poisson's law.

To summarize: (a) if we use Laplace's definition,
Bernoulli's theorem becomes a statement on binomial
coefficients and says nothing about reality; (b) if we
start out with a frequency definition of probability
(equivalent to Poisson's law) and assume in addition
either an adequate multiplication law or randomness,
then Bernoulli's theorem follows mathematically and
it has precisely the desired meaning; (c) Bernoulli's
theorem goes beyond Poisson's law; (d) often
Bernoulli's theorem has been used as a “bridge” be-
tween Laplace's definition and frequency statements.
This is not possible, because, as stated in (b) above,
we need a frequency definition in order to derive
Bernoulli's theorem with the correct meaning.

11. It would lead us much too far to go beyond
a mere mention of the influential and important
modern statisticians R. A. Fisher, J. Neyman, E. Pear-
son, and others. Their interest is not so much in formu-
lations (both the frequency definition and the classical
viewpoint are used) as in problems of statistical inference
(see the important work of H. Cramér; and von Mises
[1964], Ch. X).

R. Carnap has advanced the concept of a logical
probability which means “degree of confirmation” and
which is similar to Keynes's “degree of rational belief.”
He assigns such probabilities also to nonrepeatable
events, and in his opinion it is this logical probability
which most probabilists have in mind. However,
Carnap accepts also the “statistical” or frequency
definition and he speaks of it as “probability2” while
the logical one is “probability1.” Considerations of
space limit us to only mentioning his theory as well
as Reichenbach's idea (similar to Carnap's) of using a
probability calculus to rationalize induction. We agree
with von Mises in the belief that induction, the transi-
tion from observations to theories of a general nature,
cannot be mathematized. Such a transition is not a
logical conclusion but a creative invention regarding
the way to describe groups of observed facts, an inven-
tion which, one hopes, will stand up in the face of
future observations and new ideas. It may, however,
be altered at any time if there are good reasons of an
empirical or conceptual nature.

VIII. PROBABILITY AS A BRANCH OF
PURE MATHEMATICS

1. The beginning of the twentieth century saw a
splendid development of the mathematics of proba-
bility. A few examples follow which are interesting
from a conceptual point of view.

At the end of Section III and in Section V, 4 we
discussed the de Moivre-Laplace formula, the first
instance of the so-called Central Limit Theorem. In
Eq.(II.1) we denoted by pn(x) the probability to obtain
in n identical Bernoulli trials x 1's and n - x 0's, or
equivalently, to obtain in these n trials the sum x.
Denote by Qn(x) the probability to obtain in the n trials
a sum less than or equal to x; then the de Moivre result
is that the distribution Qn(x) tends asymptotically to-
wards a normal distribution.

2. Generalizations of this result might at first go
in two directions. (a) The single game need not be a
simple alternative, and (b) the n games need not be
identical. (We do not mention here other generaliza-
tions.) Mathematically: denote by Vv(xv) the probability
to obtain in the vth game a result less than or equal
to xv, v = 1, 2, ..., n (this definition holds for a
“discrete” and a “continuous” distribution—regarding
these concepts remember Section IV, 3). One asks for
the probability Qn(x) that x1 + x2 + ... + xn be less
than or equal to x; in particular as n → ∞. The first
general and rigorous theorem of this kind was due to
A. Liapounov in his “Nouvelle forme du théorème sur
la limite de probabilité" (Mémoires de l'Académie des
Sciences, St. Petersbourg,
12 [1901]), who allowed n
different distributions Vv(xv) which satisfy a mild and
easily verifiable restriction. If this “Liapounov condi-
tion” holds, Qn(x) is asymptotically normal just as in
the original de Moivre-Laplace case. In 1922 J. W.
Lindeberg gave necessary and sufficient conditions for
convergence of Qn(x) towards a normal distribution.
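
A crude numerical check of such a theorem is easily set up.
In the sketch below—our construction, with arbitrarily chosen
non-identical games—the vth game is uniform on (0, v), so that
Liapounov's condition holds, and the standardized sum is
compared with the normal law at a single point:

import math, random

def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

rng = random.Random(2)
n, trials = 50, 20_000

# The vth game, uniform on (0, v), has mean v/2 and variance v**2/12.
mu = sum(v / 2 for v in range(1, n + 1))
sigma = math.sqrt(sum(v * v / 12 for v in range(1, n + 1)))

# Empirical distribution of the standardized sum versus the normal law.
below = sum(
    1 for _ in range(trials)
    if (sum(rng.uniform(0, v) for v in range(1, n + 1)) - mu) / sigma <= 1.0
)
print(below / trials, normal_cdf(1.0))  # both close to 0.841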

3. Obviously, this general proposition gives a firm
base to the theory of elementary errors (Section V, 8)
and thus to an important aspect of error-theory. Gauss
applied error theory mainly to geodetic and astronom-
ical measurements. The theory applies, however, to
instances which have nothing to do with “errors of
observations” but rather with fluctuations, with varia-
tions among results, as, for example, in the measure-
ment of the heights of a large number of individuals.
(Many examples may be found in C. V. Charlier,
Mathematische Statistik..., Lund, 1920.) Apart from
its various probabilistic applications the Central Limit
Theorem is obviously a remarkable theorem of analysis.

4. We turn to considerations which lie in a very
different direction. We remember that in the derivation
of Bernoulli's theorem we used the fundamental con-
cept of probabilistic (or “stochastic”) independence.
Independence plays a central role in probability the-
ory. It corresponds to the daily experience that we
may, for example, assume that trials performed in
distant parts of the world do not influence each other.
In the example of independent Bernoulli trials it means
mathematically that the probability of obtaining in n
such trials x heads and n - x tails in a given order
equals p^x q^(n-x), where p denotes the probability of
heads and q = 1 - p that of tails.

In 1909, É. Borel, the French mathematician, gave
a purely mathematical illustration of independence.
Consider an ordinary decimal fraction, e.g., 0.246.
There exist 1,000 such numbers with three digits, as
0.000, 0.001,..., 0.999. The Laplacean probability
of the particular number 0.246 equals therefore 1/1000,
or (Π denoting "probability"): Π(d1 = 2, d2 = 4,
d3 = 6) = 1/1000, where di means the ith decimal digit.
Now obviously: Π(d1 = 2) = 1/10, Π(d2 = 4) = 1/10,
Π(d3 = 6) = 1/10, etc. Hence, Π(d1 = 2, d2 = 4, d3 = 6) =
Π(d1 = 2) · Π(d2 = 4) · Π(d3 = 6) and we may then say
with Borel that "the decimal digits are mutually inde-
pendent." The meaning of t = 0.246 is t = x1(t)/10 +
x2(t)/100 + ..., where x1(t) = 2, x2(t) = 4, x3(t) = 6.
Now we define analogously the binary expansion of
a decimal fraction t between 0 and 1, namely
t = ε1(t)/2 + ε2(t)/4 + ε3(t)/8 + ... where the εi(t)
are 0 or 1. For example, 1/3 = 0.0101010.... The
binary digits are mutually independent;
this is proved
just as before for the decimal digits.
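
Borel's argument can be mimicked by direct counting. The
sketch below—ours, using the Laplacean count over the eight
equally likely digit triples of length three—verifies the product
rule for the first two binary digits:

from itertools import product

triples = list(product((0, 1), repeat=3))   # all 2**3 equally likely triples

def prob(event):
    """Laplacean probability: favorable triples over all triples."""
    return sum(1 for t in triples if event(t)) / len(triples)

p1 = prob(lambda t: t[0] == 1)                     # P(eps_1 = 1)
p2 = prob(lambda t: t[1] == 1)                     # P(eps_2 = 1)
p12 = prob(lambda t: t[0] == 1 and t[1] == 1)      # joint probability
print(p1, p2, p12, p12 == p1 * p2)                 # 0.5 0.5 0.25 True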

Now a sequence 0.ε1ε2ε3... is, on the one hand,
the binary expansion of a number t (between 0 and
1) and, on the other hand, a model of the usual game
of tossing heads or tails, if, as always, 0 means tails,
and 1 means heads. Hence this game becomes now a
mathematical object
to which a calculus can be applied
without getting involved with coins, events, dice and
trials. The existence of such a model was apt to calm
the uneasiness felt by mathematicians and, at the same
time, to stimulate the interest in probability.

5. In 1919 von Mises' “Grundlagen” (Section VII,
1) appeared, followed by his books of 1928 and 1931.
His critical evaluation of Laplace's foundations (Sec-
tion V, 2), his distinction between mathematical results
and statements about reality, his introduction of some
basic mathematical concepts (label space, distribution
function, principle of randomness, to mention only a
few) brought about a new interest in the foundations
and, at the same time, pointed a way to an improved
understanding of the applications whose number and
importance kept increasing.

6. A few comments on the most important modern
applications of probability, which, in turn, strengthened
the mathematics of probability, seem in order.
We have seen in our consideration of the theory of
errors that, in the world of macro-mechanics, physical
measurements have only a limited accuracy. It was the
aim reached by Laplace and by Gauss to link error
theory to probability theory. A more essential connec-
tion between probability and a physical theory
emerged when statistical mechanics (Clausius, Max-
well, Boltzmann, and Gibbs) embarked on a proba-
bilistic interpretation of thermodynamical magni-
tudes; in particular, entropy was given in probabilis-
tic terms, and for the first time a major law of nature
was formulated as a statistical proposition. Striking
success of statistical arguments in the explanation of
physical phenomena appeared in the statistical inter-
pretation of Brownian motion (Einstein, Smoluchowski).
However, the great time of probability in physics is
linked to quantum theory (started by Max Planck,
1900). There, discontinuity is essential (in contrast to
continuity—determinism—differential equations, the
domain of classical physics). In the new microphysics,
differential equations connect probability magnitudes.
Probability permeates the whole world of micro-
physics.

Another important field of application of probability
is genetics. The beginning of our century saw the
reawakening of Mendel's almost forgotten probability
theory of genetics which keeps growing in breadth as
well as in depth.

7. We return to probability as a piece of mathe-
matics proper. Early, in Russia, P. L. Chebychev
(1821-94) carried on brilliantly the work of Laplace.
His student A. A. Markov investigated various aspects
of nonindependent events. In particular, the "Markov
chains," which play a great role in mathematics as well
as in physics, are still vigorously studied today. The
great time of mathematical probability continued in
Russia and re-emerged in France, and other countries.
Paul Lévy initiated the theory of so-called “stable”
distributions. De Finetti introduced the concept of
“infinitely divisible” distributions, a theory forcefully
developed by P. Lévy, A. N. Kolmogorov, A.
Khintchine, and others. These are but a few examples.
Probability became very attractive to mathematicians,
who felt more and more at home in a subject whose
structure seemed to fit into real analysis, in particular,
measure theory (subsection 8, below).

It also became apparent that methods which proba-
bility had developed lead to results in purely mathe-
matical fields. In M. Kac's book Statistical Inde-
pendence in Probability, Analysis and Number Theory

(New York, 1959) chapter headings like “Primes play
a game of chance” or “The Normal Law in number
theory” exhibit connections by their very titles. “Prob-
ability theory,” comments M. Kac (in an article in
The Mathematical Sciences. A Collection of Essays,
Cambridge [1969], p. 232), “occupies a unique position
among mathematical disciplines because it has not yet
grown sufficiently old to have severed its natural ties
with problems outside of mathematics proper, while
at the same time it has achieved such maturity of
techniques and concepts it begins to influence other
branches of mathematics.” (This is certainly true for
probability, but it is less certain that it applies only
to probability.)

8. The impressive mathematical accomplishments of
probability, along with its growing importance in
scientific thought, led to the realization that a purely
mathematical foundation of sufficient generality, and,
if possible, in axiomatic form, was desirable. Vari-
ous attempts in this direction culminated in A. N.
Kolmogorov's Grundbegriffe der Wahrscheinlichkeits-
rechnung
(Berlin, 1933). Kolmogorov's aim was to
conceive the basic concepts of probability as ordinary
notions of modern mathematics. The basic analogy is
between “probability” of an “event” and “measure”
of a “set,” where set and measure are taken as general
and abstract concepts.

Measure is a generalization of the simple concepts
of “length,” “area,” etc. It applies to point sets which
may be much more general than an interval or the
inside of a square. The generalization from “length”
to “Jordan content” to “Lebesgue measure” is such that
to more and more complicated sets of points a measure
is assigned. In a parallel way the “Cauchy integral”
has been generalized to the “Riemann integral” and
to the “Lebesgue integral.”

9. In what precedes, our label space S has been a
finite or countable set of points in an interval. For
Kolmogorov, the label space is a general set S of “ele-
ments” and T, a “field” consisting of subsets of S, is
the field of “elementary events” which contains also
S and the empty set ∅. To each set A of T is associated
a nonnegative (n.n.) number P(A) between 0 and 1,
called the measure or the probability of A and P(S) = 1,
P(∅) = 0. Suppose now first that T contains only finitely
many sets. If a set A of T is the sum of n mutually
exclusive sets Ai of T, i.e., A = A1 + A2 + ... + An
then it is assumed that P(A) = P(A1) + P(A2) + ... +
P(An) and P is then called an additive set function over
T. The above axioms define a finite probability field.
An example: S is any finite collection of points, e.g.,
1, 2, 3,..., 99, 100. To each of these integers corre-
sponds a nonnegative number pi between 0 and 1 such
that the sum of all these pi equals 1. T consists of all
subsets of this S, and for a set A of T consisting of the
numbers i1, i2, ..., ir the probability P(A) is the sum of the r
probabilities of these points. This apparently thin
framework is already rather general, since S, T, and
P are subject only to the few formal restrictions mentioned.
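
Such a finite probability field is easily written down explicitly.
In the following sketch (a four-point S with weights of our own
choosing) the events of T are represented as sets and additivity
is checked on a pair of disjoint events:

from itertools import combinations

S = (1, 2, 3, 4)
p = {1: 0.1, 2: 0.2, 3: 0.3, 4: 0.4}   # nonnegative p_i summing to 1

# T: the field of all subsets of S, each represented as a frozenset.
T = [frozenset(c) for r in range(len(S) + 1) for c in combinations(S, r)]

def P(A):
    """The additive set function over T: P(A) is the sum of the p_i in A."""
    return sum(p[i] for i in A)

A, B = frozenset({1, 2}), frozenset({4})        # two disjoint events in T
assert abs(P(A | B) - (P(A) + P(B))) < 1e-12    # additivity: P(A+B) = P(A)+P(B)
print(P(frozenset()), round(P(frozenset(S)), 10), round(P(A | B), 10))  # 0 1.0 0.7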

10. Kolmogorov passes to infinite probability fields,
where T may contain infinitely many sets. If now a
set A of T is a sum of countably many disjoint sets
Ai of T, i.e., A = A1 + A2 + ..., then it is assumed
that P(A) = P(A1) + P(A2) + ... and P is called a
completely additive or σ-additive set function. A so-
called σ-field is defined in mathematics by the property
that all countable sums of sets Ai of the field belong
likewise to it.
It seems desirable to Kolmogorov to
demand that the σ-additive set functions of probability
calculus be defined on σ-fields. The simplest example
of such an infinite probability field is obtained by taking
for S a countable collection of points, for example, the
positive integers, and assigning to each a n.n. number
pi, such that ∑pi = 1. For T one takes all subsets of
S, and for a set A of T one takes as its probability P(A) the sum
of the pi of the points which form A. Another most
important example is obtained by choosing a n.n. func-
tion f(x), called probability density, defined in an in-
terval (a, b) [or even in (- ∞, + ∞)] such that
∫_a^b f(x) dx = 1. T is an appropriate collection of sets
A in (a, b), for example, the so-called Borel sets, and
P(A) = ∫_A f(x) dx. The integrals in these definitions and
computations are Lebesgue integrals.
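
For the second example P(A) can be approximated numerically
when A is an interval; a midpoint sum suffices for a sketch (the
density f below is an arbitrary choice of ours):

def P_interval(f, lo, hi, steps=100_000):
    """Approximate P(A), the integral of the density f over A = (lo, hi),
    by a midpoint Riemann sum."""
    h = (hi - lo) / steps
    return h * sum(f(lo + (k + 0.5) * h) for k in range(steps))

def f(x):
    return 2 * x   # a density on (0, 1): its integral over (0, 1) equals 1

print(P_interval(f, 0.0, 1.0))    # close to 1, as any density must be
print(P_interval(f, 0.25, 0.5))   # close to 0.5**2 - 0.25**2 = 0.1875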

Such probability fields may now be defined also in
the plane and in three-dimensional, or n-dimensional
space.

The next generalization concerns infinite-dimen-
sional
spaces where one needs a countable number of
coordinates for the definition of each elementary event.

The above indications give an idea of the variety
and generality of Kolmogorov's probability fields. His
axiomatization answered the need for a foundation
adapted to the mathematical aspect of probability. The
loftiness of the structure provides ample room to fill
it with various contents.

11. These foundations are not in competition with
those of von Mises. Kolmogorov axiomatizes the math-
ematical principles of probability calculus,
von Mises
characterizes probability as an idealized frequency in
a random sequence.
Ideally, they should complement
each other. However, the integration of the two aspects
is far from trivial (Section IX).

One must also remain conscious of the fact that from
formal definitions and assumptions which the above
axioms offer, only formal conclusions follow, and this
holds no matter how we choose the S, T, and P of
subsections 9 and 10. In measure theories of probability
the relation to frequency and to randomness is often
introduced as a more or less vague afterthought which
neglects specific difficulties. On the other hand, a
definition like von Mises' cannot replace the fixing of the
axiomatic framework and the measure-theoretical
stringency. We shall return to these points of view and
problems in our last section.

IX. SOME RECENT DEVELOPMENTS

1. In Section VII, 4-8 we introduced and explained
the concept of probability as an idealized frequency.
In Section VIII, 8-10 we indicated an axiomatic set-
theoretical framework of probability theory. We have
seen in this article that these two aspects—frequency
and abstract-mathematical theory—were present from
the seventeenth century on. However, this duality was
not considered disturbing. We have only to think of
Laplace: his aprioristic probability definition, his
mathematics of probability and his work on appli-
cations (for both of which his definition was often not
a sufficient basis) coexisted peacefully for more than
a hundred years although in some respects not consist-
ent with one another. It is only in this century that
the Laplacean framework was found wanting. The
erosion started from both ends: the scientists using
probability and statistics found Laplace's concept
insufficient, and the development of mathematics
greatly outstripped Laplacean rigor. Clarity about prob-
ability as a branch of mathematics, on the one hand,
and of its relation to physical phenomena, on the other
hand, was reached only in the twentieth century. These
two aspects are rightly associated with the names of
Kolmogorov and von Mises.

2. It would be a mistake to think that either von
Mises or Kolmogorov negated or were not conscious
of the problems arising from this duality. It might be
more adequate to say that each man considered the
questions connected with the other aspect as somehow
of second order and not in need of strong intellectual
effort on his part. We illustrate this point by examples.

We remember that von Mises' collective is defined
by two postulates: (α) existence of frequency limits,
(β) insensitivity to place selections. His work intro-
duces a wealth of clarifying concepts, also of a purely
mathematical nature, which are used today by most
probabilists. In places, however, mathematical preci-
sion was lacking; we mention two instances.

As the first one we recall the difficulty reported and
discussed in Section VII, 8. The second concerns a gap
that has hardly been referred to by the critics of von
Mises' system, namely that his collective, in its original
form, applied only to the discrete label space, a space
consisting of a finite or countable number of points.
A continuous label space contains as subsets a wide
variety of sets of points. In most, if not in all of his
publications, von Mises does not bother about the
adaptation of his theory to general point sets, but con-
siders this an obvious matter once the concept of
collective has been explained. (He spoke, for example,
of “all practically arising sets.”) We shall return to this
matter in subsections 4 and 5 below.

Kolmogorov's set-theoretical foundations were
accepted gladly by the majority of probabilists as the
definitive solution of the problem of foundations of
probability. With respect to the interpretation of his
abstract probability concept Kolmogorov points
explicitly and repeatedly to von Mises' frequency the-
ory. However, within the framework of Kolmogorov's
theory this interpretation meets serious difficulties.

Kolmogorov's theory is built on Lebesgue's measure
theory. Now it can be shown that a frequency inter-
pretation of probability
(whose desirability Kolmogorov
emphasizes) is mathematically incompatible with the
use of Lebesgue's theory.
One cannot have it both ways:
Lebesgue-Kolmogorov generality is not consistent with
a frequency interpretation.

3. Von Mises' label space was too unsophisticated.
Kolmogorov's mathematics is too general always to admit
a frequency interpretation of his probability (and no other inter-
pretation is known). Analysis of these
shortcomings should lead to a more unified theory. The
following is a report on some attempts in this direction.

As stated in Section VII, 8, Wald has proved—under
certain conditions—the consistency of the concept of
collective. Being both a student of von Mises and of
the set theoretician K. Menger, Wald in the course of
this work could not fail to discern those fields of sets
to which a probability with frequency meaning can
be assigned. Before Wald, E. Tornier, the mathema-
tician, presented an axiomatic structure, different from
both von Mises' and Kolmogorov's, and compatible
with frequency interpretation. H. Geiringer, much
influenced by Tornier and Wald, took a fairly elemen-
tary starting point where concepts like “decidable” and
“verifiable” play a role. (The following paper by
Geiringer is easily accessible and contains all the
quotations, on pp. 6 and 15, of Wald's and Tornier's
works, which are all in German: H. Geiringer, “Proba-
bility Theory of Verifiable Events,” Archive for Ra-
tional Mechanics and Analysis,
34 [1969], 3-69.)

4. (a) Our eternal die (true or biased) is tossed and
we ask for the probability that in n = 100 tosses “ace”
will turn up at least 20 times. The event A under
consideration is “at least 20 aces in 100 tosses.” (The
problem is of the type of that of Monsieur de Méré,
discussed in Section I, 3.) The single “trial” consists
of at least 20 and at most 100 tosses. If in such a trial
“ace” turns up at least 20 times we say that the “event”
(or the set) A has emerged; otherwise non-A = A′
resulted. Clearly, after each trial we know with cer-
tainty whether the result is A or A′. Hence, repeating
the trial n times we obtain nA/n, the frequency of A,
which we take as an approximation to P(A). Problems
like (a) are strictly decidable.
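
Since every trial ends with a definite verdict, the frequency
nA/n can actually be produced. A simulation sketch (fair die
and numbers of trials are our arbitrary choices; a biased die
would merely shift the value obtained):

import random

rng = random.Random(3)

def trial():
    """One trial: 100 tosses of a die; decides whether A = 'at least 20 aces' occurred."""
    aces = sum(1 for _ in range(100) if rng.randint(1, 6) == 1)
    return aces >= 20

n = 50_000
n_A = sum(trial() for _ in range(n))
print(n_A / n)   # the frequency n_A/n, an approximation to P(A)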

(b) Next remember the elementary concept of a
rational number: it is the quotient of two integers; and
we know that the decimal form of a rational (between
0 and 1, say) is either finite like 0.7, or periodic like
0.333... = 1/3, or 0.142857142857... = 1/7. Call R
the set of rationals between 0 and 1 and R′ that of
irrationals in this interval. We want a frequency ap-
proximation to P(R), the "probability of R."

Imagine an urn containing, in equal proportions, lots
with the ten digits 0, 1, 2,..., 9. We draw numbers
out of the urn, note each number before replacing it,
and generate in this way a longer and longer decimal
number. The single "trial" consists of as many draws
as needed to decide whether this decimal is rational
or not (belongs to R or to R′). It is, however, impossible
to reach this decision by a finite number of draws—and
we cannot make infinitely many. If after n = 10,000
draws a “period” has not emerged it may still emerge
later; if some period seems to have emerged the next
draw could destroy it. Not one single trial leads to a
decision.
The problem is undecidable.

5. In Lebesgue's theory, R has a measure (equal to
0). But, assigning this measure ∣R∣ to R as its proba-
bility means renouncing any frequency interpretation
of this “probability.” A probability should be “verifi-
able,” i.e., an approximation by means of a frequency
should be in principle possible. But any attempt to
verify ∣R∣ fails. The conclusion (Tornier, Geiringer)
is that to the set R (and to R′) no probability can be
assigned
in a frequency theory. This is not a quibble
about words but a genuine and important distinction.
If somebody wants to call ∣R∣ a probability then we
need a new designation like “genuine probability” for
sets like those in (a).

It is easy to characterize mathematically sets like R
which have measure but not a verifiable probability.
However, such a description would not be of much
help to the nonmathematician.

(c) There is a third class of sets which are more
general than
(a) but admit verifiable probabilities. It
is this class of sets which, in von Mises' theory, should
have been added to class (a). Again we have to forego
a mathematical characterization.

6. Von Mises dealt exclusively with sets of the
strictly decidable type (a). This, however, does not
imply that a von Mises-probability can be ascribed to
no continuous manifold. Consider, e.g., an interval or
the area of a circle. An area as a whole is verifiable.
Imagine a man shooting at a target. By assigning
numbers to concentric parts of the target, beginning
with “1” for the bull's eye including its circular bound-
ary, and ending with the space outside the last ring,
we can characterize each shot by a number, and we
have a problem similar to that of tossing dice.

Similarly, on a straight line a label space may consist,
for example, of the interval between 0 and 10. We
can then speak of the probability of the interval (2.5,
3.7) or any other interval in (0, 10). These are problems
of type (a), although the data are given in a different
way (Section VIII, 10). We ought to understand that
the total interval (0, 1), say, has a probability, but
certain point sets in (0, 1), like R or R′, are of type
(b) and have no probability, although they have
Lebesgue measure. The distinction which we sketched
here very superficially (subsections 4 and 5, above),
shows in what direction von Mises' theory should be
extended beyond its original field and up to certain
limits. But these same bounds should also restrain the
generality of measure theories of probability insofar
as these are to admit frequency interpretation.

7. Reviewing the development we can no longer
feel that the measure-theoretical axiomatics of proba-
bility has solved all riddles. It has fulfilled its purpose
to establish probability calculus as a regular branch
of mathematics but it does not help our understanding
of randomness, of degree of certainty, of the Monte
Carlo method, etc.

It thus seems remarkable but understandable that in
1963 Kolmogorov himself again took up the concept
of randomness. He salutes the frequency concept of
probability, "the unavoidable nature of which has been
established by von Mises.” He then states that for many
years he was of the opinion that infinite random se-
quences (as used by von Mises) are “not close enough
to reality” while “finite random sequences cannot
admit mathematization.” He has, however, now found
a formalization of finite random sequences and pre-
sents it in this paper. The results are interesting but, of
necessity, rather meager.

Further investigations on random sequences by
R. J. Solomonoff, P. Martin-Löf, Kolmogorov, G. J.
Chaitin, D. W. Loveland, and, particularly, C. P.
Schnorr are in progress. These investigations (which
are also of interest to other branches of mathematics)
use the concepts and tools of mathematical logic. The
new random sequences point of necessity back to von
Mises' original ideas, and some of these investigations
successfully study the links between the various concepts.

8. In this section we have sketched two aspects of
recent development. The first one concerned attempts
to work out the mathematical consequences of the
postulate (or assumption) that a frequency inter-
pretation of probability is possible. This postulate,
basic in von Mises' theory, had been considered by
Kolmogorov as rather obvious and not in need of par-
ticular study. Our second and last subject gave a few
indications regarding the analysis of randomness in
terms of mathematical logic. The problems and results
considered here in our last section seem to point to-
wards a new synthesis of the basic problems of proba-
bility theory.

BIBLIOGRAPHY

Jakob (James) Bernoulli, Ars conjectandi (Basel, 1713;
Brussels, 1968). R. Carnap, Logical Foundations of Proba-
bility
(Chicago, 1950). H. Cramér, Mathematical Methods
of Statistics
(Princeton, 1946). F. N. David, Games, Gods,
and Gambling
(New York, 1962). R. L. Ellis, On the Foun-
dations of the Theory of Probability
(Cambridge, 1843).
J. M. Keynes, A Treatise on Probability (London, 1921). A. N.
Kolmogorov, Grundbegriffe der Wahrscheinlichkeitsrechnung
(Berlin, 1933). D. V. Lindley, Introduction to Probability and
Statistics from a Bayesian Viewpoint
(Cambridge, 1965).
R. von Mises, Wahrscheinlichkeit, Statistik und Wahrheit
(Vienna, 1928); trans. as Probability, Statistics and Truth,
3rd ed. (New York, 1959); idem, Wahrscheinlichkeitsrechnung
und ihre Anwendung in der Statistik und theoretischen
Physik
(Vienna, 1931); trans. as Mathematical Theory of
Probability and Statistics,
ed. and supplemented by Hilda
Geiringer (New York, 1964). E. Nagel, “Principles of the
Theory of Probability,” International Encyclopedia of
Unified Science
(Chicago, 1939), I, 6. C. S. Peirce, “The
Doctrine of Chances,” Popular Science Monthly, 12 (1878),
604-15; idem, "A Theory of Probable Inference," reprinted
in Collected Papers (Boston, 1883), II, 433-77. H. Reichen-
bach, The Theory of Probability (Istanbul, 1934; 2nd ed.
Los Angeles, 1949). I. Todhunter, A History of the Mathe-
matical Theory of Probability, From the Time of Pascal to
that of Laplace
(Cambridge, 1865; reprint New York, 1931).
J. Venn, The Logic of Chance (London, 1866). E. T.
Whittaker and G. Robinson, The Calculus of Observations
(London, 1932).

HILDA GEIRINGER

[See also Certainty; Chance; Determinism; Game Theory;
Primitivism; Progress
in the Modern Era; Pythagorean...;
Rationality; Utopia.]