Bayes' theorem is a result in probability theory. It yields the conditional probability distribution of a random variable A, assuming we know:
- information about another variable B in terms of the conditional probability distribution of B given A, and
- the marginal probability distribution of A alone.
This article gives a formal mathematical discussion of the theorem, some of its extensions, and an example of its use. As a formal theorem, it is valid regardless of how one interprets probability. However, frequentist and Bayesian interpretations disagree about the kinds of variables with which the theorem can be validly used for statistical inference; the articles on Bayesian probability and frequentist probability discuss these debates at greater length.
Non-technical explanation
Simply put, Bayes' theorem gives the probability of a random event A occurring given that we know a related event B occurred. This probability is denoted P(A|B), and is read "probability of A given B". This measure is sometimes called the "posterior", since it is computed after all other information on A and B is known.
According to Bayes' theorem, the probability of A occurring given B depends on three things:
- The probability of A occurring on its own, regardless of B. This is denoted P(A) and read "probability of A". This measure is sometimes called the "prior", meaning it precedes any other information – as opposed to the posterior, defined above, which is computed after all other information is known.
- The probability of B occurring on its own, regardless of A. This is denoted P(B) and read "probability of B". This measure is sometimes called the normalising constant, since it is the same regardless of which event A one is studying.
- The probability of B occurring given that A occurred. This is denoted P(B|A) and is read "probability of B given A". This measure is sometimes called the likelihood, since, viewed as a function of A with B fixed, it is the likelihood of A given B. It is important not to confuse the likelihood of A given B with the probability of A given B: although the two notions are related, they are quite different.
Given these three measures, the probability of A given B can be computed as:
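P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}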
Example
- More examples can be found on the page on Bayesian inference.
To illustrate, suppose there are two bowls full of cookies. Bowl #1 has 10 chocolate chip cookies and 30 plain cookies, while bowl #2 has 20 of each. Our friend Fred picks a bowl at random, and then picks a cookie at random. We may assume there is no reason to believe Fred treats one bowl differently from another, likewise for the cookies. The cookie turns out to be a plain one. How probable is it that Fred picked it out of bowl #1?
Intuitively, it seems clear that the answer should be more than half, since the proportion of plain cookies is higher in bowl #1 than in bowl #2. The precise answer is given by Bayes' theorem. But first, we can clarify the situation by rephrasing the question to "what is the probability that Fred picked bowl #1, given that he has a plain cookie?" Thus, to relate to our previous explanation, the event A is that Fred picked bowl #1, and the event B is that Fred picked a plain cookie. To compute P(A|B), we first need to know:
- P(A), or the probability that Fred picked bowl #1 regardless of any other information. Since Fred is treating both bowls equally, it is 0.5.
- P(B), or the probability of getting a plain cookie regardless of any information on the bowls. It is computed as the sum, over the bowls, of the probability of getting a plain cookie from a bowl multiplied by the probability of selecting that bowl. We know from the problem statement that the probability of getting a plain cookie from bowl #1 is 0.75 and the probability of getting one from bowl #2 is 0.5, and since Fred is treating both bowls equally, the probability of selecting either one of them is 0.5. Thus, the probability of getting a plain cookie overall is 0.75×0.5 + 0.5×0.5 = 0.625.
- P(B|A), or the probability of getting a plain cookie given that Fred has selected bowl #1. From the problem statement, we know this is 0.75, since 30 out of 40 cookies in bowl #1 are plain.
Given all this information, we can compute the probability of Fred having selected bowl #1 given that he got a plain cookie, as such:
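P(A|B) = \frac{P(B|A)\,P(A)}{P(B)} = \frac{0.75 \times 0.5}{0.625} = 0.6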
As we expected, it is more than half.
Historical remarks
Bayes' theorem is named after the Reverend Thomas Bayes (1702-1761), who studied how to compute a distribution for the parameter of a binomial distribution (to use modern terminology). His friend, Richard Price, edited and presented the work in 1763, after Bayes' death, as An Essay towards solving a Problem in the Doctrine of Chances. Pierre-Simon Laplace replicated and extended these results in an essay of 1774, apparently unaware of Bayes' work.
One of Bayes' results (Proposition 5) gives a simple description of conditional probability, and shows that it does not depend on the order in which things occur:
- If there be two subsequent events, the probability of the second b/N and the probability of both together P/N, and it being first discovered that the second event has also happened, the probability I am right [i.e., the conditional probability of the first event being true given that the second has happened] is P/b.
Bayes' main result (Proposition 9 in the essay) is the following: assuming a uniform distribution for the prior distribution of the binomial parameter p, the probability that p is between two values a and b is
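\frac{\int_a^b \binom{n+m}{m}\, p^m (1-p)^n \, dp}{\int_0^1 \binom{n+m}{m}\, p^m (1-p)^n \, dp}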
where m is the number of observed successes and n the number of observed failures. His preliminary results, in particular Propositions 3, 4, and 5, imply the result now called Bayes' Theorem (as described below), but it does not appear that Bayes himself emphasized or focused on that result.
What is "Bayesian" about Proposition 9 is that Bayes presented it as a probability for the parameter p. So, one can compute probability for an experimental outcome, but also for the parameter which governs it, and the same algebra is used to make inferences of either kind.
Bayes states his question in a way that might make the idea of assigning a probability distribution to a parameter palatable to a frequentist. He supposes that a billiard ball is thrown at random onto a billiard table, and that the probabilities p and q are the probabilities that subsequent billiard balls will fall above or below the first ball. By making the binomial parameter p depend on a random event, he escapes a philosophical quagmire of which he most likely was not even aware.
Statement of Bayes' theorem
Bayes' theorem relates conditional and marginal probabilities. To derive the theorem, start from the definition of conditional probability
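P(A|B)\,P(B) = P(A \cap B) = P(B|A)\,P(A)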
It reads: The probability of A given B times the probability of B equals the probability of both events A and B occurring together and also equals the probability of B given A times the probability of A.
Dividing the left- and right-hand sides by P(B), provided that it is non-zero, we obtain
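P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}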
which is conventionally known as Bayes' theorem.
It reads: The probability of A given B equals the probability of B given A times the probability of A, divided by the probability of B.
Each term in Bayes' theorem has a conventional name.
- P(A) is the prior probability or marginal probability of A. "Prior" means it precedes any information about B.
- P(A|B) is the posterior probability of A, given B. "Posterior" means it is derived from or entailed by the specified value of B.
- P(B|A), for a specific value of B, is the likelihood function for A given B. It may also be written as L(A|B).
- P(B) is the prior or marginal probability of B, and acts as the normalizing constant.
With this terminology, the theorem may be paraphrased as
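\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{normalizing constant}}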
In addition, the ratio P(B|A)/P(B) is known as the standardised likelihood, and the theorem may be written
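\text{posterior} = \text{standardised likelihood} \times \text{prior}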
Alternative forms of Bayes' theorem
Bayes' theorem is often embellished by noting that
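P(B) = P(A \cap B) + P(A^C \cap B) = P(B|A)\,P(A) + P(B|A^C)\,P(A^C)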
so the theorem can be restated as
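P(A|B) = \frac{P(B|A)\,P(A)}{P(B|A)\,P(A) + P(B|A^C)\,P(A^C)}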
where A^C is the complementary event of A (often called "not A"). More generally, where {A_i} forms a partition of the event space,
for any A_i in the partition.
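P(A_i|B) = \frac{P(B|A_i)\,P(A_i)}{\sum_j P(B|A_j)\,P(A_j)}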
It can also be written neatly in terms of a likelihood ratio and odds as
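O(A|B) = O(A)\,\Lambda(A|B)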
where
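O(A|B) = \frac{P(A|B)}{P(A^C|B)} is the odds of A given B,
O(A) = \frac{P(A)}{P(A^C)} is the prior odds of A, and
\Lambda(A|B) = \frac{L(A|B)}{L(A^C|B)} = \frac{P(B|A)}{P(B|A^C)} is the likelihood ratio.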
See also the law of total probability.
Bayes' theorem for probability densities
There is also a version of Bayes' theorem for continuous distributions. It is somewhat harder to derive, since probability densities, strictly speaking, are not probabilities, so Bayes' theorem has to be established by a limit process; see Papoulis (citation below), Section 7.3 for an elementary derivation. Bayes' theorem for probability densities is formally similar to the theorem for probabilities:
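f(x|y) = \frac{f(y|x)\,f(x)}{f(y)}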
and there is an analogous statement of the law of total probability:
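f(y) = \int_{-\infty}^{\infty} f(y|x)\,f(x)\,dx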
As in the discrete case, the terms have standard names. f(x, y) is the joint distribution of X and Y, f(x|y) is the posterior distribution of X given Y=y, f(y|x) = L(x|y) is (as a function of x) the likelihood function of X given Y=y, and f(x) and f(y) are the marginal distributions of X and Y respectively, with f(x) being the prior distribution of X.
Here we have indulged in a conventional abuse of notation, using f for each one of these terms, although each one is really a different function; the functions are distinguished by the names of their arguments.
Extensions of Bayes' theorem
Theorems analogous to Bayes' theorem hold in problems with more than two variables. These theorems are not given distinct names, as they may be mass-produced by applying the laws of probability. The general strategy is to work with a decomposition of the joint probability, and to marginalize (integrate) over the variables that are not of interest. Depending on the form of the decomposition, it may be possible to prove that some integrals must be 1, and thus they fall out of the decomposition; exploiting this property can reduce the computations very substantially. A Bayesian network is essentially a mechanism for automatically generating the extensions of Bayes' theorem that are appropriate for a given decomposition of the joint probability.
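For instance, with three events A, B, and C, one representative extension decomposes the joint probability as P(A, B, C) = P(A)\,P(B|A)\,P(C|A, B); conditioning on B and C and marginalizing over the alternatives A' in the denominator gives
P(A|B, C) = \frac{P(A)\,P(B|A)\,P(C|A, B)}{\sum_{A'} P(A')\,P(B|A')\,P(C|A', B)}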
Example
Applications of Bayes' theorem often assume the philosophy underlying Bayesian probability: that uncertainty and degrees of belief can be measured as probabilities. One such example follows. For additional worked-out examples, including simpler ones, see the article on the examples of Bayesian inference.
We describe the marginal probability distribution of a variable A as the prior probability distribution or simply the prior. The conditional distribution of A given the "data" B is the posterior probability distribution or just the posterior.
Suppose we wish to know about the proportion r of voters in a large population who will vote "yes" in a referendum. Let n be the number of voters in a random sample (chosen with replacement, so that we have statistical independence) and let m be the number of voters in that random sample who will vote "yes". Suppose that we observe n = 10 voters and m = 7 say they will vote yes. From Bayes' theorem we can calculate the probability distribution function for r using
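f(r | n=10, m=7) = \frac{f(m=7 | r, n=10)\,f(r)}{\int_0^1 f(m=7 | r, n=10)\,f(r)\,dr}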
From this we see that from the prior probability density function f(r) and the likelihood function L(r) = f(m = 7|r, n = 10), we can compute the posterior probability density function f(r|n = 10, m = 7).
The prior probability density function f(r) summarizes what we know about the distribution of r in the absence of any observation. We provisionally assume in this case that the prior distribution of r is uniform over the interval [0, 1]. That is, f(r) = 1. If some additional background information is found, we should modify the prior accordingly. However, before we have any observations, all values of r are treated as equally likely.
Under the assumption of random sampling, choosing voters is just like choosing balls from an urn. The likelihood function L(r) = f(m = 7 | r, n = 10) for such a problem is just the probability of 7 successes in 10 trials for a binomial distribution.
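L(r) = f(m=7 | r, n=10) = \binom{10}{7}\,r^7\,(1-r)^3 = 120\,r^7\,(1-r)^3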
As with the prior, the likelihood is open to revision -- more complex assumptions will yield more complex likelihood functions. Maintaining the current assumptions, we compute the normalizing factor,
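\int_0^1 f(m=7 | r, n=10)\,f(r)\,dr = \int_0^1 120\,r^7\,(1-r)^3\,dr = \frac{1}{11}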
and the posterior distribution for r is then
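f(r | n=10, m=7) = \frac{120\,r^7\,(1-r)^3}{1/11} = 1320\,r^7\,(1-r)^3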
for r between 0 and 1, inclusive.
One may be interested in the probability that more than half the voters will vote "yes". The prior probability that more than half the voters will vote "yes" is 1/2, by the symmetry of the uniform distribution. In comparison, the posterior probability that more than half the voters will vote "yes", i.e., the conditional probability given the outcome of the opinion poll -- that seven of the 10 voters questioned will vote "yes" -- is
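P(r > 0.5 | n=10, m=7) = \int_{0.5}^{1} 1320\,r^7\,(1-r)^3\,dr \approx 0.887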
which is about an "89% chance".
See also
References
Versions of the essay
- Thomas Bayes (1763), "An Essay towards solving a Problem in the Doctrine of Chances", Philosophical Transactions of the Royal Society of London, 53.
- Thomas Bayes (1763/1958) "Studies in the History of Probability and Statistics: IX. Thomas Bayes's Essay Towards Solving a Problem in the Doctrine of Chances", Biometrika 45:296-315 (Bayes's essay in modernized notation)
- Thomas Bayes "An essay towards solving a Problem in the Doctrine of Chances" (Bayes's essay in the original notation)
Commentaries
- G.A. Barnard. (1958) "Studies in the History of Probability and Statistics: IX. Thomas Bayes's Essay Towards Solving a Problem in the Doctrine of Chances", Biometrika 45:293-295 (biographical remarks)
- Daniel Covarrubias "An Essay Towards Solving a Problem in the Doctrine of Chances" (an outline and exposition of Bayes's essay)
- Stephen M. Stigler (1982) "Thomas Bayes' Bayesian Inference," Journal of the Royal Statistical Society, Series A, 145:250-258 (Stigler argues for a revised interpretation of the essay -- recommended)
- Isaac Todhunter (1865) A History of the Mathematical Theory of Probability from the time of Pascal to that of Laplace, Macmillan. Reprinted 1949, 1956 by Chelsea and 2001 by Thoemmes.
Additional material
- Pierre-Simon Laplace (1774), "Mémoire sur la Probabilité des Causes par les Événements," Savants étrangers 6:621-656; also Oeuvres 8:27-65.
- Pierre-Simon Laplace (1774/1986), "Memoir on the Probability of the Causes of Events", Statistical Science, 1(3):364-378.
- Stephen M. Stigler (1986), "Laplace's 1774 memoir on inverse probability," Statistical Science, 1(3):359-378.
- Stephen M. Stigler (1983), "Who Discovered Bayes's Theorem?" The American Statistician, 37(4):290-296.
- Jeff Miller. Earliest Known Uses of Some of the Words of Mathematics (B) (very informative -- recommended)
- Athanasios Papoulis (1984), Probability, Random Variables, and Stochastic Processes, second edition. New York: McGraw-Hill.
- James Joyce. "Bayes' Theorem", in the Stanford Encyclopedia of Philosophy.