{{Short description|Logarithm of probabilities, useful for calculations}}
{{distinguish|log odds}}
In [[probability theory]] and [[computer science]], a '''log probability''' is simply a [[logarithm]] of a [[probability]].<ref name="chrispiech" /> The use of log probabilities means representing probabilities on a [[logarithmic scale]] <math>(-\infty, 0]</math>, instead of the standard <math>[0, 1]</math> [[unit interval]].
 
Since the probabilities of [[Independence (probability theory)|independent]] [[event (probability theory)|events]] multiply, and logarithms convert multiplication to addition, log probabilities of independent events add. Log probabilities are thus practical for computations, and have an intuitive interpretation in terms of [[information theory]]: the negative [[expected value]] of the log probabilities is the [[information entropy]] of an event. Similarly, [[likelihood]]s are often transformed to the log scale, and the corresponding [[log-likelihood]] can be interpreted as the degree to which an event supports a [[statistical model]]. The log probability is widely used in implementations of computations with probability, and is studied as a concept in its own right in some applications of information theory, such as [[natural language processing]].
 
==Motivation==
Representing probabilities in this way has several practical advantages:
 
# '''Speed.''' Since multiplication is more [[time complexity|expensive]] than addition, taking the product of a large number of probabilities is often faster if they are represented in log form. (The conversion to log form is expensive, but is only incurred once.) Multiplication arises from calculating the probability that multiple independent events occur: the probability that all independent events of interest occur is the product of all these events' probabilities.
# '''Accuracy.''' The use of log probabilities improves [[numerical stability]] when the probabilities are very small, because of the way in which computers [[Floating point|approximate real numbers]] (see the sketch following this list).<ref name="chrispiech" />
# '''Simplicity.''' Many [[probability distribution]]s have an exponential form. Taking the log of these distributions eliminates the exponential function, unwrapping the exponent. For example, the log probability of the normal distribution's [[probability density function]] is <math>-((x-m_x)/\sigma_m)^2+C</math> instead of <math>C_2 \exp\left(-((x-m_x)/\sigma_m)^2\right)</math>. Log probabilities make some mathematical manipulations easier to perform.
# '''Optimization.''' Since most common [[probability distribution]]s—notably the [[exponential family]]—are only [[Logarithmically concave function|logarithmically concave]],<ref>{{citation |first1=Robert E. |last1=Kass |first2=Paul W. |last2=Vos |title=Geometrical Foundations of Asymptotic Inference |___location=New York |publisher=John Wiley & Sons |year=1997 |isbn=0-471-82668-5 |page=14 |url=https://books.google.com/books?id=e43EAIfUPCwC&pg=PA14 |mode=cs1 }}</ref><ref>{{cite web |first=Alecos |last=Papadopoulos |title=Why we always put log() before the joint pdf when we use MLE (Maximum likelihood Estimation)? |date=September 25, 2013 |work=[[Stack Exchange]] |url=https://stats.stackexchange.com/q/70975 }}</ref> and [[Concave function|concavity]] of the [[objective function]] plays a key role in the [[Mathematical optimization|maximization]] of a function such as probability, optimizers work better with log probabilities.
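
The speed and accuracy points can be seen in a minimal Python sketch (the probability values here are arbitrary illustrations):

<syntaxhighlight lang="python">
import math

# 1000 independent events, each with probability 0.01.
probs = [0.01] * 1000

# The direct product underflows to 0.0 in double precision:
# 0.01 ** 1000 = 1e-2000 is far below the smallest positive
# double (about 5e-324), so the result is lost entirely.
product = 1.0
for p in probs:
    product *= p
print(product)  # 0.0

# Summing log probabilities instead stays well within range.
log_product = sum(math.log(p) for p in probs)
print(log_product)  # about -4605.17, i.e. log(1e-2000)
</syntaxhighlight>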
 
== Representation issues ==
The logarithm function is not defined for zero, so log probabilities can only represent non-zero probabilities. Since the logarithm of a number in the <math>(0, 1)</math> interval is negative, often the negative log probabilities are used. In that case the log probabilities in the following formulas would be [[Additive inverse|inverted]].
 
Any base can be selected for the logarithm.
 
== Basic manipulations ==
{{see|Log semiring}}
 
In this section, we denote probabilities in logarithmic space by <math>x'</math> and <math>y'</math> for short:
: <math>x' = \log(x) \in \mathbb{R}</math>
: <math>y' = \log(y) \in \mathbb{R}</math>
The product of probabilities <math>x \cdot y</math> corresponds to addition in logarithmic space.
 
: <math>\log(x \cdot y) = \log(x) + \log(y) = x' + y'.</math>
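
For example, with natural logarithms, <math>x = 0.5</math> and <math>y = 0.25</math> give

: <math>x' + y' \approx -0.693 + (-1.386) = -2.079 \approx \log(0.125) = \log(x \cdot y).</math>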
 
The [[#Addition in log space|sum of probabilities]] <math>x + y</math> is a bit more involved to compute in logarithmic space, requiring the computation of one exponential and one logarithm.
 
However, in many applications a multiplication of probabilities (giving the probability of all independent events occurring) is used more often than their addition (giving the probability of at least one of several mutually exclusive events occurring). Additionally, the cost of computing the addition can be avoided in some situations by simply using the highest probability as an approximation. Since probabilities are non-negative this gives a lower bound. This approximation is used in reverse to get a [[LogSumExp|continuous approximation of the max function]].
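
As a small illustration of this lower bound (with arbitrary example values):

<syntaxhighlight lang="python">
import math

# Log probabilities of two mutually exclusive events.
x_prime = math.log(0.6)
y_prime = math.log(0.3)

exact = math.log(0.6 + 0.3)     # log(x + y) = log(0.9), about -0.105
approx = max(x_prime, y_prime)  # log(0.6), about -0.511

# The approximation never exceeds the exact value, since
# log(x + y) >= log(max(x, y)) when both probabilities are non-negative.
assert approx <= exact
</syntaxhighlight>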
 
===Addition in log space===
 
: <math>\begin{align}
&\log(x + y) \\
= {}& \log(x + x \cdot y / x) \\
= {}& \log(x + x \cdot \exp(\log(y / x))) \\
= {}& \log(x \cdot (1 + \exp(\log(y) - \log(x)))) \\
= {}& \log(x) + \log(1 + \exp(\log(y) - \log(x))) \\
= {}& x' + \log\left(1 + \exp\left(y' - x'\right)\right)
\end{align}</math>
 
The formula above is more accurate than <math> \log\left(e^{x'} + e^{y'}\right)</math>, provided one takes advantage of the asymmetry in the addition formula. <math> {x'} </math> should be the larger (least negative) of the two operands. This also produces the correct behavior if one of the operands is [[Floating-point arithmetic|floating-point]] [[negative infinity]], which corresponds to a probability of zero.
 
: <math>-\infty + \log\left(1 + \exp\left(y' - (-\infty)\right)\right) = -\infty + \infty </math> This quantity is [[Indeterminate form|indeterminate]], and will result in [[NaN]].
: <math> x' + \log\left(1 + \exp\left(-\infty - x'\right)\right) = x' + 0</math> This is the desired answer.
 
The above formula alone will incorrectly produce an indeterminate result in the case where both arguments are <math> -\infty </math>. This should be checked for separately to return <math> -\infty </math>.
 
For numerical reasons, one should use a function that computes <math> \log(1+x) </math> ([[Natural logarithm#lnp1|log1p]]) directly.
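
Putting the pieces of this section together, a minimal Python sketch might look as follows (the function name <code>log_add</code> is illustrative; libraries often provide this operation directly, e.g. NumPy's <code>logaddexp</code>):

<syntaxhighlight lang="python">
import math

def log_add(x, y):
    """Return log(exp(x) + exp(y)) for log probabilities x and y."""
    # Make x the larger (least negative) operand, so that a single
    # -inf operand ends up inside exp(), where exp(-inf) = 0.
    if x < y:
        x, y = y, x
    # Both probabilities are zero: return log(0) = -inf directly,
    # since y - x would otherwise be the indeterminate -inf - (-inf).
    if x == -math.inf:
        return -math.inf
    # log1p computes log(1 + z) accurately even for tiny z.
    return x + math.log1p(math.exp(y - x))

print(log_add(math.log(0.5), math.log(0.25)))  # log(0.75), about -0.288
print(log_add(math.log(0.5), -math.inf))       # log(0.5), about -0.693
</syntaxhighlight>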
 
==See also==
* [[Information content]]
* [[Log-likelihood]]
 
== References ==
<references>
<ref name="chrispiech">{{cite web |last1=Piech |first1=Chris |title=Probability for Computer scientists - Log probabilities |url=https://chrispiech.github.io/probabilityForComputerScientists/en/part1/log_probabilities/ |access-date=20 July 2023}}</ref>
</references>
 
 
[[Category:Logarithms|Probability]]
[[Category:Mathematics of computing]]