Concentration parameter
{{for|the inverse of a covariance matrix|Concentration matrix}}
{{Refimprove|date=September 2010}}
In [[probability theory]] and [[statistics]], a '''concentration parameter''' is a special kind of [[numerical parameter]] of a [[parametric family]] of [[probability distribution]]s. Concentration parameters occur in two kinds of distribution: in the [[Von Mises–Fisher distribution]], and in conjunction with distributions whose ___domain is a probability distribution, such as the [[symmetric Dirichlet distribution]] and the [[Dirichlet process]]. The rest of this article focuses on the latter usage.

The larger the value of the concentration parameter, the more evenly distributed is the resulting distribution (the more it tends towards the [[Uniform distribution (continuous)|uniform distribution]]). The smaller the value of the concentration parameter, the more sparsely distributed is the resulting distribution, with most values or ranges of values having a probability near zero (in other words, the more it tends towards a distribution concentrated on a single point, the [[degenerate distribution]] defined by the [[Dirac delta function]]).
 
==Dirichlet distribution==
In the case of multivariate Dirichlet distributions, there is some confusion over how to define the concentration parameter. In the topic modelling literature, it is often defined as the sum of the individual Dirichlet parameters,<ref>{{Cite conference|last=Wallach|first=Hanna M.|author-link=Hanna Wallach|author2=Iain Murray|author3=Ruslan Salakhutdinov|author4=David Mimno|date=2009|title=Evaluation methods for topic models|series=ICML '09|___location=New York, NY, USA|publisher=ACM|pages=1105–1112|doi=10.1145/1553374.1553515|isbn=978-1-60558-516-1|book-title=Proceedings of the 26th Annual International Conference on Machine Learning|citeseerx=10.1.1.149.771}}</ref> while when discussing symmetric Dirichlet distributions (where the parameters are the same for all dimensions) it is often defined to be the value of the single Dirichlet parameter used in all dimensions.{{Citation needed|date=November 2011}} This second definition is smaller by a factor of the dimension of the distribution.
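The factor-of-''k'' relationship between the two definitions can be checked with a line of arithmetic; the dimension and parameter below are purely illustrative values, not ones prescribed by any particular model:

```python
k = 50         # dimension of the Dirichlet distribution (illustrative value)
alpha = 0.1    # per-dimension parameter: the symmetric-Dirichlet definition

# Topic-modelling definition: the sum of all k individual Dirichlet parameters.
concentration_sum = k * alpha

print(alpha, concentration_sum)  # the two definitions differ by a factor of k
```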
 
A concentration parameter of 1 (or ''k'', the dimension of the Dirichlet distribution, by the definition used in the topic modelling literature) results in all sets of probabilities being equally likely, i.e., in this case the Dirichlet distribution of dimension ''k'' is equivalent to a uniform distribution over a [[Standard simplex|(''k'' − 1)-dimensional simplex]]. This is ''not'' the same as what happens when the concentration parameter tends towards infinity. In the former case, all resulting distributions are equally likely (the distribution over distributions is uniform). In the latter case, only near-uniform distributions are likely (the distribution over distributions is highly peaked around the uniform distribution). Meanwhile, in the limit as the concentration parameter tends towards zero, only distributions with nearly all mass concentrated on one of their components are likely (the distribution over distributions is highly peaked around the ''k'' possible [[Dirac delta distribution]]s centered on one of the components; in terms of the (''k'' − 1)-dimensional simplex, it is highly peaked at the corners of the simplex).
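These three regimes can be illustrated numerically. The sketch below (assuming NumPy; the dimension, sample count, and parameter values are arbitrary choices) draws many samples from a symmetric Dirichlet with the per-dimension concentration parameter well below, equal to, and well above 1, and averages the largest component of each draw:

```python
import numpy as np

rng = np.random.default_rng(0)
k = 10      # dimension of the symmetric Dirichlet (arbitrary choice)
n = 1000    # number of draws per setting (arbitrary choice)

def mean_max(alpha):
    """Average largest component over n draws from Dirichlet(alpha, ..., alpha)."""
    draws = rng.dirichlet(alpha * np.ones(k), size=n)
    return draws.max(axis=1).mean()

sparse = mean_max(0.01)   # well below 1: nearly all mass on one component
flat = mean_max(1.0)      # exactly 1: uniform over the simplex
dense = mean_max(100.0)   # well above 1: every component close to 1/k

print(sparse, flat, dense)
```

For a small concentration parameter the average largest component is close to 1 (near-degenerate draws), at 1 it matches the expected maximum of a uniform draw from the simplex, and for a large concentration parameter it approaches 1/''k'' (near-uniform draws).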
 
==Sparse prior==
For an example of where a sparse prior (concentration parameter much less than 1) is called for, consider a [[topic model]], which is used to learn the topics that are discussed in a set of documents, where each "topic" is described using a [[categorical distribution]] over a vocabulary of words. A typical vocabulary might have 100,000 words, leading to a 100,000-dimensional categorical distribution. The [[prior distribution]] for the parameters of the categorical distribution would likely be a [[symmetric Dirichlet distribution]]. However, a coherent topic might only have a few hundred words with any significant probability mass. Accordingly, a reasonable setting for the concentration parameter might be 0.01 or 0.001. With a larger vocabulary of around 1,000,000 words, an even smaller value, e.g. 0.0001, might be appropriate.
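The effect of such a sparse prior can be sketched as follows (assuming NumPy; the vocabulary size is scaled down from 100,000 for speed, and the thresholds used to measure sparsity are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size = 10_000   # scaled-down toy vocabulary (a real one might be ~100,000)
alpha = 0.001         # sparse symmetric Dirichlet concentration parameter

# Draw one "topic": a categorical distribution over the whole vocabulary.
topic = rng.dirichlet(alpha * np.ones(vocab_size))

# Count the words carrying more than their uniform share 1/vocab_size of the
# mass, and measure how much mass the 100 most probable words account for.
significant = int(np.sum(topic > 1.0 / vocab_size))
top_100_mass = float(np.sort(topic)[-100:].sum())

print(significant, top_100_mass)
```

Only a small subset of the vocabulary receives non-negligible probability, and the most probable few hundred words carry nearly all of the mass, matching the intuition of a "coherent topic" above.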
 
==See also==
* [[Scale parameter]]
 
== References ==
{{reflist}}

[[Category:Theory of probability distributions]]
[[Category:Statistical parameters]]