Probabilistic latent semantic analysis

Considering observations in the form of co-occurrences <math>(w,d)</math> of words and documents, PLSA models the probability of each co-occurrence as a mixture of conditionally independent [[multinomial distribution]]s:
: <math>P(w,d) = \sum_c P(c) P(d|c) P(w|c) = P(d) \sum_c P(c|d) P(w|c)</math>
 
with <math>c</math> being the words' topic. Note that the number of topics is a hyperparameter that must be chosen in advance and is not estimated from the data. The first formulation is the ''symmetric'' formulation, where <math>w</math> and <math>d</math> are both generated from the latent class <math>c</math> in similar ways (using the conditional probabilities <math>P(d|c)</math> and <math>P(w|c)</math>), whereas the second formulation is the ''asymmetric'' formulation, where, for each document <math>d</math>, a latent class is chosen conditionally on the document according to <math>P(c|d)</math>, and a word is then generated from that class according to <math>P(w|c)</math>. Although we have used words and documents in this example, the co-occurrence of any pair of discrete variables may be modelled in exactly the same way.
 
So, the number of parameters is equal to <math>cd + wc</math>, where <math>c</math>, <math>d</math> and <math>w</math> denote the numbers of topics, documents and words respectively. The number of parameters grows linearly with the number of documents. In addition, although PLSA is a generative model of the documents in the collection it is estimated on, it is not a generative model of new documents.
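The parameters are typically estimated with the [[expectation–maximization algorithm]] applied to a term-document count matrix <math>n(d,w)</math>. The following is a minimal NumPy sketch of EM for the symmetric formulation; the function name <code>plsa</code>, the dense responsibility tensor, and the fixed iteration count are illustrative choices rather than any reference implementation, and a practical version would exploit sparsity and monitor the log-likelihood for convergence.

<syntaxhighlight lang="python">
import numpy as np

def plsa(n_dw, n_topics, n_iter=100, seed=0):
    """EM for the symmetric pLSA model P(w,d) = sum_c P(c) P(d|c) P(w|c).

    n_dw: (D, W) term-document count matrix.
    Returns P(c) of shape (C,), P(d|c) of shape (C, D), P(w|c) of shape (C, W).
    """
    rng = np.random.default_rng(seed)
    n_docs, n_words = n_dw.shape
    # Random initialisation, normalised so each row is a proper distribution.
    p_c = np.full(n_topics, 1.0 / n_topics)
    p_d_c = rng.random((n_topics, n_docs))
    p_d_c /= p_d_c.sum(axis=1, keepdims=True)
    p_w_c = rng.random((n_topics, n_words))
    p_w_c /= p_w_c.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # E-step: responsibilities P(c|d,w) ∝ P(c) P(d|c) P(w|c),
        # stored as a dense (C, D, W) tensor for clarity.
        joint = p_c[:, None, None] * p_d_c[:, :, None] * p_w_c[:, None, :]
        joint /= joint.sum(axis=0, keepdims=True)

        # M-step: re-estimate each distribution from the
        # expected counts n(d,w) P(c|d,w).
        weighted = joint * n_dw[None, :, :]
        totals = weighted.sum(axis=(1, 2))
        p_c = totals / totals.sum()
        p_d_c = weighted.sum(axis=2) / totals[:, None]
        p_w_c = weighted.sum(axis=1) / totals[:, None]
    return p_c, p_d_c, p_w_c
</syntaxhighlight>

On a toy corpus, <code>plsa(n_dw, n_topics=2)</code> returns distributions whose mixture <math>\sum_c P(c) P(d|c) P(w|c)</math> approximates the normalised co-occurrence matrix <math>n(d,w)/\sum_{d,w} n(d,w)</math>.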
 
== Application ==
PLSA may be used in a discriminative setting, via [[Fisher kernel]]s.<ref>Thomas Hofmann, [https://papers.nips.cc/paper/1654-learning-the-similarity-of-documents-an-information-geometric-approach-to-document-retrieval-and-categorization.pdf ''Learning the Similarity of Documents: an information-geometric approach to document retrieval and categorization''], [[Advances in Neural Information Processing Systems]] 12, pp. 914–920, [[MIT Press]], 2000</ref>
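A minimal sketch of this idea, assuming the <code>plsa</code> function above: each document is represented by its topic posterior <math>P(c|d)</math>, and documents are compared with an inner-product kernel. This plain dot product is only an illustrative stand-in; Hofmann's actual Fisher kernel derives the feature map from Fisher scores of the model's log-likelihood.

<syntaxhighlight lang="python">
import numpy as np

def topic_posteriors(p_c, p_d_c):
    # P(c|d) ∝ P(c) P(d|c), normalised over topics for each document.
    post = p_c[:, None] * p_d_c                        # shape (C, D)
    return (post / post.sum(axis=0, keepdims=True)).T  # shape (D, C)

def document_kernel(p_c, p_d_c):
    # Inner products between topic-posterior vectors give a (D, D)
    # similarity matrix usable by kernel classifiers (e.g. an SVM)
    # or for retrieval by nearest neighbours.
    theta = topic_posteriors(p_c, p_d_c)
    return theta @ theta.T
</syntaxhighlight>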
 
PLSA has applications in [[information retrieval]] and [[information filtering|filtering]], [[natural language processing]], [[machine learning]] from text, [[bioinformatics]],<ref>{{Cite conference|chapter=Enhanced probabilistic latent semantic analysis with weighting schemes to predict genomic annotations|conference=The 13th IEEE International Conference on BioInformatics and BioEngineering|last1=Pinoli|first1=Pietro|last2=et|first2=al.|title=Proceedings of IEEE BIBE 2013|date=2013|publisher=IEEE|pages=1–4|language=en|doi=10.1109/BIBE.2013.6701702|isbn=978-1-4799-3163-7}}</ref> and related areas.
 
It is reported that the [[aspect model]] used in the probabilistic latent semantic analysis has severe [[overfitting]] problems.<ref>{{cite journal|title=Latent Dirichlet Allocation|journal=Journal of Machine Learning Research|year=2003|first=David M.|last=Blei|author2=Andrew Y. Ng|author3=Michael I. Jordan|volume=3|pages=993–1022|url=http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf|doi=10.1162/jmlr.2003.3.4-5.993}}</ref>
 
In 2012, pLSA was also applied in the [[bioinformatics]] context, for prediction of [[gene ontology]] biomolecular annotations.<ref>[http://home.dei.polimi.it/chicco/Wcci2012_DavideChicco_et_al.pdf ''"Probabilistic Latent Semantic Analysis for prediction of Gene Ontology annotations"''], Marco Masseroli, Davide Chicco, Pietro Pinoli. IEEE WCCI 2012, the 2012 IEEE World Congress on Computational Intelligence proceedings. Brisbane, Australia, June 2012.</ref>
 
== Extensions ==
* Hierarchical extensions:
** Asymmetric: MASHA ("Multinomial ASymmetric Hierarchical Analysis")<ref>Alexei Vinokourov and Mark Girolami, [http://citeseer.ist.psu.edu/rd/30973750,455249,1,0.25,Download/http://citeseer.ist.psu.edu/cache/papers/cs/22961/http:zSzzSzcis.paisley.ac.ukzSzvino-ci0zSzvinokourov_masha.pdf/vinokourov02probabilistic.pdf A Probabilistic Framework for the Hierarchic Organisation and Classification of Document Collections], in ''Information Processing and Management'', 2002</ref>
** Symmetric: HPLSA ("Hierarchical Probabilistic Latent Semantic Analysis")<ref>Eric Gaussier, Cyril Goutte, Kris Popat and Francine Chen,
[http://www.xrce.xerox.com/Research-Development/Publications/2002-004 A Hierarchical Model for Clustering and Categorising Documents] {{Webarchive|url=https://web.archive.org/web/20160304033131/http://www.xrce.xerox.com/Research-Development/Publications/2002-004 |date=2016-03-04 }}, in "Advances in Information Retrieval -- Proceedings of the 24th [[Information Retrieval Specialist Group|BCS-IRSG]] European Colloquium on IR Research (ECIR-02)", 2002</ref>
 
* Generative models: The following models have been developed to address an often-criticized shortcoming of PLSA, namely that it is not a proper generative model for new documents.
** [[Latent Dirichlet allocation]], which adds a [[Dirichlet distribution|Dirichlet]] prior on the per-document topic distribution
 
==History==
This is an example of a [[latent class model]] (see references therein), and it is related<ref>Chris Ding, Tao Li, Wei Peng (2006). "[http://www.aaai.org/Papers/AAAI/2006/AAAI06-055.pdf Nonnegative Matrix Factorization and Probabilistic Latent Semantic Indexing: Equivalence, Chi-Square Statistic, and a Hybrid Method]". AAAI 2006</ref><ref>Chris Ding, Tao Li, Wei Peng (2008). "[http://www.sciencedirect.com/science/article/pii/S0167947308000145 On the equivalence between Non-negative Matrix Factorization and Probabilistic Latent Semantic Indexing]"</ref> to [[non-negative matrix factorization]]. The present terminology was coined in 1999 by [[Thomas Hofmann]].<ref>Thomas Hofmann, [https://arxiv.org/abs/1301.6705 ''Probabilistic Latent Semantic Indexing''], Proceedings of the Twenty-Second Annual International [[Special Interest Group on Information Retrieval|SIGIR]] Conference on Research and Development in [[Information Retrieval]] (SIGIR-99), 1999</ref>
 
== See also ==
 
==External links==
*[https://web.archive.org/web/20050120213347/http://www.cs.brown.edu/people/th/papers/Hofmann-UAI99.pdf Probabilistic Latent Semantic Analysis]
*[https://web.archive.org/web/20170717235351/http://www.semanticquery.com/archive/semanticsearchart/researchpLSA.html Complete PLSA DEMO in C#]
 
{{DEFAULTSORT:Probabilistic Latent Semantic Analysis}}
[[Category:Statistical natural language processing]]
[[Category:Classification algorithms]]
[[Category:Latent variable models]]
[[Category:Language modeling]]