Document-term matrix
A '''document-term matrix''' is a mathematical [[Matrix (mathematics)|matrix]] that describes the frequency of terms occurring in a collection of documents. Each row corresponds to a document in the collection, and each column corresponds to a word or term (the transpose is known as a term-document matrix). There are various schemes for determining the value that each entry in the matrix should take; one such scheme is [[tf-idf]]. Document-term matrices are useful in the field of [[natural language processing]].
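As an illustration, the following sketch builds a small document-term matrix, first with raw term counts and then with tf-idf weights. It assumes the scikit-learn library is available; the corpus and variable names are invented for the example.

<syntaxhighlight lang="python">
# A small illustrative corpus; the documents are made up for this sketch.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "the quick brown fox",
    "the lazy dog",
    "the quick dog chases the fox",
]

# Raw term counts: one row per document, one column per term.
count_vectorizer = CountVectorizer()
counts = count_vectorizer.fit_transform(docs)     # sparse matrix, shape (3, n_terms)
print(count_vectorizer.get_feature_names_out())   # the terms (columns)
print(counts.toarray())                           # the document-term matrix

# The same matrix with tf-idf weights instead of raw frequencies.
tfidf_vectorizer = TfidfVectorizer()
tfidf = tfidf_vectorizer.fit_transform(docs)
print(tfidf.toarray().round(2))
</syntaxhighlight>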
 
[[Latent semantic analysis]] (performing a truncated [[singular value decomposition]] on the document-term matrix) can improve search results by [[disambiguation|disambiguating]] [[polysemy|polysemous words]] and searching for [[synonym]]s of the query. However, searching in the high-dimensional continuous space is much slower than searching the standard [[trie]] data structure of search engines.
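A minimal sketch of this approach, using scikit-learn's TruncatedSVD, is shown below; the corpus, the query, and the choice of two latent dimensions are assumptions made for illustration only.

<syntaxhighlight lang="python">
# Latent semantic analysis sketch: truncated SVD of a tf-idf
# document-term matrix, then cosine similarity in the reduced space.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the car is driven on the road",
    "the truck is driven on the highway",
    "a bank can refuse a loan",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)        # document-term matrix

svd = TruncatedSVD(n_components=2)        # keep 2 latent dimensions
docs_lsa = svd.fit_transform(X)           # documents in the latent space

# Project a query into the same latent space and rank documents by similarity.
query = vectorizer.transform(["automobile on the road"])
query_lsa = svd.transform(query)
print(cosine_similarity(query_lsa, docs_lsa))
</syntaxhighlight>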
 
=== Finding topics ===
[[Multivariate analysis]] of the document-term matrix can reveal topics/themes of the corpus. Specifically, [[latent semantic analysis]] and [[data clustering]] can be used, and more recently [[probabilistic latent semantic analysis]] and [[non-negative matrix factorization]] have been found to perform well for this task.
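The sketch below illustrates topic extraction with non-negative matrix factorization, again using scikit-learn; the corpus and the choice of two topic components are illustrative assumptions, not part of any particular reference implementation.

<syntaxhighlight lang="python">
# Topic-finding sketch: factor a tf-idf document-term matrix into
# document-topic and topic-term components with NMF.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

docs = [
    "football match goal score team",
    "team wins the championship game",
    "stock market prices fall sharply",
    "investors worry about market prices",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)          # document-term matrix

nmf = NMF(n_components=2, random_state=0)   # two assumed topics
doc_topic = nmf.fit_transform(X)            # document-topic weights
topic_term = nmf.components_                # topic-term weights

# Print the three strongest terms for each extracted topic.
terms = vectorizer.get_feature_names_out()
for i, topic in enumerate(topic_term):
    top = topic.argsort()[-3:][::-1]
    print(f"topic {i}:", [terms[j] for j in top])
</syntaxhighlight>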
 
== See also ==
* [[Bag of words model]]
 
{{DEFAULTSORT:Document-Term Matrix}}
[[Category:Natural language processing]]
 
{{compu-AI-stub}}
 