{{Unreferenced stub|auto=yes|date=December 2009}}
A '''document-term matrix''' is a mathematical [[Matrix (mathematics)|matrix]] that describes the frequency of terms that occur in a collection of documents. Each row corresponds to a document in the collection and each column corresponds to a word or term; the transposed arrangement, with terms as rows and documents as columns, is called a term-document matrix. There are various schemes for determining the value that each entry in the matrix should take; one such scheme is [[tf-idf]]. Document-term matrices are useful in the field of [[natural language processing]].
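For illustration, the following Python sketch builds a document-term matrix from a three-document toy corpus, assuming the scikit-learn library is available; the corpus and the choice of raw counts are arbitrary, and <code>TfidfVectorizer</code> could be substituted to obtain [[tf-idf]] weights instead.

<syntaxhighlight lang="python">
from sklearn.feature_extraction.text import CountVectorizer

# A toy corpus; any list of strings would do.
documents = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(documents)   # sparse matrix: rows are documents, columns are terms

print(vectorizer.get_feature_names_out())   # the vocabulary, i.e. the column labels
print(dtm.toarray())                        # raw term counts for each document
</syntaxhighlight>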
[[Latent semantic analysis]] (performing a rank-reduced [[singular value decomposition]] of the document-term matrix) can improve search results by [[disambiguation|disambiguating]] [[polysemy|polysemous words]] and by retrieving [[synonym]]s of the query terms. However, searching in the resulting high-dimensional continuous space is much slower than searching the standard [[trie]] data structure used by search engines.
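A rough sketch of this approach, again assuming scikit-learn and a toy corpus (the number of retained dimensions is an arbitrary choice): a rank-reduced SVD of the weighted document-term matrix maps each document to a low-dimensional vector in which related documents lie close together.

<syntaxhighlight lang="python">
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

documents = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

# tf-idf weighted document-term matrix
dtm = TfidfVectorizer().fit_transform(documents)

# Rank-reduced SVD maps each document into a low-dimensional "concept" space;
# a query projected with transform() can then be compared by cosine similarity.
lsa = TruncatedSVD(n_components=2)
doc_vectors = lsa.fit_transform(dtm)
print(doc_vectors)                          # one 2-dimensional vector per document
</syntaxhighlight>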
=== Finding topics ===
[[Multivariate analysis]] of the document-term matrix can reveal topics/themes of the corpus. Specifically, [[latent semantic analysis]] and [[data clustering]] can be used, and more recently [[probabilistic latent semantic analysis]] and [[non-negative matrix factorization]] have been found to perform well for this task.
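For example, non-negative matrix factorization approximately factors the document-term matrix into a document-topic matrix and a topic-term matrix. A minimal sketch, assuming scikit-learn and an illustrative four-document corpus with two topics:

<syntaxhighlight lang="python">
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

documents = [
    "the cat chased the dog around the yard",
    "dogs and cats make good pets",
    "stocks fell as markets reacted to the report",
    "investors sold shares after the earnings report",
]

vectorizer = TfidfVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(documents)

# Factor the document-term matrix into a document-topic matrix and a topic-term matrix.
nmf = NMF(n_components=2, random_state=0)
doc_topics = nmf.fit_transform(dtm)         # rows: documents, columns: topic weights
topic_terms = nmf.components_               # rows: topics,    columns: term weights

terms = vectorizer.get_feature_names_out()
for k, topic in enumerate(topic_terms):
    top = topic.argsort()[::-1][:3]         # three highest-weighted terms in this topic
    print(f"topic {k}:", [terms[i] for i in top])
</syntaxhighlight>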
== See also ==
* [[Bag of words model]]
{{DEFAULTSORT:Document-Term Matrix}}
[[Category:Natural language processing]]
{{compu-AI-stub}}