Document-term matrix
A '''document-term matrix''' is a mathematical [[Matrix (mathematics)|matrix]] that describes the frequency of terms occurring in a collection of documents. Each row corresponds to a document in the collection, and each column corresponds to a word or term (the transpose is known as a term-document matrix). There are various schemes for determining the value that each entry in the matrix should take; one such scheme is [[tf-idf]]. Document-term matrices are useful in the field of [[natural language processing]].
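As an illustration, the following sketch builds a small document-term matrix, first with raw term counts and then with tf-idf weights. It assumes the scikit-learn library is available; the corpus and variable names are invented for the example.

<syntaxhighlight lang="python">
# A small illustrative corpus; the documents are made up for this sketch.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "the quick brown fox",
    "the lazy dog",
    "the quick dog chases the fox",
]

# Raw term counts: one row per document, one column per term.
count_vectorizer = CountVectorizer()
counts = count_vectorizer.fit_transform(docs)     # sparse matrix, shape (3, n_terms)
print(count_vectorizer.get_feature_names_out())   # the terms (columns)
print(counts.toarray())                           # the document-term matrix

# The same matrix with tf-idf weights instead of raw frequencies.
tfidf_vectorizer = TfidfVectorizer()
tfidf = tfidf_vectorizer.fit_transform(docs)
print(tfidf.toarray().round(2))
</syntaxhighlight>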
 
[[Latent semantic analysis]] (performing a truncated [[singular value decomposition]] on the document-term matrix) can improve search results by [[disambiguation|disambiguating]] [[polysemy|polysemous words]] and searching for [[synonym]]s of the query. However, searching in the high-dimensional continuous space is much slower than searching the standard [[trie]] data structure of search engines.
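A minimal sketch of this approach, using scikit-learn's TruncatedSVD, is shown below; the corpus, the query, and the choice of two latent dimensions are assumptions made for illustration only.

<syntaxhighlight lang="python">
# Latent semantic analysis sketch: truncated SVD of a tf-idf
# document-term matrix, then cosine similarity in the reduced space.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the car is driven on the road",
    "the truck is driven on the highway",
    "a bank can refuse a loan",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)        # document-term matrix

svd = TruncatedSVD(n_components=2)        # keep 2 latent dimensions
docs_lsa = svd.fit_transform(X)           # documents in the latent space

# Project a query into the same latent space and rank documents by similarity.
query = vectorizer.transform(["automobile on the road"])
query_lsa = svd.transform(query)
print(cosine_similarity(query_lsa, docs_lsa))
</syntaxhighlight>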
 
=== Finding topics ===
[[Multivariate analysis]] of the document-term matrix can reveal topics/themes of the corpus. Specifically, [[latent semantic analysis]] and [[data clustering]] can be used, and more recently [[probabilistic latent semantic analysis]] and [[non-negative matrix factorization]] have been found to perform well for this task.
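The sketch below illustrates topic extraction with non-negative matrix factorization, again using scikit-learn; the corpus and the choice of two topic components are illustrative assumptions, not part of any particular reference implementation.

<syntaxhighlight lang="python">
# Topic-finding sketch: factor a tf-idf document-term matrix into
# document-topic and topic-term components with NMF.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

docs = [
    "football match goal score team",
    "team wins the championship game",
    "stock market prices fall sharply",
    "investors worry about market prices",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)          # document-term matrix

nmf = NMF(n_components=2, random_state=0)   # two assumed topics
doc_topic = nmf.fit_transform(X)            # document-topic weights
topic_term = nmf.components_                # topic-term weights

# Print the three strongest terms for each extracted topic.
terms = vectorizer.get_feature_names_out()
for i, topic in enumerate(topic_term):
    top = topic.argsort()[-3:][::-1]
    print(f"topic {i}:", [terms[j] for j in top])
</syntaxhighlight>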
 
== See also ==
* [[Bag of words model]]
 
{{DEFAULTSORT:Document-Term Matrix}}
[[Category:Natural language processing]]
 
{{compu-AI-stub}}
 