Document-term matrix

{{More citations needed|date=January 2021}}
A '''document-term matrix''' is a mathematical [[Matrix (mathematics)|matrix]] that describes the frequency of terms that occur in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms. This matrix is a specific instance of a '''document-feature matrix''', where "features" may refer to properties of a document other than terms.<ref>{{Cite web|title=Document-feature matrix :: Tutorials for quanteda|url=https://tutorials.quanteda.io/basic-operations/dfm/|access-date=2021-01-02|website=tutorials.quanteda.io}}</ref> It is also common to encounter the transpose, or '''term-document matrix''', where documents are the columns and terms are the rows. Document-term matrices are useful in [[natural language processing]] and [[computational text analysis]].<ref>{{Cite web|title=15 Ways to Create a Document-Term Matrix in R|url=https://www.dustinstoltz.com/blog/2020/12/1/creating-document-term-matrix-comparison-in-r|access-date=2021-01-02|website=Dustin S. Stoltz|language=en-US}}</ref> While the value of each cell is commonly the raw count of a given term, there are various schemes for weighting the raw counts, such as relative frequencies (proportions) and [[tf-idf]].
 
Terms are commonly single tokens (unigrams) delimited by whitespace or punctuation. In that case the matrix is also referred to as a "bag of words" representation, because the counts of individual words are retained but the order of the words in the document is not.
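As a minimal illustration, the following Python sketch builds a raw-count document-term matrix for a two-document toy corpus; the documents, the whitespace tokenization and the variable names are illustrative assumptions rather than the interface of any particular library.

<syntaxhighlight lang="python">
from collections import Counter

# Toy corpus: each document is a single string (illustrative example).
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# Unigram "bag of words" tokenization: split on whitespace.
tokenized = [doc.split() for doc in docs]

# The vocabulary (sorted set of all terms) defines the columns of the matrix.
vocab = sorted({term for doc in tokenized for term in doc})

# Document-term matrix of raw counts: one row per document, one column per term.
dtm = [[Counter(doc)[term] for term in vocab] for doc in tokenized]

print(vocab)    # ['cat', 'dog', 'log', 'mat', 'on', 'sat', 'the']
for row in dtm:
    print(row)  # first document: [1, 0, 0, 1, 1, 1, 2]
</syntaxhighlight>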
 
==General concept==
 
===Finding topics===
[[Multivariate analysis]] of the document-term matrix can reveal the topics or themes of the corpus. Specifically, [[latent semantic analysis]] and [[data clustering]] can be used; more recently, [[probabilistic latent semantic analysis]] and [[non-negative matrix factorization]] have been found to perform well for this task.
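As a sketch of the latter approach, the following Python example (assuming a recent version of scikit-learn; the toy corpus and the choice of two topics are illustrative assumptions) factorizes a raw-count document-term matrix with non-negative matrix factorization and lists the highest-weighted terms of each topic.

<syntaxhighlight lang="python">
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative corpus with two rough themes (pets and programming).
docs = [
    "the cat sat on the mat",
    "dogs and cats make good pets",
    "python code and software bugs",
    "debugging python software",
]

# Build the document-term matrix of raw counts.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# Factorize X into W (document-topic weights) times H (topic-term weights),
# with all factors non-negative.
nmf = NMF(n_components=2, init="nndsvd", random_state=0)
W = nmf.fit_transform(X)
H = nmf.components_

terms = vectorizer.get_feature_names_out()
for k, topic in enumerate(H):
    top_terms = [terms[i] for i in topic.argsort()[::-1][:3]]
    print(f"topic {k}: {top_terms}")
</syntaxhighlight>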
 
==See also==
==External links==
* [http://nlp.fi.muni.cz/projekty/gensim Gensim]: Open-source Python framework for vector space modelling. Contains memory-efficient algorithms for constructing term-document matrices from text, plus common transformations ([[tf-idf]], [[Latent semantic analysis|LSA]], [[Latent Dirichlet allocation|LDA]]).
 
== References ==
 
{{Reflist}}
 
{{DEFAULTSORT:Document-Term Matrix}}
[[Category:Natural language processing]]
 
{{Compu-AI-stub}}