Document-term matrix

{{More citations needed|date=January 2021}}
A '''document-term matrix''' is a mathematical [[Matrix (mathematics)|matrix]] that describes the frequency of terms that occur in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms. This matrix is a specific instance of a '''document-feature matrix''', where "features" may refer to properties of a document other than terms.<ref>{{Cite web|title=Document-feature matrix :: Tutorials for quanteda|url=https://tutorials.quanteda.io/basic-operations/dfm/|access-date=2021-01-02|website=tutorials.quanteda.io}}</ref> It is also common to encounter the transpose, or '''term-document matrix''', where documents are the columns and terms are the rows. Document-term matrices are useful in [[natural language processing]] and [[computational text analysis]].<ref>{{Cite web|title=15 Ways to Create a Document-Term Matrix in R|url=https://www.dustinstoltz.com/blog/2020/12/1/creating-document-term-matrix-comparison-in-r|access-date=2021-01-02|website=Dustin S. Stoltz|language=en-US}}</ref> While the value of each cell is commonly the raw count of a given term, there are various schemes for weighting the raw counts, such as relative frequencies (proportions) and [[tf-idf]].
 
Terms are commonly single tokens (unigrams) delimited by whitespace or punctuation. In that case the matrix is also referred to as a "bag of words" representation, because the counts of individual words are retained but the order of the words in the document is not.
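As a minimal illustration, the following Python sketch builds a raw-count document-term matrix for a two-document toy corpus; the documents, the whitespace tokenization and the variable names are illustrative assumptions rather than the interface of any particular library.

<syntaxhighlight lang="python">
from collections import Counter

# Toy corpus: each document is a single string (illustrative example).
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# Unigram "bag of words" tokenization: split on whitespace.
tokenized = [doc.split() for doc in docs]

# The vocabulary (sorted set of all terms) defines the columns of the matrix.
vocab = sorted({term for doc in tokenized for term in doc})

# Document-term matrix of raw counts: one row per document, one column per term.
dtm = [[Counter(doc)[term] for term in vocab] for doc in tokenized]

print(vocab)    # ['cat', 'dog', 'log', 'mat', 'on', 'sat', 'the']
for row in dtm:
    print(row)  # first document: [1, 0, 0, 1, 1, 1, 2]
</syntaxhighlight>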
 
==General concept==
 
===Finding topics===
[[Multivariate analysis]] of the document-term matrix can reveal the topics or themes of the corpus. Specifically, [[latent semantic analysis]] and [[data clustering]] can be used; more recently, [[probabilistic latent semantic analysis]] and [[non-negative matrix factorization]] have been found to perform well for this task.
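As a sketch of the latter approach, the following Python example (assuming a recent version of scikit-learn; the toy corpus and the choice of two topics are illustrative assumptions) factorizes a raw-count document-term matrix with non-negative matrix factorization and lists the highest-weighted terms of each topic.

<syntaxhighlight lang="python">
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative corpus with two rough themes (pets and programming).
docs = [
    "the cat sat on the mat",
    "dogs and cats make good pets",
    "python code and software bugs",
    "debugging python software",
]

# Build the document-term matrix of raw counts.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# Factorize X into W (document-topic weights) times H (topic-term weights),
# with all factors non-negative.
nmf = NMF(n_components=2, init="nndsvd", random_state=0)
W = nmf.fit_transform(X)
H = nmf.components_

terms = vectorizer.get_feature_names_out()
for k, topic in enumerate(H):
    top_terms = [terms[i] for i in topic.argsort()[::-1][:3]]
    print(f"topic {k}: {top_terms}")
</syntaxhighlight>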
 
==See also==
==External links==
* [http://nlp.fi.muni.cz/projekty/gensim Gensim]: Open-source Python framework for vector space modelling. Contains memory-efficient algorithms for constructing term-document matrices from text, plus common transformations ([[tf-idf]], [[Latent semantic analysis|LSA]], [[Latent Dirichlet allocation|LDA]]).
 
== References ==
 
{{Reflist}}
 
{{DEFAULTSORT:Document-Term Matrix}}
[[Category:Natural language processing]]
 
{{Compu-AI-stub}}