Document-term matrix

{{Unreferenced stub|auto=yes|date=December 2009}}
A '''document-term matrix''' is a mathematical [[Matrix (mathematics)|matrix]] that describes the frequency of terms that occur in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms. This matrix is a specific instance of a '''document-feature matrix''', where "features" may refer to other properties of a document besides terms.<ref>{{Cite web|title=Document-feature matrix :: Tutorials for quanteda|url=https://tutorials.quanteda.io/basic-operations/dfm/|access-date=2021-01-02|website=tutorials.quanteda.io}}</ref> It is also common to encounter the transpose, or '''term-document matrix''', where documents are the columns and terms are the rows. Document-term matrices are useful in the fields of [[natural language processing]] and [[computational text analysis]].<ref>{{Cite web|title=15 Ways to Create a Document-Term Matrix in R|url=https://www.dustinstoltz.com/blog/2020/12/1/creating-document-term-matrix-comparison-in-r|access-date=2021-01-02|website=Dustin S. Stoltz|language=en-US}}</ref> While the value of the cells is commonly the raw count of a given term, there are various schemes for weighting the raw counts, such as relative frequency/proportions and [[tf-idf]].
 
Terms are commonly single tokens separated by whitespace or punctuation on either side, i.e. unigrams. In such a case, this is also referred to as a "bag of words" representation, because the counts of individual words are retained but the order of the words in the document is not.
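For illustration, the following Python sketch builds such a bag-of-words count for a single document; the tokenization rule (lowercasing and splitting on non-word characters) is a deliberately naive assumption of the example, not a standard.

<syntaxhighlight lang="python">
import re
from collections import Counter

def bag_of_words(text):
    """Count lowercase unigrams in a document; word order is discarded."""
    tokens = re.findall(r"\w+", text.lower())  # naive split on non-word characters (assumption)
    return Counter(tokens)

print(bag_of_words("I like databases"))
# Counter({'i': 1, 'like': 1, 'databases': 1})
</syntaxhighlight>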
 
==General Concept==
When creating a data-set of [[term (language)|terms]] that appear in a corpus of [[document]]s, the document-term matrix contains rows corresponding to the documents and columns corresponding to the terms. Each ''ij'' cell, then, is the number of times word ''j'' occurs in document ''i''. As such, each row is a vector of term counts that represents the content of the document corresponding to that row. For instance, if one has the following two (short) documents:
*D1 = "I like databases"
*D2 = "I dislike databases",
then the document-term matrix would be:
{| class="wikitable"
! !! I !! like !! dislike !! databases
|-
|'''D1'''||1||1||0||1
|-
|'''D2'''||1||0||1||1
|}
which shows which documents contain which terms and how many times they appear. Note that, unlike representing a document as just a token-count list, the document-term matrix includes all terms in the corpus (i.e. the corpus vocabulary), which is why there are zero counts for terms in the corpus which do not also occur in a specific document.
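A minimal sketch of building this matrix with scikit-learn's CountVectorizer (one tool among many; the lowercasing and token pattern used here are assumptions of the example) is shown below. The columns come out in alphabetical order rather than in the order used in the table above.

<syntaxhighlight lang="python">
from sklearn.feature_extraction.text import CountVectorizer

docs = ["I like databases", "I dislike databases"]

# Permissive token pattern so the single-character token "I" is not dropped
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
dtm = vectorizer.fit_transform(docs)        # sparse document-term matrix

print(vectorizer.get_feature_names_out())   # ['databases' 'dislike' 'i' 'like']
print(dtm.toarray())
# [[1 0 1 1]
#  [1 1 1 0]]
</syntaxhighlight>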
 
As a result of the power-law distribution of tokens in nearly every corpus (see [[Zipf's law]]), it is common to weight the counts. This can be as simple as dividing counts by the total number of tokens in a document (called relative frequency or proportions), dividing by the maximum frequency in each document (called prop max), or taking the log of frequencies (called log count). If one desires to weight the words most unique to an individual document as compared to the corpus as a whole, it is common to use [[tf-idf]], which multiplies the term frequency by the term's inverse document frequency. More sophisticated weighting schemes can also be used.
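The following sketch applies these weightings to the raw count matrix from the example above using NumPy. The exact tf-idf formula varies between implementations; the variant below (relative frequency times the log of the inverse document frequency, without smoothing) is one common choice assumed here for illustration.

<syntaxhighlight lang="python">
import numpy as np

# Raw document-term counts (rows: documents, columns: terms)
counts = np.array([[1, 0, 1, 1],
                   [1, 1, 1, 0]], dtype=float)

# Relative frequency: divide by the total number of tokens in each document
rel_freq = counts / counts.sum(axis=1, keepdims=True)

# Prop max: divide by the maximum frequency in each document
prop_max = counts / counts.max(axis=1, keepdims=True)

# Log count: dampen large counts
log_count = np.log1p(counts)

# tf-idf (one common variant): term frequency times the log of
# (number of documents / number of documents containing the term)
df = (counts > 0).sum(axis=0)            # document frequency per term
idf = np.log(counts.shape[0] / df)
tf_idf = rel_freq * idf

print(tf_idf)                            # only terms unique to one document are nonzero
</syntaxhighlight>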
 
==Choice of Terms==