→Finding topics: +LDA |
sentence case for section titles per MOS:SECTIONS |
Line 4:
Terms are commonly single tokens separated by whitespace or punctuation on either side, also known as unigrams. In such a case, this is also referred to as a "bag of words" representation because the counts of individual words are retained, but not the order of the words in the document.
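The loss of word order can be illustrated with a minimal sketch. The helper name <code>bag_of_words</code> and the lowercase, whitespace-only tokenization are assumptions made for illustration; real tokenizers also handle punctuation.

```python
from collections import Counter

def bag_of_words(text):
    """Count unigrams in a text (hypothetical helper).

    Tokens are lowercased and split on whitespace only, which is a
    simplification of real tokenization.
    """
    return Counter(text.lower().split())

# Two texts with the same words in a different order give the same bag:
# the counts are retained, the order is not.
original = bag_of_words("I like databases")
shuffled = bag_of_words("databases like I")
same_bag = original == shuffled
```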
==General
When creating a data set of [[term (language)|terms]] that appear in a corpus of [[document]]s, the document-term matrix contains rows corresponding to the documents and columns corresponding to the terms. Each ''ij'' cell, then, is the number of times word ''j'' occurs in document ''i''. As such, each row is a vector of term counts that represents the content of the document corresponding to that row. For instance, if one has the following two (short) documents:
*D1 = "I like databases"
Line 20:
Because tokens in nearly every corpus follow a power-law distribution (see [[Zipf's law]]), it is common to weight the counts. This can be as simple as dividing counts by the total number of tokens in a document (called relative frequency or proportions), dividing by the maximum frequency in each document (called prop max), or taking the log of frequencies (called log count). To give more weight to the words most distinctive of an individual document as compared to the corpus as a whole, it is common to use [[tf-idf]], which divides the term frequency by the term's document frequency.
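The weighting schemes above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a reference implementation: the second document is made up for the example, all function names are hypothetical, the log count uses log(1 + count) so that zero counts stay defined, and tf-idf is shown in one common variant that multiplies the count by log(N / document frequency) rather than dividing by the document frequency directly.

```python
import math
from collections import Counter

def build_dtm(docs):
    """Return (vocab, counts); counts[i][j] is how often vocab[j] occurs in docs[i]."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    counts = []
    for toks in tokenized:
        c = Counter(toks)
        counts.append([c[term] for term in vocab])
    return vocab, counts

def relative_frequency(row):
    """Divide each count by the total number of tokens in the document."""
    total = sum(row)
    return [c / total for c in row] if total else list(row)

def prop_max(row):
    """Divide each count by the largest count in the document."""
    largest = max(row)
    return [c / largest for c in row] if largest else list(row)

def log_count(row):
    """Assumption: log(1 + count), so that a zero count maps to zero."""
    return [math.log(1 + c) for c in row]

def tf_idf(counts):
    """One common variant: count * log(N / document frequency)."""
    n_docs = len(counts)
    n_terms = len(counts[0])
    # Document frequency: number of documents in which each term occurs.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    return [[row[j] * math.log(n_docs / df[j]) for j in range(n_terms)]
            for row in counts]

# The first document is taken from the article; the second is hypothetical.
docs = ["I like databases", "I dislike databases"]
vocab, counts = build_dtm(docs)
proportions = [relative_frequency(row) for row in counts]
weighted = tf_idf(counts)
```

In this toy corpus, terms that occur in every document receive a tf-idf weight of zero, while terms that occur in only one document keep a positive weight.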
==Choice of
One way to view the matrix is that each row represents a document. In the [[Vector space model|vectorial semantic model]], which is normally the one used to compute a document-term matrix, the goal is to represent the topic of a document by the frequency of semantically significant terms. The terms are semantic units of the documents. It is often assumed, for [[Indo-European languages]], that nouns, verbs and adjectives are the most significant [[syntactic category|categories]], and that words from those categories should be kept as terms.
Adding [[collocation]]s as terms improves the quality of the vectors, especially when computing similarities between documents.
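A rough sketch of how multi-word terms can enter the vectors is given below. As a simplification, it treats every pair of adjacent tokens as a candidate collocation, whereas a real system would usually keep only pairs that pass a statistical association test. The function names and the underscore-joined term format are assumptions made for illustration.

```python
import math
from collections import Counter

def unigram_and_bigram_terms(tokens):
    """Return the tokens plus every adjacent pair joined by '_'.

    Simplification: all adjacent pairs are kept, not only true collocations.
    """
    bigrams = [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]
    return list(tokens) + bigrams

def cosine_similarity(a, b):
    """Cosine similarity between two term-count vectors stored as Counters."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def document_vector(text, with_bigrams):
    """Build a term-count vector, optionally including bigram terms."""
    tokens = text.lower().split()
    terms = unigram_and_bigram_terms(tokens) if with_bigrams else tokens
    return Counter(terms)

# Hypothetical documents that share words but not word pairs.
doc_a = "new york hotel"
doc_b = "york new hotel"
sim_unigrams = cosine_similarity(document_vector(doc_a, False),
                                 document_vector(doc_b, False))
sim_with_bigrams = cosine_similarity(document_vector(doc_a, True),
                                     document_vector(doc_b, True))
```

With unigrams only, the two hypothetical documents are indistinguishable; once bigram terms are added, their vectors differ because the word pairs differ.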