Document-term matrix

which shows which documents contain which terms and how many times they appear. Note that, unlike representing a document as just a token-count list, the document-term matrix includes all terms in the corpus (i.e. the corpus vocabulary), which is why there are zero-counts for terms in the corpus which do not also occur in a specific document.
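As an illustration of the point about zero-counts, the following is a minimal sketch of how such a matrix could be built. The helper name `build_dtm`, the whitespace tokenizer, the lowercasing, and the two-sentence toy corpus are all assumptions made for this example, not part of any particular library.

```python
from collections import Counter


def build_dtm(docs):
    """Return (vocabulary, matrix): one row per document, one column per term."""
    # Naive tokenization (assumption): lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    # The vocabulary is every distinct term in the whole corpus, sorted for stable columns.
    vocab = sorted({tok for toks in tokenized for tok in toks})
    matrix = []
    for toks in tokenized:
        counts = Counter(toks)
        # Counter returns 0 for terms absent from this document,
        # which is where the zero-counts in the matrix come from.
        matrix.append([counts[term] for term in vocab])
    return vocab, matrix


# Toy corpus, chosen only for illustration.
docs = ["I like databases", "I dislike databases"]
vocab, dtm = build_dtm(docs)
print(vocab)
for row in dtm:
    print(row)
```

Each row has the same length as the corpus vocabulary, so a term that occurs only in the second document still gets a column (holding a zero) in the first.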
 
Because token frequencies in nearly every corpus follow a power-law distribution (see [[Zipf's law]]), it is common to weight the counts. This can be as simple as dividing counts by the total number of tokens in a document (called relative frequency or proportions), dividing by the maximum frequency in each document (called prop max), or taking the log of frequencies (called log count). To emphasize the words most distinctive of an individual document relative to the corpus as a whole, it is common to use [[tf-idf]], which scales each term's frequency by its inverse document frequency, typically the logarithm of the number of documents in the corpus divided by the number of documents that contain the term.
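The weighting schemes above can be sketched in a few lines. This is an illustrative sketch under stated assumptions: the function names and the small count matrix are hypothetical, the log count uses log(1 + c) so that zero counts stay at zero, and the tf-idf variant shown is the plain "raw count times log(N / df)" form, one of several in common use.

```python
import math


def relative_frequency(row):
    """Divide each count by the total number of tokens in the document."""
    total = sum(row)
    return [c / total if total else 0.0 for c in row]


def prop_max(row):
    """Divide each count by the largest count in the document."""
    peak = max(row, default=0)
    return [c / peak if peak else 0.0 for c in row]


def log_count(row):
    """Log-scale the counts; log(1 + c) is an assumption that keeps zeros at zero."""
    return [math.log(1 + c) for c in row]


def tf_idf(matrix):
    """Weight raw counts by inverse document frequency, log(N / df)."""
    n_docs = len(matrix)
    n_terms = len(matrix[0]) if matrix else 0
    # Document frequency: the number of documents in which each term occurs.
    df = [sum(1 for row in matrix if row[j] > 0) for j in range(n_terms)]
    idf = [math.log(n_docs / d) if d else 0.0 for d in df]
    return [[row[j] * idf[j] for j in range(n_terms)] for row in matrix]


# Hypothetical counts: 2 documents (rows) by 4 terms (columns).
dtm = [[2, 0, 1, 1], [1, 3, 0, 0]]
print(relative_frequency(dtm[0]))
print(prop_max(dtm[0]))
print(log_count(dtm[0]))
print(tf_idf(dtm))
```

In this variant, a term that appears in every document (the first column here) receives a tf-idf weight of zero, since log(N / N) = 0, while terms confined to a single document keep a positive weight.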
 
==Choice of Terms==