Replaced {{unreferenced}} with {{more citations needed}} and other General fixes, removed stub tag |
StefanoTrv (talk | contribs) m Spaces |
Line 1:
{{More citations needed|date=January 2021}}
A '''document-term matrix''' is a mathematical [[Matrix (mathematics)|matrix]] that describes the frequency of terms that occur in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms.
Terms are commonly single tokens (unigrams) delimited by whitespace or punctuation on either side. In such a case, this is also referred to as a "bag of words" representation, because the counts of individual words are retained but not the order of the words in the document.
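As an illustration of the bag-of-words idea, the following is a minimal Python sketch that builds a document-term matrix from raw text. The tokenizer, the function names (<code>tokenize</code>, <code>document_term_matrix</code>) and the two sample sentences are assumptions made for this example only; they are not taken from any particular library.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its tokens (runs of letters, digits or apostrophes)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def document_term_matrix(documents):
    """Return (vocabulary, matrix): one row per document, one column per term."""
    tokenized = [tokenize(doc) for doc in documents]

    # Column order: terms in order of first appearance across the collection.
    vocabulary = []
    seen = set()
    for tokens in tokenized:
        for token in tokens:
            if token not in seen:
                seen.add(token)
                vocabulary.append(token)

    # Each cell holds how often the column's term occurs in the row's document.
    counts = [Counter(tokens) for tokens in tokenized]
    matrix = [[count[term] for term in vocabulary] for count in counts]
    return vocabulary, matrix


# Hypothetical documents: same words, different order.
docs = ["The cat chased the dog", "The dog chased the cat"]
vocabulary, matrix = document_term_matrix(docs)
```

Because only counts are kept, the two sample sentences produce identical rows even though they mean different things, which is the loss of word order described above.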
Line 9:
*D2 = "I dislike databases",
then the document-term matrix would be:
{| border="1" cellspacing="0"
! ||I||like||dislike||databases
|- align=center
Line 21:
==Choice of Terms==
One way to read the matrix is that each row represents a document as a vector of term counts. In the [[Vector space model|vectorial semantic model]], which is normally the one used to compute a document-term matrix, the goal is to represent the topic of a document by the frequency of semantically significant terms. The terms are semantic units of the documents. It is often assumed, for [[Indo-European languages]], that nouns, verbs and adjectives are the most significant [[syntactic category|categories]], and that words from those categories should be kept as terms.
Adding [[collocation]]s as terms improves the quality of the vectors, especially when computing similarities between documents.
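
The effect of multi-word terms on document similarity can be sketched in Python as follows. This is a simplified illustration under stated assumptions: every adjacent word pair is used as a stand-in for collocations (an actual collocation extractor would keep only pairs selected by some association measure), similarity is measured with the cosine of the angle between the count vectors, and the helper names and sample sentences are hypothetical.

```python
import math
from collections import Counter


def term_counts(text, use_bigrams=False):
    """Count unigram terms and, optionally, adjacent word pairs as extra terms."""
    tokens = text.lower().split()
    terms = list(tokens)
    if use_bigrams:
        # Assumption: all adjacent pairs approximate collocations here.
        terms += [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return Counter(terms)


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical documents: same words, different meaning.
d1 = "the dog chased the cat"
d2 = "the cat chased the dog"

unigram_similarity = cosine_similarity(term_counts(d1), term_counts(d2))
bigram_similarity = cosine_similarity(
    term_counts(d1, use_bigrams=True), term_counts(d2, use_bigrams=True)
)
```

With unigram terms alone the two sentences are indistinguishable, whereas including word pairs such as "dog chased" and "cat chased" lowers the similarity, so the vectors capture part of the difference in meaning.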