Document-term matrix: Difference between revisions

Janislaw (talk | contribs)
 
==Choice of Terms==
Each row of the matrix represents one document. In the [[Vector space model|vector space model]], which is the model normally used when computing a document-term matrix, the goal is to represent the topic of a document by the frequency of its semantically significant terms. The terms are the semantic units of the documents. For [[Indo-European languages]], it is often assumed that nouns, verbs and adjectives are the more significant [[syntactic category|categories]], and that words from those categories should be kept as terms.
Adding [[collocation]]s as terms improves the quality of the vectors, especially when computing similarities between documents.
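
The choice of terms described above can be illustrated with a minimal Python sketch. It is not a reference implementation: the toy corpus, the function names, the small stop-word list (used here as a crude stand-in for part-of-speech filtering) and the rule that an adjacent word pair seen at least twice counts as a collocation are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical toy corpus; each string is one document.
DOCS = [
    "the new york times reported the story",
    "she moved to new york in the spring",
    "the story was reported in the times",
]

# Assumption: a tiny stop-word list stands in for part-of-speech filtering
# (keeping nouns, verbs and adjectives would need a real tagger).
STOP_WORDS = {"the", "to", "in", "was", "she"}


def tokenize(doc):
    """Lowercase the document and split it on whitespace."""
    return doc.lower().split()


def find_collocations(docs, min_count=2):
    """Return adjacent word pairs seen at least min_count times in the corpus.

    Pairs containing a stop word are discarded. This frequency threshold is
    a simplification, not a statistical collocation test.
    """
    pairs = Counter()
    for doc in docs:
        tokens = tokenize(doc)
        pairs.update(zip(tokens, tokens[1:]))
    return {
        pair
        for pair, n in pairs.items()
        if n >= min_count and not (set(pair) & STOP_WORDS)
    }


def extract_terms(doc, collocations):
    """Return the single-word terms of a document plus its collocations."""
    tokens = tokenize(doc)
    terms = [t for t in tokens if t not in STOP_WORDS]
    terms += [" ".join(p) for p in zip(tokens, tokens[1:]) if p in collocations]
    return terms


def document_term_matrix(docs):
    """Build a vocabulary and a matrix with one row of counts per document."""
    collocations = find_collocations(docs)
    counts = [Counter(extract_terms(d, collocations)) for d in docs]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[t] for t in vocabulary] for c in counts]
    return vocabulary, matrix


VOCABULARY, MATRIX = document_term_matrix(DOCS)
```

In this sketch each row of <code>MATRIX</code> corresponds to one document and each column to one entry of <code>VOCABULARY</code>. The pair "new york" becomes a term of its own alongside the single words "new" and "york", so the two documents that mention it share an additional non-zero column.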