Revision as of 11:56, 26 January 2021 by BattyBot (Replaced {{unreferenced}} with {{more citations needed}} and other General fixes, removed stub tag; Tag: AWB)
Revision as of 17:11, 12 February 2021 by StefanoTrv (m Spaces)
Line 1:
{{More citations needed|date=January 2021}}
A '''document-term matrix''' is a mathematical [[Matrix (mathematics)|matrix]] that describes the frequency of terms that occur in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms. This matrix is a specific instance of a '''document-feature matrix''' where "features" may refer to other properties of a document besides terms.<ref>{{Cite web|title=Document-feature matrix :: Tutorials for quanteda|url=https://tutorials.quanteda.io/basic-operations/dfm/|access-date=2021-01-02|website=tutorials.quanteda.io}}</ref> It is also common to encounter the transpose, or '''term-document matrix''', where documents are the columns and terms are the rows. Both forms are useful in the fields of [[natural language processing]] and [[computational text analysis]].<ref>{{Cite web|title=15 Ways to Create a Document-Term Matrix in R|url=https://www.dustinstoltz.com/blog/2020/12/1/creating-document-term-matrix-comparison-in-r|access-date=2021-01-02|website=Dustin S. Stoltz|language=en-US}}</ref>
While the value of each cell is commonly the raw count of a given term, there are various schemes for weighting the raw counts, such as relative frequency/proportions and [[tf-idf]].
Terms are commonly single tokens, or unigrams, separated by whitespace or punctuation on either side. In such a case, this is also referred to as a "bag of words" representation because the counts of individual words are retained, but not the order of the words in the document.
Line 9:
*D2 = "I dislike databases",
then the document-term matrix would be:
{| border="1" cellspacing="0"
! ||I||like||dislike||databases
|- align=center
Line 21:
==Choice of Terms==
A point of view on the matrix is that each row represents a document.
In the [[Vector space model|vectorial semantic model]], which is normally the one used to compute a document-term matrix, the goal is to represent the topic of a document by the frequency of semantically significant terms. The terms are semantic units of the documents. It is often assumed, for [[Indo-European languages]], that nouns, verbs and adjectives are the more significant [[syntactic category|categories]], and that words from those categories should be kept as terms. Adding [[collocation]]s as terms improves the quality of the vectors, especially when computing similarities between documents.
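
The row-per-document view and the bag-of-words counting described above can be illustrated with a minimal Python sketch. It is not a reference implementation. The function names (<code>tokenize</code>, <code>document_term_matrix</code>, <code>relative_frequencies</code>, <code>transpose</code>) are hypothetical, tokenization is assumed to be a plain whitespace split, and the first document is an assumed example; only "I dislike databases" comes from the text above. Columns follow first-seen order, which may differ from the column order of the table.

```python
from collections import Counter


def tokenize(text):
    # Assumption: terms are unigrams obtained by splitting on whitespace.
    return text.split()


def document_term_matrix(docs):
    """Return (vocabulary, matrix) with one row per document and one
    column per term; each cell holds the raw count of the term."""
    tokenized = [tokenize(d) for d in docs]
    vocab = []
    for tokens in tokenized:
        for tok in tokens:
            if tok not in vocab:  # keep terms in first-seen order
                vocab.append(tok)
    counts = [Counter(tokens) for tokens in tokenized]
    matrix = [[c[term] for term in vocab] for c in counts]
    return vocab, matrix


def relative_frequencies(matrix):
    """Weight raw counts by dividing each row by its total count."""
    weighted = []
    for row in matrix:
        total = sum(row)
        weighted.append([v / total if total else 0.0 for v in row])
    return weighted


def transpose(matrix):
    """Turn a document-term matrix into a term-document matrix."""
    return [list(col) for col in zip(*matrix)]


# The first document is an assumed example; the second is D2 from the text.
docs = ["I like databases", "I dislike databases"]
vocab, dtm = document_term_matrix(docs)
tdm = transpose(dtm)
rel = relative_frequencies(dtm)
```

Because only counts are stored, word order is lost: any reordering of the words in a document produces the same row.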

Document-term matrix: Difference between revisions