{{Short description|Table of terms in a collection of documents}}
{{MI|
{{More citations needed|date=January 2021}}
{{Cleanup rewrite|it is very longwinded. The lead does not explain why this matrix would be needed|date=June 2025}}
}}
A '''document-term matrix''' is a mathematical [[Matrix (mathematics)|matrix]] that describes the frequency of terms that occur in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms. This matrix is a specific instance of a '''document-feature matrix''' where "features" may refer to other properties of a document besides terms.<ref>{{Cite web|title=Document-feature matrix :: Tutorials for quanteda|url=https://tutorials.quanteda.io/basic-operations/dfm/|access-date=2021-01-02|website=tutorials.quanteda.io}}</ref> It is also common to encounter the transpose, or '''term-document matrix''', where documents are the columns and terms are the rows. Document-term matrices are useful in the fields of [[natural language processing]] and [[computational text analysis]].<ref>{{Cite web|title=15 Ways to Create a Document-Term Matrix in R|url=https://www.dustinstoltz.com/blog/2020/12/1/creating-document-term-matrix-comparison-in-r|access-date=2021-01-02|website=Dustin S. Stoltz|language=en-US}}</ref>
While the value of the cells is commonly the raw count of a given term, there are various schemes for weighting the raw counts, such as relative frequency and [[tf-idf]].
Terms are commonly single words separated by whitespace or punctuation on either side (a.k.a. unigrams). In such a case, this is also referred to as the "bag of words" representation, because the counts of individual words are retained, but not the order of the words in the document.
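As an illustrative sketch (the example sentence and function name are hypothetical), a unigram bag-of-words representation can be obtained by splitting on whitespace and punctuation and counting the resulting tokens:

```python
import re
from collections import Counter

def bag_of_words(document):
    """Count unigrams in a document, discarding word order.

    Tokens are lowercased runs of letters, digits, and apostrophes;
    whitespace and punctuation act as separators.
    """
    tokens = re.findall(r"[a-z0-9']+", document.lower())
    return Counter(tokens)

counts = bag_of_words("The cat sat; the cat ran.")
# 'the' and 'cat' each occur twice; the order of the words is lost.
```

Because only counts survive, "the cat sat" and "sat the cat" map to the same representation.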
==General concept==
For example, given two short documents:
:D1 = "I like hats."
:D2 = "I hate hats."
the document-term matrix is:
{| class="wikitable"
|
|'''I'''||'''like'''||'''hate'''||'''hats'''
|-
|'''D1'''||1||1||0||1
|-
|'''D2'''||1||0||1||1
|}
which shows which documents contain which terms and how many times they appear. Note that, unlike representing a document as just a token-count list, the document-term matrix includes all terms in the corpus (i.e. the corpus vocabulary), which is why there are zero-counts for terms in the corpus which do not also occur in a specific document. For this reason, document-term matrices are usually stored in a sparse matrix format.
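A minimal sketch of this construction (function name and example documents hypothetical), building a dense matrix over the full corpus vocabulary so that absent terms get explicit zero counts:

```python
from collections import Counter

def document_term_matrix(documents):
    """Build (vocabulary, matrix): one row per document, one column per term.

    Each row spans the entire corpus vocabulary, so a term that does not
    occur in a given document contributes a zero count to that row.
    """
    counts = [Counter(doc.lower().split()) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix

vocab, dtm = document_term_matrix(["i like hats", "i hate hats"])
# vocab -> ['hate', 'hats', 'i', 'like']
# dtm   -> [[0, 1, 1, 1], [1, 1, 1, 0]]
```

A production system would store the matrix in a sparse format rather than as nested lists, since most cells in a realistic corpus are zero.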
As a result of the power-law distribution of tokens in nearly every corpus (see [[Zipf's law]]), it is common to weight the counts. This can be as simple as dividing counts by the total number of tokens in a document (called relative frequency or proportions), dividing by the maximum frequency in each document (called prop max), or taking the log of frequencies (called log count). If one desires to weight the words most unique to an individual document as compared to the corpus as a whole, it is common to use [[tf-idf]], which divides the term frequency by the term's document frequency.
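Two of these weightings can be sketched as follows (a simplified illustration with a hypothetical two-document matrix; real toolkits implement many tf-idf variants, this one scales term frequency by the inverse document frequency log(N/df)):

```python
import math

def relative_frequency(row):
    """Divide each raw count by the document's total token count."""
    total = sum(row)
    return [count / total for count in row]

def tf_idf(matrix):
    """Weight counts by log(N/df): terms found in fewer documents score higher."""
    n_docs = len(matrix)
    # Document frequency: in how many documents each term occurs at least once.
    df = [sum(1 for row in matrix if row[j] > 0) for j in range(len(matrix[0]))]
    return [[row[j] * math.log(n_docs / df[j]) for j in range(len(row))]
            for row in matrix]

counts = [[0, 1, 1, 1], [1, 1, 1, 0]]  # hypothetical raw counts
weighted = tf_idf(counts)
# Terms occurring in every document get weight 0 (log(2/2) = 0);
# terms unique to one document get a positive weight (log(2/1)).
```

The relative-frequency rows each sum to 1, which removes the effect of document length.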
The document-term matrix emerged in the earliest years of the computerization of text. The increasing capacity for storing documents created the problem of retrieving a given document in an efficient manner. While previously the work of classifying and indexing was accomplished by hand, researchers explored the possibility of doing this automatically using word frequency information.
One of the first published document-term matrices appeared in [[Harold Borko]]'s 1962 article "The construction of an empirically based mathematically derived classification system" (page 282; see also his 1965 article<ref>{{Cite journal|last=Borko|first=Harold|date=1965|title=A Factor Analytically Derived Classification System for Psychological Reports|url=http://dx.doi.org/10.2466/pms.1965.20.2.393|journal=Perceptual and Motor Skills|volume=20|issue=2|pages=393–406|doi=10.2466/pms.1965.20.2.393|pmid=14279310|s2cid=34230652|issn=0031-5125|url-access=subscription}}</ref>). Borko references two computer programs: "FEAT," which stood for "Frequency of Every Allowable Term," written by John C. Olney of the System Development Corporation, and the Descriptor Word Index Program, written by [[Eileen Stone]], also of the System Development Corporation: <blockquote>Having selected the documents which were to make up the experimental library, the next step consisted of keypunching the entire body of text preparatory to computer processing. The program used for this analysis was FEAT (Frequency of Every Allowable Term). it was written by John C. Olney of the System Development Corporation and is designed to perform frequency and summary counts of individual words and of word pairs. The output of this program is an alphabetical listing, by frequency of occurrence, of all word types which appeared in the text. Certain function words such as and, the, at, a, etc., were placed in a "forbidden word list" table, and the frequency of these words was recorded in a separate listing... A special computer program, called the Descriptor Word Index Program, was written to provide this information and to prepare a document-term matrix in a form suitable for in-put to the Factor Analysis Program. The Descriptor Word Index program was prepared by Eileen Stone of the System Development Corporation.</blockquote>
==Choice of terms==