{{Short description|Grouping texts by similarity}}
{{Multiple issues|
{{disputed|date=March 2014}}
{{more footnotes needed|date=March 2014}}
}}
 
'''Document clustering''' (or '''text clustering''') is the application of [[cluster analysis]] to textual documents. It has applications in automatic document organization, [[topic model|topic extraction]] and fast [[information retrieval]] or filtering.
Document clustering involves the use of descriptors and descriptor extraction. Descriptors are sets of words that describe the contents within the cluster. Document clustering is generally considered to be a centralized process. Examples of document clustering include web document clustering for search users.
 
The application of document clustering can be categorized into two types, online and offline. Online applications are usually constrained by efficiency problems when compared to offline applications. Text clustering may be used for different tasks, such as grouping similar documents (news, tweets, etc.), analyzing customer or employee feedback, and discovering meaningful implicit subjects across all documents.
 
In general, there are two common algorithms. The first is the hierarchical-based algorithm, which includes single link, complete linkage, group average and Ward's method. By aggregating or dividing, documents can be clustered into a hierarchical structure, which is suitable for browsing. However, such algorithms usually suffer from efficiency problems. The other algorithm is developed using the [[K-means algorithm]] and its variants. Generally hierarchical algorithms produce more in-depth information for detailed analyses, while algorithms based around variants of the [[K-means algorithm]] are more efficient and provide sufficient information for most purposes.<ref name="manning">Manning, Chris, and Hinrich Schütze, ''Foundations of Statistical Natural Language Processing'', MIT Press. Cambridge, MA: May 1999.</ref>{{rp|Ch.14}}
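
The contrast can be sketched in a few lines of Python (a minimal sketch assuming the scikit-learn library; the four-document toy corpus and the choice of two clusters are illustrative, not drawn from the cited source):

<syntaxhighlight lang="python">
# Minimal sketch contrasting hierarchical (Ward) and k-means clustering.
# The toy corpus and k=2 are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AgglomerativeClustering, KMeans

docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stock markets fell sharply today",
    "investors worry about market volatility",
]

X = TfidfVectorizer().fit_transform(docs)

# Hierarchical: builds a tree of merges (suitable for browsing),
# but scales poorly to large corpora.
ward = AgglomerativeClustering(n_clusters=2, linkage="ward")
hier_labels = ward.fit_predict(X.toarray())  # Ward linkage needs a dense array

# K-means: a flat partition, much cheaper per iteration.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
flat_labels = km.fit_predict(X)

print(hier_labels, flat_labels)
</syntaxhighlight>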
 
These algorithms can further be classified as hard or soft clustering algorithms. Hard clustering computes a hard assignment – each document is a member of exactly one cluster. The assignment of soft clustering algorithms is soft – a document's assignment is a distribution over all clusters. In a soft assignment, a document has fractional membership in several clusters.<ref name="manning"/>{{rp|499}} [[Dimensionality reduction]] methods can be considered a subtype of soft clustering; for documents, these include [[latent semantic indexing]] ([[truncated singular value decomposition]] on term histograms)<ref>{{cite web|url=http://nlp.stanford.edu/IR-book/pdf/16flat.pdf|title=Flat Clustering|work=Introduction to Information Retrieval|publisher=Cambridge University Press}}</ref> and [[topic model]]s.
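
A minimal sketch of the distinction, again assuming scikit-learn (the corpus, the two-dimensional LSI projection, and the Gaussian mixture used as the soft clusterer are all illustrative choices):

<syntaxhighlight lang="python">
# Hard vs. soft assignment on an LSI (truncated SVD) representation;
# the corpus, 2 components and 2 clusters are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stock markets fell sharply today",
    "investors worry about market volatility",
]

X = TfidfVectorizer().fit_transform(docs)
Z = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)  # LSI

# Hard: exactly one cluster label per document.
hard = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(Z)

# Soft: a probability distribution over clusters per document.
gmm = GaussianMixture(n_components=2, random_state=0).fit(Z)
soft = gmm.predict_proba(Z)

print(hard)           # one label per document
print(soft.round(2))  # fractional membership in each cluster
</syntaxhighlight>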
 
Other algorithms involve graph-based clustering, [[ontology (information science)|ontology]]-supported clustering and order-sensitive clustering.
 
Given a clustering, it can be beneficial to automatically derive human-readable labels for the clusters. [[Cluster labeling|Various methods]] exist for this purpose.
 
==Clustering in search engines==
A [[web search engine]] often returns thousands of pages in response to a broad query, making it difficult for users to browse or to identify relevant information. Clustering methods can be used to automatically group the retrieved documents into a list of meaningful categories, as is achieved by enterprise search engines such as [[Northern Light Group|Northern Light]] and [[Vivisimo]], consumer search engines such as [http://www.polymeta.com/ PolyMeta] and [http://www.helioid.com Helioid], or open source software such as [[Carrot2]].

Examples:
* Clustering divides the results of a search for "cell" into groups like "biology", "battery", and "prison".
* [http://FirstGov.gov FirstGov.gov], the official Web portal for the U.S. government, uses document clustering to automatically organize its search results into categories. For example, if a user submits "immigration", next to their list of results they will see categories for "Immigration Reform", "Citizenship and Immigration Services", "Employment", "Department of Homeland Security", and more.
 
==Procedures==
In practice, document clustering often takes the following steps:
1. [[Tokenization (lexical analysis)|Tokenization]]
 
Tokenization is the process of parsing text data into smaller units (tokens) such as words and phrases. Representations commonly built on tokens include the [[bag-of-words model]] and the [[N-gram|N-gram model]].
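
A minimal tokenization sketch in plain Python (the sample sentence is illustrative):

<syntaxhighlight lang="python">
# Minimal tokenization sketch: split text into word tokens, then form
# bigrams; the sample sentence is an illustrative assumption.
import re

text = "Document clustering groups similar documents."
tokens = re.findall(r"[a-z0-9']+", text.lower())  # simple word tokenizer
bigrams = list(zip(tokens, tokens[1:]))           # word 2-grams

print(tokens)   # ['document', 'clustering', 'groups', 'similar', 'documents']
print(bigrams)  # [('document', 'clustering'), ('clustering', 'groups'), ...]
</syntaxhighlight>
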
2. [[Stemming]] and [[lemmatization]]
 
Different tokens may carry similar information (e.g. "tokenization" and "tokenizing"). To avoid computing the same information repeatedly, each token can be reduced to its base form using stemming and lemmatization dictionaries.
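
For example (a sketch assuming the NLTK library with its WordNet data downloaded; the sample words are illustrative):

<syntaxhighlight lang="python">
# Stemming vs. lemmatization sketch; assumes nltk is installed and the
# 'wordnet' corpus has been downloaded via nltk.download('wordnet').
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Rule-based stemming maps inflected variants to a common base form:
print([stemmer.stem(w) for w in ["tokenization", "tokenizing", "tokenized"]])
# -> ['token', 'token', 'token']

# Dictionary-based lemmatization returns a real word (here a verb lemma):
print(lemmatizer.lemmatize("was", pos="v"))  # -> 'be'
</syntaxhighlight>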
 
3. Removing [[stop words]] and [[punctuation]]
 
Some tokens are less important than others. For instance, common words such as "the" might not be very helpful for revealing the essential characteristics of a text. So it is usually a good idea to eliminate stop words and punctuation marks before further analysis.
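
For example (a plain-Python sketch; the small stop-word list is illustrative, real systems use larger curated lists):

<syntaxhighlight lang="python">
# Stop-word and punctuation removal sketch; the tiny stop list is an
# illustrative assumption.
import re

stop_words = {"the", "a", "an", "of", "and", "is", "in", "to"}
text = "The cat sat on the mat, and the dog barked."

tokens = re.findall(r"[a-z']+", text.lower())       # drops punctuation
content = [t for t in tokens if t not in stop_words]
print(content)  # ['cat', 'sat', 'on', 'mat', 'dog', 'barked']
</syntaxhighlight>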
 
4. Computing term frequencies or [[tf-idf]]
 
After pre-processing the text data, we can then proceed to generate features. For document clustering, one of the most common ways to generate features for a document is to calculate the term frequencies of all its tokens. Although not perfect, these frequencies can usually provide some clues about the topic of the document. It is also sometimes useful to weight the term frequencies by the inverse document frequencies. See [[tf-idf]] for a detailed discussion.
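
A minimal sketch with scikit-learn (the three-document corpus is illustrative):

<syntaxhighlight lang="python">
# Term-frequency and tf-idf features with scikit-learn; the toy corpus
# is an illustrative assumption.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the cat sat", "the dog sat", "stocks fell"]

tf = CountVectorizer().fit_transform(docs)     # raw term counts
tfidf = TfidfVectorizer().fit_transform(docs)  # counts weighted by idf

print(tf.toarray())
print(tfidf.toarray().round(2))
</syntaxhighlight>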
 
5. Clustering
 
We can then cluster different documents based on the features we have generated. See the algorithm section in [[cluster analysis]] for different types of clustering methods.
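
For example, a minimal k-means sketch with scikit-learn (corpus and cluster count are illustrative):

<syntaxhighlight lang="python">
# Clustering tf-idf document vectors with k-means; the corpus and k=2
# are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stock markets fell sharply today",
    "investors worry about market volatility",
]

X = TfidfVectorizer(stop_words="english").fit_transform(docs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # e.g. [0 0 1 1]: pet documents vs. finance documents
</syntaxhighlight>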
 
6. Evaluation and visualization
 
Finally, the clustering models can be assessed by various metrics, and it is sometimes helpful to visualize the results by plotting the clusters into a low-dimensional (often two-dimensional) space. See [[multidimensional scaling]] as a possible approach.
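
A minimal sketch of both steps, assuming scikit-learn and matplotlib (the corpus, the silhouette metric and the use of metric MDS are illustrative choices):

<syntaxhighlight lang="python">
# Evaluation and 2-D visualization sketch: silhouette score plus
# multidimensional scaling; corpus and labels are illustrative.
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.manifold import MDS

docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stock markets fell sharply today",
    "investors worry about market volatility",
]
X = TfidfVectorizer(stop_words="english").fit_transform(docs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

print("silhouette:", silhouette_score(X, labels))  # higher is better

# Project the documents into 2-D for plotting.
coords = MDS(n_components=2, random_state=0).fit_transform(X.toarray())
plt.scatter(coords[:, 0], coords[:, 1], c=labels)
plt.show()
</syntaxhighlight>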
 
== Clustering v. Classifying ==
Clustering algorithms in computational text analysis group documents into what are called subsets or ''clusters'', where the algorithm's goal is to create internally coherent clusters that are distinct from one another.<ref>{{Cite web|url=http://nlp.stanford.edu/IR-book/|title=Introduction to Information Retrieval|website=nlp.stanford.edu|pages=349|access-date=2016-05-03}}</ref> Classification, on the other hand, is a form of [[supervised learning]] where the individual coder creates internal, coherent clusters that are based on either [[Inductive reasoning|inductive]], [[Deductive reasoning|deductive]], or [[Abductive reasoning|abductive]] reasoning. Clustering relies on no supervisory teacher imposing previously derived categories upon the data; instead, features of the documents are used to predict the "type" of documents.<ref>{{Cite journal|last=Shewale|date=April 2016|title=Document Clustering Using K Means Algorithms|url=http://ijre.org/wp-content/uploads/2016/04/IJRE_DOCUMENT_CLUSTERING_USING_K_MEANS_ALGORITHMS_30431.pdf|journal=International Journal of Research and Engineering}}</ref>
 
Clustering algorithms may rely on [[latent semantic analysis]], where the documents being analyzed are treated as a "bag of words" in a multidimensional space, rather than as the vector spaces used in classification. Such spaces are measured in terms of both "closeness" and "distance", and the clustering structures modeled may be ''hierarchical'' or ''flat'', and ''soft'' or ''hard''. In classification methods, algorithms instead measure the distances between the vector spaces formed by documents.<ref>{{Cite journal|last=Grimmer|first=Justin|last2=King|first2=Gary|date=2011-02-15|title=General purpose computer-assisted clustering and conceptualization|url=http://www.pnas.org/content/108/7/2643|journal=Proceedings of the National Academy of Sciences|language=en|volume=108|issue=7|pages=2643–2650|doi=10.1073/pnas.1018067108|issn=0027-8424|pmc=3041127|pmid=21292983}}</ref>
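
The difference can be sketched as follows (assuming scikit-learn; the toy corpora, the hand-coded labels and the choice of a naive Bayes classifier are illustrative):

<syntaxhighlight lang="python">
# Contrast sketch: unsupervised clustering discovers groups, while a
# supervised classifier learns from human-provided labels; the corpus
# and labels are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.naive_bayes import MultinomialNB

train_docs = ["cats and dogs", "pet food and toys", "stocks fell", "market rally"]
train_labels = ["pets", "pets", "finance", "finance"]  # human-coded categories

vec = TfidfVectorizer().fit(train_docs)
Xt = vec.transform(train_docs)

# Clustering: no teacher, groups emerge from document similarity alone.
print(KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(Xt))

# Classification: previously derived categories are imposed on the data.
clf = MultinomialNB().fit(Xt, train_labels)
print(clf.predict(vec.transform(["dog food", "stocks rally"])))
</syntaxhighlight>
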
==See also==
*[[Cluster analysis]]
*[[Fuzzy clustering]]
 
== References ==
{{reflist}}
 
== Bibliography ==
* Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. "Flat Clustering" in ''Introduction to Information Retrieval''. Cambridge University Press. 2008.
* Nicholas O. Andrews and Edward A. Fox. Recent Developments in Document Clustering. October 16, 2007. [http://eprints.cs.vt.edu/archive/00001000/01/docclust.pdf]
* Claudio Carpineto, Stanislaw Osiński, Giovanni Romano, Dawid Weiss. A survey of Web clustering engines. ACM Computing Surveys, Volume 41, Issue 3 (July 2009), Article No. 17, {{ISSN|0360-0300}}
* Wui Lee Chang, Kai Meng Tay, and Chee Peng Lim. A New Evolving Tree-Based Model with Local Re-learning for Document Clustering and Visualization. Neural Processing Letters, {{doi|10.1007/s11063-017-9597-3}}. [https://link.springer.com/article/10.1007/s11063-017-9597-3]
* [http://semanticquery.com/archive/semanticsearchart/researchBest.html Comparison of several popular clustering algorithms] – data and software to reproduce the results.
* Tanmay Basu, C.A. Murthy. CUES: A New Hierarchical Approach for Document Clustering. 2013.
 
{{Natural language processing}}
 
[[Category:Information retrieval techniques]]