Document clustering: Difference between revisions

Revision as of 12:12, 15 April 2018 by GermanJoe (edit summary: fix - layout / structure)
Latest revision as of 02:19, 10 January 2025 by The Eloquent Peasant (edit summary: Importing Wikidata short description: "Grouping texts by similarity"; tag: Shortdesc helper)
(6 intermediate revisions by 6 users not shown)
Line 1:
{{Short description|Grouping texts by similarity}}
{{Multiple issues|
{{disputed|date=March 2014}}
{{more footnotes needed|date=March 2014}}
}}

Line 9 ⟶ 10:
Document clustering involves the use of descriptors and descriptor extraction. Descriptors are sets of words that describe the contents within the cluster. Document clustering is generally considered to be a centralized process. Examples of document clustering include web document clustering for search users.

Applications of document clustering fall into two types: online and offline. Online applications are usually more constrained by efficiency requirements than offline applications. Text clustering may be used for different tasks, such as grouping similar documents (news, tweets, etc.) and analyzing customer or employee feedback to discover meaningful implicit subjects across all documents.

In general, there are two common families of algorithms. The first is hierarchical clustering, which includes single linkage, complete linkage, group average and Ward's method. By aggregating or dividing clusters, documents can be organized into a hierarchical structure, which is suitable for browsing. However, such algorithms usually suffer from efficiency problems. The second is based on the [[K-means algorithm]] and its variants. Generally, hierarchical algorithms produce more in-depth information for detailed analyses, while algorithms based on variants of the [[K-means algorithm]] are more efficient and provide sufficient information for most purposes.<ref name="manning">Manning, Chris, and Hinrich Schütze, ''Foundations of Statistical Natural Language Processing'', MIT Press. Cambridge, MA: May 1999.</ref>{{rp|Ch.14}}

These algorithms can further be classified as hard or soft clustering algorithms. Hard clustering computes a hard assignment: each document is a member of exactly one cluster.
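As a minimal sketch of a hard assignment, the following Python example runs a k-means-style loop over bag-of-words vectors, using cosine similarity in place of Euclidean distance. The function names (`tf_vector`, `kmeans_hard`), the seeding of centroids with the first k documents, and the toy documents are illustrative assumptions, not taken from any cited source.

```python
import math
from collections import Counter


def tf_vector(text):
    """Bag-of-words term counts for one document (lowercased, whitespace split)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean_vector(vectors):
    """Centroid: the term-wise average of a list of sparse vectors."""
    total = Counter()
    for v in vectors:
        total.update(v)  # Counter.update adds counts term by term
    return {t: w / len(vectors) for t, w in total.items()}


def kmeans_hard(docs, k, iterations=10):
    """Return one cluster index per document (a hard assignment)."""
    vectors = [tf_vector(d) for d in docs]
    # Assumption: seed centroids with the first k documents (deterministic).
    centroids = [dict(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each document joins its most similar centroid.
        labels = [
            max(range(k), key=lambda c: cosine(v, centroids[c]))
            for v in vectors
        ]
        # Update step: recompute each centroid from its current members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = mean_vector(members)
    return labels


docs = [
    "stock market shares fall",
    "football match final score",
    "market shares rise on stock news",
    "football team wins match",
]
labels = kmeans_hard(docs, k=2)
print(labels)
```

Each document receives exactly one label, so the two finance texts end up in one cluster and the two football texts in the other. Real systems typically add TF-IDF weighting, better tokenization, and randomized or k-means++ seeding, none of which are shown here.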
The assignment of soft clustering algorithms is soft: a document's assignment is a distribution over all clusters. In a soft assignment, a document has fractional membership in several clusters.<ref name="manning"/>{{rp|499}}

[[Dimensionality reduction]] methods can be considered a subtype of soft clustering; for documents, these include [[latent semantic indexing]] ([[truncated singular value decomposition]] on term histograms)<ref>http://nlp.stanford.edu/IR-book/pdf/16flat.pdf {{Bare URL PDF|date=March 2022}}</ref> and [[topic model]]s.

Other algorithms involve graph-based clustering, [[ontology (information science)|ontology]]-supported clustering and order-sensitive clustering.

Given a clustering, it can be beneficial to automatically derive human-readable labels for the clusters. [[Cluster labeling|Various methods]] exist for this purpose.

Line 54 ⟶ 55:
==See also==
[[Cluster (disambiguation)|Cluster]]
[[Cluster Analysis]]
[[Fuzzy clustering]]

Line 65:
*Claudio Carpineto, Stanislaw Osiński, Giovanni Romano, Dawid Weiss. A survey of Web clustering engines. ACM Computing Surveys, Volume 41, Issue 3 (July 2009), Article No. 17, {{ISSN|0360-0300}}
*Wui Lee Chang, Kai Meng Tay, and Chee Peng Lim, A New Evolving Tree-Based Model with Local Re-learning for Document Clustering and Visualization, Neural Processing Letters, DOI: 10.1007/s11063-017-9597-3. https://link.springer.com/article/10.1007/s11063-017-9597-3

{{Natural language processing}}
[[Category:Information retrieval techniques]]