Document clustering: Difference between revisions

Content deleted Content added
Line 9:
Document clustering involves the use of descriptors and descriptor extraction. Descriptors are sets of words that describe the contents within the cluster. Document clustering is generally considered to be a centralized process. Examples of document clustering include web document clustering for search users.
 
The application of document clustering can be categorized to two types, online and offline. Online applications are usually constrained by efficiency problems when compared to offline applications.Text clustering may be used for different tasks, such as grouping similar documents (news, tweets, etc.) and the analysis of customer/employee feedback, discovering meaningful implicit subjects across all documents.
 
In general, there are two common algorithms. The first one is the hierarchical based algorithm, which includes single link, complete linkage, group average and Ward's method. By aggregating or dividing, documents can be clustered into hierarchical structure, which is suitable for browsing. However, such an algorithm usually suffers from efficiency problems. The other algorithm is developed using the [[K-means algorithm]] and its variants. Generally hierarchical algorithms produce more in-depth information for detailed analyses, while algorithms based around variants of the [[K-means algorithm]] are more efficient and provide sufficient information for most purposes.<ref name="manning">Manning, Chris, and Hinrich Schütze, ''Foundations of Statistical Natural Language Processing'', MIT Press. Cambridge, MA: May 1999.</ref>{{rp|Ch.14}}