Revision as of 15:28, 5 June 2021 by OpenNotes1 (edit summary: {{Natural language processing}}); revision as of 06:13, 13 March 2022 by BrownHairedGirl (edit summary: tag with {{Bare URL PDF}}; tag: AWB)
Line 13: In general, there are two common families of algorithms. The first is hierarchical clustering, which includes single linkage, complete linkage, group average and Ward's method. By aggregating or dividing, documents can be clustered into a hierarchical structure, which is suitable for browsing. However, such algorithms usually suffer from efficiency problems. The second family is based on the [[K-means algorithm]] and its variants. Generally, hierarchical algorithms produce more in-depth information for detailed analyses, while algorithms based on variants of the [[K-means algorithm]] are more efficient and provide sufficient information for most purposes.<ref name="manning">Manning, Chris, and Hinrich Schütze, ''Foundations of Statistical Natural Language Processing'', MIT Press. Cambridge, MA: May 1999.</ref>{{rp|Ch.14}} These algorithms can further be classified as hard or soft clustering algorithms. Hard clustering computes a hard assignment: each document is a member of exactly one cluster. Soft clustering computes a soft assignment: a document’s assignment is a distribution over all clusters, so the document has fractional membership in several clusters.<ref name="manning"/>{{rp|499}} [[Dimensionality reduction]] methods can be considered a subtype of soft clustering; for documents, these include [[latent semantic indexing]] ([[truncated singular value decomposition]] on term histograms)<ref>http://nlp.stanford.edu/IR-book/pdf/16flat.pdf {{Bare URL PDF|date=March 2022}}</ref> and [[topic model]]s. Other algorithms involve graph-based clustering, [[ontology (information science)|ontology]]-supported clustering and order-sensitive clustering.
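
The difference between hard and soft assignment described above can be illustrated with a minimal sketch. The choices here are assumptions made for illustration and do not come from the cited sources: documents and cluster centroids are toy term-count vectors, similarity is cosine similarity, and the soft assignment is obtained by applying a softmax (with a hypothetical <code>temperature</code> parameter) to the similarities. Real soft clustering algorithms typically derive the distribution from a probabilistic model instead.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def hard_assign(doc, centroids):
    """Hard assignment: index of the single most similar centroid."""
    sims = [cosine(doc, c) for c in centroids]
    return max(range(len(centroids)), key=sims.__getitem__)


def soft_assign(doc, centroids, temperature=0.1):
    """Soft assignment: a distribution over all clusters.

    Assumption: a softmax over cosine similarities, used only to
    illustrate fractional membership.
    """
    sims = [cosine(doc, c) for c in centroids]
    peak = max(sims)  # subtract the maximum for numerical stability
    weights = [math.exp((s - peak) / temperature) for s in sims]
    total = sum(weights)
    return [w / total for w in weights]


# Toy term counts over a hypothetical vocabulary ["cluster", "football", "goal"].
centroids = [[5.0, 0.0, 0.0], [0.0, 4.0, 3.0]]
doc = [2.0, 1.0, 1.0]

hard = hard_assign(doc, centroids)  # one cluster index
soft = soft_assign(doc, centroids)  # membership weights that sum to 1
```

In this sketch the hard assignment returns a single cluster index, while the soft assignment returns one weight per cluster, so a document that mentions terms from both clusters keeps a nonzero share in each.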

Document clustering: Difference between revisions