The applications of document clustering can be categorized into two types, online and offline. Online applications are usually constrained by efficiency problems when compared to offline applications.
In general, there are two common families of algorithms. The first is the hierarchical algorithm, which includes single-link, complete-linkage, group-average, and Ward's method. By agglomerating or dividing, documents can be clustered into a hierarchical structure, which is suitable for browsing. However, such algorithms usually suffer from efficiency problems. The other family is developed from the [[K-means algorithm]] and its variants. Generally, hierarchical algorithms produce more in-depth information for detailed analyses, while algorithms based on variants of the [[K-means algorithm]] are more efficient and provide sufficient information for most purposes.<ref>Manning, Chris, and Hinrich Schütze, ''Foundations of Statistical Natural Language Processing'', MIT Press. Cambridge, MA: May 1999. Chapter 14.</ref>
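The K-means approach described above can be illustrated with a minimal sketch in pure Python. The toy term-frequency vectors, the choice of k&nbsp;=&nbsp;2, and the initial centroids are illustrative assumptions, not part of any standard corpus or reference implementation.

```python
# Minimal k-means sketch on toy term-frequency vectors.
# Documents, vocabulary, and initial centroids are illustrative assumptions.

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def kmeans(docs, centroids, iters=10):
    """Repeat: assign each vector to its nearest centroid, then
    recompute each centroid as the mean of its assigned vectors."""
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for d in docs:
            nearest = min(range(len(centroids)),
                          key=lambda i: euclidean(d, centroids[i]))
            clusters[nearest].append(d)
        centroids = [
            [sum(col) / len(c) for col in zip(*c)] if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return clusters

# Toy 2-term frequency vectors: two documents dominated by term 1,
# two dominated by term 2.
docs = [[3, 0], [2, 1], [0, 3], [1, 2]]
clusters = kmeans(docs, centroids=[[3, 0], [0, 3]])
```

On this toy data the two clusters separate the term-1-heavy documents from the term-2-heavy ones after the first iteration.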
These algorithms can further be classified as hard or soft clustering algorithms. Hard clustering computes a hard assignment – each document is a member of exactly one cluster. The assignment of soft clustering algorithms is soft – a document's assignment is a distribution over all clusters. In a soft assignment, a document has fractional membership in several clusters.<ref>Manning, Chris, and Hinrich Schütze, ''Foundations of Statistical Natural Language Processing'', MIT Press. Cambridge, MA: May 1999. Pg 499.</ref>
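The hard/soft distinction can be sketched for a single document. The centroids and the document vector below are illustrative assumptions, and the softmax over negative distances is just one of several possible soft-assignment schemes.

```python
# Sketch contrasting a hard with a soft cluster assignment for one
# document. Centroids, document vector, and the softmax-over-negative-
# distances scheme are illustrative assumptions.
import math

centroids = [[2.5, 0.5], [0.5, 2.5]]
doc = [1.5, 1.0]

dists = [math.dist(doc, c) for c in centroids]

# Hard assignment: the single nearest cluster.
hard = dists.index(min(dists))

# Soft assignment: a distribution over all clusters, giving the
# document fractional membership in each.
weights = [math.exp(-d) for d in dists]
total = sum(weights)
soft = [w / total for w in weights]
```

Here the hard assignment places the document wholly in cluster 0, while the soft assignment gives it a larger share of membership in cluster 0 and a smaller share in cluster 1, with the shares summing to one.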
Other algorithms involve graph-based clustering, ontology-supported clustering, and order-sensitive clustering.
== Clustering v. Classifying ==
Clustering algorithms in computational text analysis group documents into subsets, or ''clusters'', where the algorithm's goal is to create internally coherent clusters that are distinct from one another.<ref>{{Cite web|url=http://nlp.stanford.edu/IR-book/|title=Introduction to Information Retrieval|website=nlp.stanford.edu|pages=349|access-date=2016-05-03}}</ref> Classification, on the other hand, is a form of [[supervised learning]] in which the individual coder creates internal, coherent categories based on [[Inductive reasoning|inductive]], [[Deductive reasoning|deductive]], or [[Abductive reasoning|abductive]] reasoning. Clustering relies on no supervisory teacher imposing previously derived categories upon the data, only on a notion of distance between documents, of which the most commonly used is [[Euclidean distance|Euclidean]].<ref>{{Cite web|url=http://nlp.stanford.edu/IR-book/|title=Introduction to Information Retrieval|website=nlp.stanford.edu|pages=349–50|access-date=2016-05-03}}</ref>
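The point that clustering needs no predefined categories, only a distance between documents, can be sketched as follows. The three toy term-count vectors and their names are illustrative assumptions.

```python
# Clustering operates on pairwise distances alone, with no labelled
# categories. The toy term-count vectors below are illustrative
# assumptions.
import math

vectors = {
    "doc_a": [4, 0, 1],
    "doc_b": [3, 1, 0],
    "doc_c": [0, 4, 3],
}

def distance(u, v):
    """Euclidean distance, the most common choice of metric."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

# All pairwise distances; no category labels are involved anywhere.
pairs = {
    (a, b): distance(vectors[a], vectors[b])
    for a in vectors for b in vectors if a < b
}
```

On this toy data, doc_a and doc_b sit close together while doc_c lies far from both, so a distance-based clusterer would group the first two without ever being told what any document is "about".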
Clustering algorithms rely on [[Latent semantic analysis|Latent Semantic Analysis]] where the documents being analyzed are treated as a "[[Bag-of-words model|bag of words]]," or multidimensional spaces rather than [[
== References ==