Revision as of 08:33, 11 June 2016, Vonthienen (13 edits): adding references for Clustering in search engines
Revision as of 03:22, 12 June 2016, Yuxiaosun (21 edits): Inaccurate information about the difference between clustering and classification. There is no distinct difference between the feature spaces used in clustering and classification. The main difference is whether we have a label variable.
Line 29:
== Clustering v. Classifying ==
Clustering algorithms in computational text analysis group documents into what are called subsets or ''clusters'', where the algorithm's goal is to create internally coherent clusters that are distinct from one another.<ref>{{Cite web|url=http://nlp.stanford.edu/IR-book/|title=Introduction to Information Retrieval|website=nlp.stanford.edu|pages=349|access-date=2016-05-03}}</ref> Classification, on the other hand, is a form of [[supervised learning]] where the individual coder creates internal, coherent clusters that are based on either [[Inductive reasoning|inductive]], [[Deductive reasoning|deductive]], or [[Abductive reasoning|abductive]] reasoning. Clustering relies on no supervisory teacher imposing previously derived categories upon the data, just ~~types~~ features of ~~distances, of which~~ the ~~most~~ documents ~~commonly~~ are ~~found distance is [[Euclidean distance|Euclidean]].<ref>{{Cite web|url=http://nlp.stanford.edu/IR-book/|title=Introduction~~ used to ~~Information Retrieval|website=nlp.stanford.edu|pages=349–50|access-date=2016-05-03}}</ref> Implementation~~ predict the ~~system~~ "type" of ~~document clustering using k-means algorithm, which makes faster searching of unstructured data as well as structured data~~ documents.<ref>{{Cite journal|last=Shewale|first=|date=April 2016|title=DOCUMENT CLUSTERING USING K MEANS ALGORITHMS|url=http://ijre.org/wp-content/uploads/2016/04/IJRE_DOCUMENT_CLUSTERING_USING_K_MEANS_ALGORITHMS_30431.pdf|journal=International Journal of Research and ~~Engineering|doi=|pmid=|access-date=}}</ref>~~ Clustering algorithms rely on [[Latent semantic analysis|Latent Semantic Analysis]] where the documents being analyzed are treated as a "[[Bag-of-words model|bag of words]]," or multidimensional spaces rather than [[vector space]]s ([[classification]]).
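The point made in the edit summary, that clustering and classification can share one feature space and differ mainly in whether a label variable is available, can be illustrated with a minimal sketch. The toy documents, the label names, and every function name below are hypothetical and chosen only for illustration; the k-means loop and the nearest-centroid classifier are simplified stand-ins, not the methods of any cited source.

```python
import math
from collections import Counter


def vectorize(text, vocab):
    """Bag-of-words count vector for one document over a fixed vocabulary."""
    counts = Counter(text.split())
    return [counts[word] for word in vocab]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def kmeans(vectors, initial_centroids, iterations=10):
    """Unsupervised: assigns each vector to a cluster id, using no labels."""
    centroids = [list(c) for c in initial_centroids]
    assignments = []
    for _ in range(iterations):
        assignments = [
            min(range(len(centroids)), key=lambda i: euclidean(v, centroids[i]))
            for v in vectors
        ]
        for i in range(len(centroids)):
            members = [v for v, a in zip(vectors, assignments) if a == i]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[i] = centroid(members)
    return assignments


def train_nearest_centroid(vectors, labels):
    """Supervised: builds one centroid per human-provided label."""
    grouped = {}
    for v, label in zip(vectors, labels):
        grouped.setdefault(label, []).append(v)
    return {label: centroid(vs) for label, vs in grouped.items()}


def classify(vector, label_centroids):
    """Returns the label whose centroid is closest to the vector."""
    return min(label_centroids, key=lambda lb: euclidean(vector, label_centroids[lb]))


# Hypothetical toy corpus; both methods use the same feature vectors.
docs = ["cat dog pet", "dog pet animal", "stock market price", "market price trade"]
vocab = sorted({word for d in docs for word in d.split()})
X = [vectorize(d, vocab) for d in docs]

# Clustering: only the features are given.
cluster_ids = kmeans(X, initial_centroids=[X[0], X[2]])

# Classification: the same features plus a label variable.
labels = ["animals", "animals", "finance", "finance"]
model = train_nearest_centroid(X, labels)
predicted = classify(vectorize("dog cat animal", vocab), model)

print(cluster_ids, predicted)
```

Both halves of the sketch measure Euclidean distance over the same bag-of-words vectors; the only extra input to the classifier is the `labels` list. The initial centroids are fixed by hand here to keep the run deterministic, which a practical k-means implementation would not normally do.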
Examples of such spaces are measured in terms of both "closeness" and "distance," where ''hierarchical'' and ''flat'' clustering structures are modeled, or ''soft and hard.'' In classification methods, algorithms measure the distances between vector spaces formed between documents.<ref>{{Cite journal|last=Grimmer|first=Justin|last2=King|first2=Gary|date=2011-02-15|title=General purpose computer-assisted clustering and conceptualization|url=http://www.pnas.org/content/108/7/2643|journal=Proceedings of the National Academy of Sciences|language=en|volume=108|issue=7|pages=2643–2650|doi=10.1073/pnas.1018067108|issn=0027-8424|pmc=3041127|pmid=21292983}}</ref>

== References ==

Document clustering: Difference between revisions