In practice, document clustering often takes the following steps:
1. [[Tokenization]]
Tokenization is the process of parsing text data into smaller units (tokens) such as words and phrases. Commonly used tokenization methods include the [[Bag-of-words model|bag-of-words]] and [[N-gram]] models.
2. [[Stemming]] and [[lemmatization]]
Different tokens can carry similar information (e.g., "walking" and "walk"). Stemming and lemmatization reduce related tokens to a common base form so that they are counted together.
3. Removing [[stop words]] and [[punctuation]]
Some tokens are less important than others. For instance, common words such as "the" are unlikely to reveal the essential characteristics of a text, so it is usually a good idea to eliminate stop words and punctuation marks before further analysis.
4. Computing term frequencies or [[tf-idf]]
Counting term frequencies, or weighting them with [[tf-idf]], turns each document into a numeric vector so that similarities between documents can be measured.
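The four preprocessing steps above can be sketched in plain Python. Everything here is illustrative: the stop-word list, the suffix-stripping "stemmer", and the two sample documents are toy placeholders rather than real linguistic resources, and production systems would use a dedicated library instead.

```python
# A minimal, stdlib-only sketch of the four preprocessing steps:
# tokenization, crude stemming, stop-word removal, and tf-idf weighting.
import math
import re
from collections import Counter

# Toy stop-word list (step 3); real lists are much longer.
STOP_WORDS = {"the", "a", "an", "and", "are", "is", "on", "of"}

def tokenize(text):
    # Step 1: lowercase and split on non-letters; punctuation disappears.
    return re.findall(r"[a-z]+", text.lower())

def stem(token):
    # Step 2: naive suffix stripping, a stand-in for a real stemmer.
    for suffix in ("ing", "ly", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    # Step 3: drop stop words, then reduce the rest to base forms.
    return [stem(t) for t in tokenize(text) if t not in STOP_WORDS]

def tfidf(docs):
    # Step 4: weight each term by term frequency times log inverse
    # document frequency, yielding one sparse vector per document.
    tokenized = [preprocess(d) for d in docs]
    n = len(tokenized)
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

docs = ["The cat sat on the mat.", "Cats and dogs are playing."]
vecs = tfidf(docs)
```

Note that "cat" appears (after stemming) in both documents, so its idf is log(2/2) = 0 and it contributes no weight, while terms unique to one document, such as "sat", receive positive weight.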
== Clustering v. Classifying ==
Clustering algorithms in computational text analysis group documents into subsets, called ''clusters'', where the algorithm's goal is to create internally coherent clusters that are distinct from one another.<ref>{{Cite web|url=http://nlp.stanford.edu/IR-book/|title=Introduction to Information Retrieval|website=nlp.stanford.edu|pages=349|access-date=2016-05-03}}</ref> Classification, on the other hand, is a form of [[supervised learning]] in which the features of the documents are used to predict the "type" of each document.
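The contrast can be sketched with two toy routines: an unsupervised k-means loop that discovers groups without any labels, and a supervised nearest-example classifier that predicts a label from labeled training data. The 2-D points, category names, and initialisation strategy are all illustrative placeholders; real systems would cluster or classify high-dimensional document vectors such as tf-idf weights.

```python
# Clustering (unsupervised) vs. classification (supervised), sketched
# on 2-D points standing in for document vectors.
import math

def kmeans(points, k=2, iters=10):
    # Unsupervised: no labels are given; the algorithm invents k groups
    # by alternating assignment and centroid updates.
    centroids = points[:k]  # naive initialisation for the sketch
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[i].append(p)
        centroids = [
            tuple(sum(c) / len(pts) for c in zip(*pts)) if pts else centroids[i]
            for i, pts in enumerate(clusters)
        ]
    return clusters

def classify(point, labeled):
    # Supervised: predicts the label of the nearest labeled example.
    return min(labeled, key=lambda lp: math.dist(point, lp[1]))[0]

docs_2d = [(0.1, 0.2), (0.0, 0.1), (5.0, 5.1), (5.2, 4.9)]
groups = kmeans(docs_2d)  # two coherent clusters emerge, unnamed

training = [("sports", (0.0, 0.0)), ("finance", (5.0, 5.0))]
label = classify((4.8, 5.2), training)  # -> "finance"
```

The key difference is visible in the signatures: `kmeans` sees only the points and returns anonymous groups, while `classify` requires labeled examples and returns one of their predefined labels.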
== References ==