In practice, document clustering often takes the following steps:
1. [[Tokenization]]
Tokenization is the process of parsing text data into smaller units (tokens) such as words and phrases. Commonly used tokenization methods include the [[Bag-of-words model|bag-of-words]] and [[N-gram]] models.
2. [[Stemming]] and [[lemmatization]]
Different tokens can carry similar information (e.g., "walking" and "walk"). Stemming and lemmatization reduce related tokens to a common base form so that they are counted together.
3. Removing [[stop words]] and [[punctuation]]
Some tokens are less important than others. For instance, common words such as "the" are unlikely to reveal the essential characteristics of a text, so it is usually a good idea to eliminate stop words and punctuation marks before further analysis.
4. Computing term frequencies or [[tf-idf]]
Counting term frequencies, or weighting them with [[tf-idf]], turns each document into a numeric vector so that similarities between documents can be measured.
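The four preprocessing steps above can be sketched in plain Python. Everything here is illustrative: the stop-word list, the suffix-stripping "stemmer", and the two sample documents are toy placeholders rather than real linguistic resources, and production systems would use a dedicated library instead.

```python
# A minimal, stdlib-only sketch of the four preprocessing steps:
# tokenization, crude stemming, stop-word removal, and tf-idf weighting.
import math
import re
from collections import Counter

# Toy stop-word list (step 3); real lists are much longer.
STOP_WORDS = {"the", "a", "an", "and", "are", "is", "on", "of"}

def tokenize(text):
    # Step 1: lowercase and split on non-letters; punctuation disappears.
    return re.findall(r"[a-z]+", text.lower())

def stem(token):
    # Step 2: naive suffix stripping, a stand-in for a real stemmer.
    for suffix in ("ing", "ly", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    # Step 3: drop stop words, then reduce the rest to base forms.
    return [stem(t) for t in tokenize(text) if t not in STOP_WORDS]

def tfidf(docs):
    # Step 4: weight each term by term frequency times log inverse
    # document frequency, yielding one sparse vector per document.
    tokenized = [preprocess(d) for d in docs]
    n = len(tokenized)
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

docs = ["The cat sat on the mat.", "Cats and dogs are playing."]
vecs = tfidf(docs)
```

Note that "cat" appears (after stemming) in both documents, so its idf is log(2/2) = 0 and it contributes no weight, while terms unique to one document, such as "sat", receive positive weight.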
== Clustering v. Classifying ==
Clustering algorithms in computational text analysis group documents into subsets, called ''clusters'', where the algorithm's goal is to create internally coherent clusters that are distinct from one another.<ref>{{Cite web|url=http://nlp.stanford.edu/IR-book/|title=Introduction to Information Retrieval|website=nlp.stanford.edu|pages=349|access-date=2016-05-03}}</ref> Classification, on the other hand, is a form of [[supervised learning]] in which the features of the documents are used to predict the "type" of each document.
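The contrast can be sketched with two toy routines: an unsupervised k-means loop that discovers groups without any labels, and a supervised nearest-example classifier that predicts a label from labeled training data. The 2-D points, category names, and initialisation strategy are all illustrative placeholders; real systems would cluster or classify high-dimensional document vectors such as tf-idf weights.

```python
# Clustering (unsupervised) vs. classification (supervised), sketched
# on 2-D points standing in for document vectors.
import math

def kmeans(points, k=2, iters=10):
    # Unsupervised: no labels are given; the algorithm invents k groups
    # by alternating assignment and centroid updates.
    centroids = points[:k]  # naive initialisation for the sketch
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[i].append(p)
        centroids = [
            tuple(sum(c) / len(pts) for c in zip(*pts)) if pts else centroids[i]
            for i, pts in enumerate(clusters)
        ]
    return clusters

def classify(point, labeled):
    # Supervised: predicts the label of the nearest labeled example.
    return min(labeled, key=lambda lp: math.dist(point, lp[1]))[0]

docs_2d = [(0.1, 0.2), (0.0, 0.1), (5.0, 5.1), (5.2, 4.9)]
groups = kmeans(docs_2d)  # two coherent clusters emerge, unnamed

training = [("sports", (0.0, 0.0)), ("finance", (5.0, 5.0))]
label = classify((4.8, 5.2), training)  # -> "finance"
```

The key difference is visible in the signatures: `kmeans` sees only the points and returns anonymous groups, while `classify` requires labeled examples and returns one of their predefined labels.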
== References ==