* [http://FirstGov.gov FirstGov.gov], the official Web portal for the U.S. government, uses document clustering to automatically organize its search results into categories. For example, if a user submits “immigration”, next to their list of results they will see categories for “Immigration Reform”, “Citizenship and Immigration Services”, “Employment”, “Department of Homeland Security”, and more.
* The Noggle search and clustering engine has clustered over 2,000 TED Talks into automatically generated clusters, making it possible to explore, for example, what TED talks from 2006–2016 had in common on the topic of "happiness". The results are available for further review.<ref>{{cite news|last1=von Thienen|first1=Lars|title=What would a robot see in TED talks?|url=https://www.noggle.online/knowledge-base/robot-see-ted-talks/|work=noggle.online|agency=TED.com}}</ref>
==Procedures==
In practice, document clustering often takes the following steps:
1. [[Tokenization_(lexical_analysis)|Tokenization]]
Tokenization is the process of parsing text data into smaller units (tokens) such as words and phrases. Commonly used tokenization methods include [[Bag-of-words model|Bag-of-words model]] and [[N-gram model|N-gram model]].
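The tokenization step can be sketched in plain Python. This is a minimal illustration, not a production tokenizer; the function names are illustrative:

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase and split on runs of letters/digits (a simple word tokenizer).
    return re.findall(r"[a-z0-9]+", text.lower())

def bag_of_words(tokens):
    # Bag-of-words: unordered token counts.
    return Counter(tokens)

def ngrams(tokens, n):
    # Contiguous n-grams preserve some local word order that bag-of-words discards.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = tokenize("Document clustering groups similar documents.")
# tokens == ['document', 'clustering', 'groups', 'similar', 'documents']
```

Real pipelines typically rely on an established tokenizer (handling hyphenation, apostrophes, Unicode, etc.), but the idea is the same: text in, list of units out.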
2. [[Stemming]] and [[lemmatization]]
Different tokens can carry similar information (e.g. "tokenization" and "tokenizing"), and we can avoid computing the same information repeatedly by reducing all tokens to their base forms using various stemming and lemmatization dictionaries.
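To make the idea concrete, here is a toy suffix stripper; real pipelines use a proper algorithm such as the Porter stemmer or a lemmatization dictionary, and the suffix list below is purely illustrative:

```python
def crude_stem(token):
    # Toy suffix stripping: remove the first matching suffix, keeping
    # at least a 3-character stem. Not linguistically sound; for illustration only.
    for suffix in ("ization", "izing", "ized", "ation", "ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

# "tokenization" and "tokenizing" now map to the same base form, "token",
# so they are counted as one feature rather than two.
```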
3. Removing [[stop words]] and [[punctuation]]
Some tokens are less informative than others. For instance, common words such as "the" are rarely helpful for revealing the essential characteristics of a text, so it is usually a good idea to eliminate stop words and punctuation marks before further analysis.
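This filtering step is a simple pass over the token list. The stop-word set below is a small illustrative subset, not a standard list:

```python
import string

STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is"}  # illustrative subset

def clean(tokens):
    # Drop stop words and tokens consisting entirely of punctuation.
    return [t for t in tokens
            if t not in STOP_WORDS and not all(c in string.punctuation for c in t)]

print(clean(["the", "cat", "sat", "on", "mat", "."]))  # → ['cat', 'sat', 'on', 'mat']
```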
4. Computing term frequencies or [[tf-idf]]
After pre-processing the text data, we can proceed to generate features. For document clustering, one of the most common ways to generate features for a document is to calculate the term frequencies of all its tokens. Although not perfect, these frequencies can usually provide some clues about the topic of the document. It is also often useful to weight the term frequencies by the inverse document frequencies. See [[tf-idf]] for detailed discussions.
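A minimal tf-idf computation in plain Python, using the common tf-idf weighting tf(t, d) · log(N / df(t)); variants of the formula exist, and libraries add smoothing and normalization on top of this basic form:

```python
import math
from collections import Counter

def tf_idf(docs):
    # docs: list of token lists. Returns one {term: weight} dict per document.
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc)
        # Term frequency scaled down by how common the term is across documents.
        vectors.append({t: (c / total) * math.log(n / df[t]) for t, c in tf.items()})
    return vectors

docs = [["apple", "fruit"], ["apple", "pie"], ["fruit", "salad"]]
weights = tf_idf(docs)
# "apple" appears in 2 of 3 documents, so its idf factor is log(3/2).
```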
5. Clustering
We can then cluster different documents based on the features we have generated. See the algorithm section in [[cluster analysis]] for different types of clustering methods.
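As one concrete example, k-means (a common choice for document clustering) can be sketched as follows; this is a bare-bones version assuming dense feature vectors, with fixed iterations instead of a convergence test:

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    # Plain k-means on dense feature vectors (e.g. rows of a term-frequency matrix).
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialize centers from the data points
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        # Update step: each center becomes the mean of its assigned points.
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centers[j] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels

doc_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = kmeans(doc_vectors, 2)
# The first two documents land in one cluster, the last two in the other.
```

For text, cosine similarity on tf-idf vectors is often preferred over Euclidean distance, since it is insensitive to document length.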
6. Evaluation and visualization
Finally, the clustering models can be assessed by various metrics, and it is sometimes helpful to visualize the results by plotting the clusters in a low-dimensional (e.g. two-dimensional) space. See [[multidimensional scaling]] for one possible approach.
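When labeled documents are available for validation, one simple external metric is purity: the fraction of documents whose cluster's majority class matches their own class. A minimal sketch:

```python
from collections import Counter

def purity(labels, truth):
    # Group the true classes by assigned cluster.
    clusters = {}
    for l, t in zip(labels, truth):
        clusters.setdefault(l, []).append(t)
    # Count, per cluster, how many documents belong to its majority class.
    correct = sum(Counter(members).most_common(1)[0][1]
                  for members in clusters.values())
    return correct / len(labels)

print(purity([0, 0, 1, 1], ["a", "a", "b", "a"]))  # → 0.75
```

Purity rewards homogeneous clusters but degenerates when every document gets its own cluster, so it is usually reported alongside other metrics.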
== Clustering v. Classifying ==