Document clustering

2. [[Stemming]] and [[lemmatization]]
 
Different tokens may carry similar information (e.g. ''tokenization'' and ''tokenizing''). We can avoid computing similar information repeatedly by reducing all tokens to their base forms using stemming and lemmatization dictionaries.
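The idea of reducing variant tokens to a shared base form can be sketched with a toy suffix-stripping stemmer. This is purely illustrative: the suffix list and minimum stem length here are arbitrary assumptions, and real pipelines would use a proper stemmer (e.g. Porter or Snowball) or a dictionary-based lemmatizer instead.

```python
# Toy suffix-stripping stemmer (illustrative sketch, not Porter's algorithm).
# Longer suffixes are listed first so they are tried before their substrings.
SUFFIXES = ["ization", "izing", "ation", "ize", "ing", "ed", "s"]

def stem(token: str) -> str:
    """Strip the first (longest) matching suffix, keeping a stem of >= 3 chars."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

print(stem("tokenization"))  # token
print(stem("tokenizing"))    # token
print(stem("tokens"))        # token
```

Because ''tokenization'', ''tokenizing'', and ''tokens'' all map to the single base form ''token'', they contribute to one feature rather than three when documents are later vectorized for clustering.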
 
3. Removing [[stop words]] and [[punctuation]]