Revision as of 15:52, 20 March 2012 by Soshial (edit summary: few additions) | Revision as of 06:51, 15 September 2012 by 49.204.56.84 (no edit summary)
Line 17:
* removing repeating characters ("I looooove it!" → "I love it!")

While this may be done manually, and usually is in the case of [[ad hoc]] and personal documents, many [[programming language]]s support mechanisms which enable text normalization. These tasks cannot always be performed with blunt regular expressions; in some cases they require a dictionary and other linguistic resources.

Text normalization is useful, for example, for comparing two sequences of characters which mean the same but are represented differently. It is also crucial for search engines and corpus management. Examples of this kind of normalization include, but are not limited to, "don't" vs "do not", "I'm" vs "I am", and "can't" vs "cannot". Further, "1" and "one" are the same, "1st" is the same as "first", and so on. Instead of treating these strings as different, text processing allows one to treat them as the same.
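The equivalences mentioned above can be illustrated with a minimal Python sketch. The `EQUIVALENTS` table and the `normalize` function are hypothetical names introduced only for this example; a real system would rely on much larger dictionaries and linguistic resources rather than a five-entry lookup.

```python
import re

# Hypothetical lookup table for illustration only; it covers just the
# examples given in the text.
EQUIVALENTS = {
    "don't": "do not",
    "i'm": "i am",
    "can't": "cannot",
    "1": "one",
    "1st": "first",
}


def normalize(text):
    """Return a lowercased, space-joined canonical form of text."""
    text = text.lower()
    # Collapse any run of three or more identical characters to one.
    # This is a blunt rule: it also turns "goooood" into "god".
    text = re.sub(r"(.)\1{2,}", r"\1", text)
    # Keep only word-like tokens (letters, digits, apostrophes).
    tokens = re.findall(r"[a-z0-9']+", text)
    # Replace each token by its canonical form when one is known.
    return " ".join(EQUIVALENTS.get(token, token) for token in tokens)


print(normalize("I looooove it!"))
print(normalize("Don't") == normalize("do not"))
print(normalize("1st") == normalize("first"))
```

With this sketch, "Don't" and "do not" compare equal after normalization, as do "1st" and "first". The repeated-character rule shows why regular expressions alone fall short: it cannot tell that "goooood" should become "good" rather than "god" without consulting a dictionary.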

Text normalization: Difference between revisions