few additions |
No edit summary |
Line 17:
* removing repeating characters ("I looooove it!" → "I love it!")
While this may be done manually, and usually is in the case of [[ad hoc]] and personal documents, many [[programming language]]s support mechanisms which enable text normalization. Blunt regular expressions are not always adequate for these tasks, however; some cases require a dictionary and other linguistic resources.
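As an illustration of why a regular expression alone can fall short, the following minimal Python sketch collapses runs of repeated characters. The function name <code>collapse_repeats</code> and its <code>keep</code> parameter are hypothetical names chosen for this example, and the rule "shorten any run of three or more identical characters" is an assumption, not a standard algorithm.

```python
import re

# Runs of three or more identical characters; group 1 holds the repeated character.
_REPEATS = re.compile(r"(.)\1{2,}")

def collapse_repeats(text: str, keep: int = 1) -> str:
    """Shorten every run of 3+ identical characters to `keep` copies (hypothetical helper)."""
    return _REPEATS.sub(lambda m: m.group(1) * keep, text)

print(collapse_repeats("I looooove it!"))   # I love it!
print(collapse_repeats("coooool"))          # col  (wrong: the intended word is "cool")
print(collapse_repeats("coooool", keep=2))  # cool
```

No single value of <code>keep</code> is right for every word ("looooove" needs one "o", "coooool" needs two), which is where a dictionary lookup on the candidate forms would be needed.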
Text normalization is useful, for example, for comparing two sequences of characters which mean the same thing but are represented differently. It is also crucial for search engines and corpus management. Examples of this kind of normalization include, but are not limited to, "don't" vs "do not", "I'm" vs "I am", and "Can't" vs "Cannot". Likewise, "1" and "one" are equivalent, as are "1st" and "first". Through text processing, such strings can be treated as the same rather than as different.
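The comparison described above can be sketched in Python with a small lookup table. The table <code>EQUIVALENTS</code> and the function <code>normalize</code> are hypothetical names for this example, the table covers only the pairs mentioned here, and the sketch assumes straight ASCII apostrophes; a practical system would rely on much larger linguistic resources.

```python
import re

# Hypothetical, deliberately tiny lookup table of equivalent forms.
EQUIVALENTS = {
    "don't": "do not",
    "i'm": "i am",
    "can't": "cannot",
    "1": "one",
    "1st": "first",
}

# A token is a run of letters/digits, optionally followed by an apostrophe suffix.
_TOKEN = re.compile(r"[a-z0-9]+(?:'[a-z]+)?")

def normalize(text: str) -> str:
    """Lowercase, tokenize, and replace each token by its canonical form if one is listed."""
    tokens = _TOKEN.findall(text.lower())
    return " ".join(EQUIVALENTS.get(token, token) for token in tokens)

print(normalize("Don't stop") == normalize("do not stop"))  # True
print(normalize("I'm 1st"))                                  # i am first
```

Two strings are then treated as the same when their normalized forms are equal, so "Can't" and "Cannot" both compare as "cannot".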