Text normalization: Difference between revisions

Content deleted Content added
Soshial (talk | contribs)
added 1 other example
Soshial (talk | contribs)
expanding
Line 7:
* [[Unicode equivalence|normalizing of Unicode]]
* converting all letters to lower or upper case
* converting numbers (dates, currencies, temperature) into words
* removing accent marks and other diacritics from letters
* removing punctuation
* converting numbers into words
* removing accent marks and other diacritics from letters
* expanding abbreviations
* removing [[stopwords]] or "too common" words
* [[stemming]]
* text [[canonicalization]] (replacing words with their full equivalents, e.g. "co-operation" → "cooperation", "valour" → "valor", "should've" → "should have")
* [[canonicalization]]
* removing repeating characters ("I looooove it!" → "I love it!")
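
Several of the steps listed above can be chained in a few lines. The following is a minimal sketch in Python, not a reference implementation: the function name <code>normalize_text</code> is hypothetical, the order of the steps is an assumption, and only the standard-library modules <code>unicodedata</code> and <code>re</code> are used.

```python
import re
import unicodedata


def normalize_text(text: str) -> str:
    """Apply a few simple normalization steps (illustrative only)."""
    # Unicode normalization: NFKD decomposes accented letters into a base
    # letter followed by separate combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Removing diacritics: drop the combining marks left by the decomposition.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Case conversion: casefold() is a more aggressive form of lower().
    folded = stripped.casefold()
    # Removing repeating characters: collapse runs of three or more identical
    # characters to one (a heuristic, see the note after the code).
    collapsed = re.sub(r"(.)\1{2,}", r"\1", folded)
    # Removing punctuation: drop anything that is neither a word character
    # nor whitespace.
    no_punct = re.sub(r"[^\w\s]", "", collapsed)
    # Tidy up any whitespace left behind.
    return " ".join(no_punct.split())


print(normalize_text("I looooove it!"))
print(normalize_text("Crème Brûlée!!!"))
```

The repeated-character rule is deliberately crude: it turns "looooove" into "love", but it would also turn "cooool" into "col", so a dictionary lookup is likely needed to pick the right target word in general.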
 
While this may be done manually, and usually is in the case of ad hoc and personal documents, many [[programming language]]s support mechanisms which enable text normalization.
 
While this may be done manually, and usually is in the case of [[ad hoc]] and personal documents, many [[programming language]]s support mechanisms which enable text normalization. These tasks often cannot be handled with blunt regular expressions alone; some of them require dictionaries and other linguistic resources.
Text normalization is useful, for example, for comparing two sequences of characters which mean the same but are represented differently. Examples of this kind of normalization include, but are not limited to, "don't" vs "do not", "I'm" vs "I am", and "Can't" vs "Cannot".
 
Text normalization is useful, for example, for comparing two sequences of characters which mean the same but are represented differently. It is also crucial for search engines and corpus management. Examples of this kind of normalization include, but are not limited to, "don't" vs "do not", "I'm" vs "I am", and "Can't" vs "Cannot". Further, "1" and "one" are the same, "1st" is the same as "first", and so on. Instead of treating these strings as different, text processing allows one to treat them as the same.
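
The comparison described above can be sketched with a small lookup table. This is a hedged illustration in Python: the table <code>EQUIVALENTS</code> and the functions <code>canonical_form</code> and <code>same_after_normalization</code> are hypothetical names, and the table holds only the examples mentioned in the text, whereas a real system would draw on a much larger dictionary.

```python
import re

# Hypothetical equivalence table containing only the examples from the text.
EQUIVALENTS = {
    "don't": "do not",
    "i'm": "i am",
    "can't": "cannot",
    "should've": "should have",
    "1": "one",
    "1st": "first",
}

# A token is a run of word characters and ASCII apostrophes, so contractions
# such as "don't" stay in one piece (typographic apostrophes are not handled).
TOKEN = re.compile(r"[\w']+")


def canonical_form(text: str) -> str:
    """Lower-case the text and replace each known token by its equivalent."""
    tokens = TOKEN.findall(text.lower())
    return " ".join(EQUIVALENTS.get(token, token) for token in tokens)


def same_after_normalization(a: str, b: str) -> bool:
    """Compare two strings by their canonical forms."""
    return canonical_form(a) == canonical_form(b)


print(same_after_normalization("I'm here", "I am here"))
print(same_after_normalization("1st place", "first place"))
```

A dictionary lookup is used here rather than a regular expression because the mapping from "can't" to "cannot" or from "1st" to "first" is not regular; it has to be listed explicitly.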
 
[[Category:Unicode]]