Text normalization: Difference between revisions

Rangi42 (talk | contribs)
Added references and rewrote material.
{{Distinguish|word normalization|Unicode normalization}}
{{unreferenced|date=October 2007}}
 
'''Text normalization''' is the process of transforming [[writing|text]] into a single [[canonical form]] that it might not have had before. Normalizing text before storing or processing it allows for [[separation of concerns]], since input is guaranteed to be consistent before operations are performed on it. It is often performed before text is processed in some way, such as generating [[speech synthesis|synthesized speech]], [[automated language translation]], storage in a [[database]], or comparison. Text normalization requires being aware of what type of text is to be normalized and how it is to be processed afterwards; there is no all-purpose normalization procedure.<ref name="cs506">{{cite web
| title = CS506/606: Txt Nrmlztn
| author = Richard Sproat and Steven Bedrick
| date = September 2011
| accessdate = October 2, 2012
| url = http://www.cslu.ogi.edu/~sproatr/Courses/TextNorm/}}</ref>
 
== Applications ==
Common text normalization operations include:
* [[Unicode equivalence|Unicode normalization]]
* converting all letters to lower or upper case
* converting numbers (dates, currencies, temperatures) into words
* removing accent marks and other diacritics from letters
* removing punctuation
* expanding abbreviations
* removing [[stopwords]] or "too common" words
* [[stemming|word normalization]] (also known as stemming)
* text [[canonicalization]] (replacing words with their full equivalents, e.g. "co-operation" → "cooperation", "valour" → "valor", "should've" → "should have")
* removing repeated characters ("I looooove it!" → "I love it!")

While this may be done manually, and usually is in the case of [[ad hoc]] and personal documents, many [[programming language]]s provide mechanisms that support text normalization. Not all of these tasks can be handled with blunt regular expressions; some require dictionaries and other linguistic resources.

Text normalization is frequently used when converting [[speech synthesis|text to speech]]. [[Number]]s, [[date]]s, [[acronym]]s, and [[abbreviation]]s are non-standard "words" that need to be pronounced differently depending on context.<ref name="sproate">Sproat, R.; Black, A.; Chen, S.; Kumar, S.; Ostendorf, M.; Richards, C. (2001). "Normalization of non-standard words." ''Computer Speech and Language'' '''15'''; 287–333. [[Digital object identifier|doi]]:[http://dx.doi.org/10.1006/csla.2001.0169 10.1006/csla.2001.0169].</ref> For example:
* "$200" would be pronounced as "two hundred dollars" in English, but as "lua selau tālā" in Samoan.<ref>{{cite web
| title = Samoan Numbers
| work = MyLanguages.org
| accessdate = October 2, 2012
| url = http://mylanguages.org/samoan_numbers.php}}</ref>
* "vi" could be pronounced as "[[vi|vie]]," "[[Violet (name)|vee]]," or "[[Roman numerals|the sixth]]" depending on the surrounding words.<ref name="msdn">{{cite web
| title = Text-to-Speech Engines Text Normalization
| work = MSDN
| accessdate = October 2, 2012
| url = http://msdn.microsoft.com/en-us/library/ms699266(v=vs.85).aspx}}</ref>
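
The English reading of "$200" above can be illustrated with a short sketch. It assumes English output and whole-dollar amounts below one thousand; the helper names <tt>number_to_words</tt> and <tt>expand_dollars</tt> are hypothetical and do not come from any of the cited sources. A real text-to-speech front end would need far wider coverage and would have to use the surrounding context.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n):
    """Spell out an integer in the range 0..999 in English."""
    if not 0 <= n <= 999:
        raise ValueError("this sketch only handles 0..999")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] if ones == 0 else TENS[tens] + "-" + ONES[ones]
    hundreds, rest = divmod(n, 100)
    words = ONES[hundreds] + " hundred"
    return words if rest == 0 else words + " " + number_to_words(rest)


def expand_dollars(text):
    """Replace whole-dollar amounts such as "$200" with English words."""
    def repl(match):
        amount = int(match.group(1))
        unit = "dollar" if amount == 1 else "dollars"
        return number_to_words(amount) + " " + unit

    # Assumption: 1-3 digits, no thousands separators, no cents.
    # The lookahead skips longer numbers and amounts like "$200.50".
    return re.sub(r"\$(\d{1,3})(?!\d|[.,]\d)", repl, text)


print(expand_dollars("It costs $200."))
```

Amounts outside the assumed pattern, such as "$2000" or "$200.50", are left unchanged rather than expanded incorrectly.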
 
Text can also be normalized for storing and searching in a database. For instance, if a search for "resume" is to match the word "résumé," then the text would be normalized by removing [[diacritical marks]]; and if "john" is to match "John", the text would be converted to a single [[letter case|case]]. To prepare text for searching, it might also be [[stemming|stemmed]] (e.g. converting "flew" and "flying" both into "fly"), [[Canonicalization|canonicalized]] (e.g. consistently using [[American and British English spelling differences|American or British English spelling]]), or have [[stop word]]s removed.
Text normalization is also useful for comparing two sequences of characters that mean the same thing but are represented differently, which is crucial for search engines and corpus management. Examples include, but are not limited to, "don't" and "do not", "I'm" and "I am", and "can't" and "cannot". Likewise, "1" means the same as "one", and "1st" the same as "first". After normalization, such strings can be treated as the same rather than as different.
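
The search-oriented steps described above (removing diacritics, converting to a single case, and expanding contractions) might be sketched as follows. This is only an illustration that relies on Python's standard <tt>unicodedata</tt> module; the function names and the three-entry <tt>CONTRACTIONS</tt> table are hypothetical, and a production system would use much larger linguistic resources.

```python
import unicodedata


def strip_diacritics(text):
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def search_key(text):
    """Build a case-insensitive, accent-insensitive key for matching."""
    return strip_diacritics(text).casefold()


# Hypothetical, deliberately tiny lookup table used only for illustration.
CONTRACTIONS = {"don't": "do not", "i'm": "i am", "can't": "cannot"}


def expand_contractions(text):
    """Case-fold the text, then expand known contractions token by token."""
    tokens = text.casefold().split()
    return " ".join(CONTRACTIONS.get(tok, tok) for tok in tokens)


print(search_key("résumé") == search_key("resume"))
print(expand_contractions("I'm sure you Can't"))
```

Both the stored text and the query would be passed through the same functions, so that matching happens on the normalized keys rather than on the original strings.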
 
== Techniques ==
[[Category:Unicode]]
 
For simple, context-independent normalization, such as removing non-[[alphanumeric]] characters or [[diacritical marks]], [[regular expressions]] suffice. For example, the [[sed]] script <tt>sed -E "s/\s+/ /g" ''inputfile''</tt> would normalize runs of [[whitespace character]]s into a single space. More complex normalization requires correspondingly complicated algorithms, including [[domain knowledge]] of the language and vocabulary being normalized. Among other approaches, text normalization has been modeled as a problem of tokenizing and tagging streams of text<ref name="tagging">Zhu, C.; Tang, J.; Li, H.; Ng, H.; Zhao, T. (2007). "A Unified Tagging Approach to Text Normalization." ''Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics''; 688–695. [[Digital object identifier|doi]]:[http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.72.8138 10.1.1.72.8138].</ref> and as a special case of machine translation.<ref name="mt">Filip, G.; Krzysztof, J.; Agnieszka, W.; Mikołaj, W. (2006). [http://www.proceedings2006.imcsit.org/pliks/202.pdf "Text Normalization as a Special Case of Machine Translation."] ''Proceedings of the International Multiconference on Computer Science and Information Technology'' '''1'''; 51–56.</ref>
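
The context-independent cases mentioned above can also be written with the regular expression support of a general-purpose language. The following sketch uses Python's standard <tt>re</tt> module; the function names are illustrative assumptions, not part of any cited method.

```python
import re


def collapse_whitespace(text):
    """Replace each run of whitespace characters with a single space."""
    return re.sub(r"\s+", " ", text)


def remove_non_alphanumeric(text):
    """Drop every character that is not a letter, digit, or whitespace."""
    # \w also matches the underscore, so it is removed explicitly.
    return re.sub(r"[^\w\s]|_", "", text)


sample = "Text   normalization,\tin  brief!"
print(collapse_whitespace(remove_non_alphanumeric(sample)))
```

Unlike the line-oriented sed script, <tt>collapse_whitespace</tt> treats newlines as whitespace too, so a multi-line string is joined into a single line.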
 
== References ==
 
{{Reflist}}
 
== See also ==
 
* [[Canonicalization]]
* [[Unicode equivalence|normalizing of Unicode]]
 
[[Category:Natural language processing]]
 
{{compu-sci-stub}}