{{Distinguish|word normalization|Unicode normalization}}
'''Text normalization''' is the process of transforming text into a single canonical form.<ref>{{cite web
| title = CS506/606: Txt Nrmlztn
| author = Richard Sproat and Steven Bedrick
| date = September 2011
| accessdate = October 2, 2012
| url = http://www.cslu.ogi.edu/~sproatr/Courses/TextNorm/}}</ref>
== Applications ==
Text normalization is frequently used when converting [[speech synthesis|text to speech]]. [[Number]]s, [[date]]s, [[acronym]]s, and [[abbreviation]]s are non-standard "words" that need to be pronounced differently depending on context.<ref name="sproate">Sproat, R.; Black, A.; Chen, S.; Kumar, S.; Ostendorf, M.; Richards, C. (2001). "Normalization of non-standard words." ''Computer Speech and Language'' '''15'''; 287–333. [[Digital object identifier|doi]]:[http://dx.doi.org/10.1006/csla.2001.0169 10.1006/csla.2001.0169].</ref> For example:
* "$200" would be pronounced as "two hundred dollars" in English, but as "lua selau tālā" in Samoan.<ref>{{cite web
| title = Samoan Numbers
| work = MyLanguages.org
| accessdate = October 2, 2012
| url = http://mylanguages.org/samoan_numbers.php}}</ref>
* "vi" could be pronounced as "[[vi|vie]]," "[[Violet (name)|vee]]," or "[[Roman numerals|the sixth]]" depending on the surrounding words.<ref name="msdn">{{cite web
| title = Text-to-Speech Engines Text Normalization
| work = MSDN
| accessdate = October 2, 2012
| url = http://msdn.microsoft.com/en-us/library/ms699266(v=vs.85).aspx}}</ref>
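The "$200" case above can be illustrated with a minimal Python sketch. It assumes English output and whole-dollar amounts from 0 to 999 only; the function names <tt>number_to_words</tt> and <tt>expand_dollars</tt> are hypothetical and chosen for this illustration. A real text-to-speech front end would need far broader coverage and context handling.

```python
import re

# Assumption: English only, whole-dollar amounts from 0 to 999.
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out an integer from 0 to 999 in English."""
    if not 0 <= n <= 999:
        raise ValueError("this sketch only handles 0-999")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    words = ONES[hundreds] + " hundred"
    return words + (" " + number_to_words(rest) if rest else "")


def expand_dollars(text: str) -> str:
    """Replace tokens such as "$200" with their spoken English form."""
    def spell(match: re.Match) -> str:
        amount = int(match.group(1))
        unit = "dollar" if amount == 1 else "dollars"
        return f"{number_to_words(amount)} {unit}"

    # Only one to three digits are matched; longer amounts are left untouched.
    return re.sub(r"\$(\d{1,3})\b", spell, text)


print(expand_dollars("The ticket costs $200."))
```

Supporting another language, such as Samoan, would require a separate number grammar and currency word rather than a change to the regular expression alone.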
Text can also be normalized for storing and searching in a database. For instance, if a search for "resume" is to match the word "résumé," then the text would be normalized by removing [[diacritical marks]]; and if "john" is to match "John", the text would be converted to a single [[letter case|case]]. To prepare text for searching, it might also be [[stemming|stemmed]] (e.g. converting "flew" and "flying" both into "fly"), [[Canonicalization|canonicalized]] (e.g. consistently using [[American and British English spelling differences|American or British English spelling]]), or have [[stop word]]s removed.
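As a sketch of the search case, the diacritic removal and case conversion described above can be written in Python with the standard <tt>unicodedata</tt> module. The function name <tt>normalize_for_search</tt> is hypothetical, and the approach assumes that dropping combining marks after canonical decomposition is acceptable for the language in question, which is not true of every writing system.

```python
import unicodedata


def normalize_for_search(text: str) -> str:
    """Strip diacritical marks and fold case so that variants compare equal."""
    # NFD splits "é" into "e" followed by a combining acute accent.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks and keep the base characters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # casefold() is a more aggressive form of lower() intended for matching.
    return stripped.casefold()


print(normalize_for_search("résumé") == normalize_for_search("resume"))
print(normalize_for_search("John") == normalize_for_search("john"))
```

Applying the same function to both the stored text and the query is what makes the two forms match.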
== Techniques ==
For simple, context-independent normalization, such as removing non-[[alphanumeric]] characters or [[diacritical marks]], [[regular expressions]] suffice. For example, the [[sed]] script <tt>sed -E "s/\s+/ /g" ''inputfile''</tt> collapses each run of [[whitespace character]]s into a single space (with GNU sed, which supports the <tt>-E</tt> option and the <tt>\s</tt> class). More complex normalization requires correspondingly more complex algorithms that draw on [[domain knowledge]] of the language and vocabulary being normalized. Among other approaches, text normalization has been modeled as a problem of tokenizing and tagging streams of text<ref name="tagging">Zhu, C.; Tang, J.; Li, H.; Ng, H.; Zhao, T. (2007). "A Unified Tagging Approach to Text Normalization." ''Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics''; 688–695. [[Digital object identifier|doi]]:[http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.72.8138 10.1.1.72.8138].</ref> and as a special case of machine translation.<ref name="mt">Filip, G.; Krzysztof, J.; Agnieszka, W.; Mikołaj, W. (2006). [http://www.proceedings2006.imcsit.org/pliks/202.pdf "Text Normalization as a Special Case of Machine Translation."] ''Proceedings of the International Multiconference on Computer Science and Information Technology'' '''1'''; 51–56.</ref>
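The context-independent, regular-expression approach can also be sketched in Python with the standard <tt>re</tt> module. The function names below are hypothetical, and the definition of "alphanumeric" used here (Unicode letters and digits, with the underscore excluded) is an assumption made for this example.

```python
import re


def collapse_whitespace(text: str) -> str:
    """Replace every run of whitespace characters with a single space."""
    return re.sub(r"\s+", " ", text)


def strip_non_alphanumeric(text: str) -> str:
    """Remove characters that are neither alphanumeric nor whitespace."""
    # \w also matches "_", so the underscore is removed explicitly.
    return re.sub(r"[^\w\s]|_", "", text)


print(collapse_whitespace("text \t with\n\n  uneven   spacing"))
print(strip_non_alphanumeric("Hello, world! (test_1)"))
```

Neither function looks at the surrounding words, which is why this style of rule cannot resolve context-dependent cases such as the pronunciation of "vi".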
== References ==
{{Reflist}}
== See also ==
* [[Canonicalization]]
[[Category:Natural language processing]]
{{compu-sci-stub}}