WP:CHECKWIKI error fix #86. External link with two brackets. Do general fixes and cleanup if needed using AWB (8512) |
Line 17:
| accessdate = October 2, 2012
| url = http://mylanguages.org/samoan_numbers.php}}</ref>
* "vi" could be pronounced as "[[vi
| title = Text-to-Speech Engines Text Normalization
| work = MSDN
Line 27:
== Techniques ==
For simple, context-independent normalization, such as removing non-[[alphanumeric]] characters or [[diacritical marks]], [[regular expressions]] would suffice. For example, the [[sed]] command <tt>sed -E "s/\s+/ /g" ''inputfile''</tt> would normalize each run of [[whitespace character]]s into a single space. More complex normalization requires correspondingly complicated algorithms that draw on [[domain knowledge]] of the language and vocabulary being normalized. Among other approaches, text normalization has been modeled as a problem of tokenizing and tagging streams of text<ref name="tagging">Zhu, C.; Tang, J.; Li, H.; Ng, H.; Zhao, T. (2007). "A Unified Tagging Approach to Text Normalization." ''Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics''; 688–695. [[Digital object identifier|doi]]:
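The simple, context-independent cases described above can be sketched in a few lines of Python. This is only an illustrative sketch, not a method taken from the cited sources: the function names are hypothetical, "alphanumeric" is assumed to mean ASCII letters and digits, and diacritics are assumed to be removable by Unicode decomposition followed by dropping combining marks.

```python
import re
import unicodedata


def strip_diacritics(text: str) -> str:
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def strip_non_alphanumeric(text: str) -> str:
    """Remove characters that are not ASCII letters, digits, or whitespace."""
    return re.sub(r"[^0-9A-Za-z\s]", "", text)


def normalize_whitespace(text: str) -> str:
    """Collapse each run of whitespace into a single space."""
    return re.sub(r"\s+", " ", text)


def simple_normalize(text: str) -> str:
    """Apply the three context-independent steps in a fixed order."""
    # Diacritics go first so that accented letters survive as base letters
    # instead of being deleted by the alphanumeric filter.
    text = strip_diacritics(text)
    text = strip_non_alphanumeric(text)
    return normalize_whitespace(text)
```

Calling <tt>simple_normalize</tt> on a string applies all three steps in order. Context-dependent cases, such as choosing a pronunciation for an ambiguous token, are outside what a sketch of this kind can handle.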
== References ==
Line 39:
[[Category:Natural language processing]]
{{compu-sci-stub}}