Text normalization: Difference between revisions

| url = http://www.csee.ogi.edu/~sproatr/Courses/TextNorm/}}</ref>
 
== Applications ==
 
Text normalization is frequently used when converting [[speech synthesis|text to speech]]. [[Number]]s, [[Calendar date|date]]s, [[acronym]]s, and [[abbreviation]]s are non-standard "words" that need to be pronounced differently depending on context.<ref name="sproate">Sproat, R.; Black, A.; Chen, S.; Kumar, S.; Ostendorf, M.; Richards, C. (2001). "Normalization of non-standard words." ''Computer Speech and Language'' '''15''': 287–333. [[Digital object identifier|doi]]:[https://dx.doi.org/10.1006/csla.2001.0169 10.1006/csla.2001.0169].</ref> For example:
 
* "$200" would be pronounced as "two hundred dollars" in English, but as "lua selau tālā" in Samoan.<ref>{{cite web
| title = Samoan Numbers
| work = MyLanguages.org
| accessdate = October 2, 2012
| url = http://mylanguages.org/samoan_numbers.php}}</ref>
* "vi" could be pronounced as "[[vi]]e," "[[Violet (name)|vee]]," or "[[Roman numerals|the sixth]]" depending on the surrounding words.<ref name="msdn">{{cite web
| title = Text-to-Speech Engines Text Normalization
| work = MSDN
| accessdate = October 2, 2012
| url = http://msdn.microsoft.com/en-us/library/ms699266(v=vs.85).aspx}}</ref>
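
The "$200" case above can be illustrated with a minimal sketch, assuming English output and whole-dollar amounts below 1,000. The function names <code>number_to_words</code> and <code>expand_dollars</code> are hypothetical and chosen only for this illustration. A production text-to-speech front end would also need to handle cents, larger numbers, other currencies, and other languages.

```python
import re

ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out an integer in the range 0..999 in English."""
    if not 0 <= n < 1000:
        raise ValueError("this sketch only supports 0..999")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    words = ONES[hundreds] + " hundred"
    return words + (" " + number_to_words(rest) if rest else "")


def expand_dollars(text: str) -> str:
    """Replace whole-dollar tokens such as "$200" with their spoken form."""
    def spoken(match: re.Match) -> str:
        amount = int(match.group(1))
        unit = "dollar" if amount == 1 else "dollars"
        return f"{number_to_words(amount)} {unit}"

    # Assumption: one to three digits only; longer amounts are left untouched.
    return re.sub(r"\$(\d{1,3})\b", spoken, text)


print(expand_dollars("The ticket costs $200."))
```

The same token would need a different expansion table for Samoan, which is why such rules are usually kept per language rather than hard-coded once.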
 
Text can also be normalized for storing and searching in a database. For instance, if a search for "resume" is to match the word "résumé," then the text would be normalized by removing [[diacritical marks]]; and if "john" is to match "John", the text would be converted to a single [[letter case|case]]. To prepare text for searching, it might also be [[stemming|stemmed]] (e.g. converting "flew" and "flying" both into "fly"), [[Canonicalization|canonicalized]] (e.g. consistently using [[American and British English spelling differences|American or British English spelling]]), or have [[stop word]]s removed.
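
As a rough sketch of the search-oriented steps described above, the following Python snippet removes diacritical marks, folds case, and drops stop words. The function names and the tiny stop-word list are assumptions made for illustration. Stemming and spelling canonicalization are omitted because they normally rely on language-specific resources.

```python
import unicodedata

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = frozenset({"a", "an", "the", "of", "and"})


def strip_diacritics(text: str) -> str:
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_for_search(text: str) -> list[str]:
    """Return whitespace-separated tokens, accent-free, case-folded, without stop words."""
    folded = strip_diacritics(text).casefold()
    return [token for token in folded.split() if token not in STOP_WORDS]


print(normalize_for_search("The Résumé of John"))
```

Applying the same function to both the stored text and the query is what lets "resume" match "résumé" and "john" match "John".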
 
== Techniques ==