Text normalization: Difference between revisions

Techniques: added section on textual scholarship
See also: annlink
 
(7 intermediate revisions by 7 users not shown)
Line 1:
{{Short description|Process of transforming text into a single canonical form}}
{{Use American English|date=March 2021}}
{{Use mdy dates|date=March 2021}}
Line 13:
== Applications ==
 
Text normalization is frequently used when converting [[speech synthesis|text to speech]]. [[Number]]s, [[Calendar date|date]]s, [[acronym]]s, and [[abbreviation]]s are non-standard "words" that need to be pronounced differently depending on context.<ref name="sproate">Sproat, R.; Black, A.; Chen, S.; Kumar, S.; Ostendorf, M.; Richards, C. (2001). "Normalization of non-standard words." ''Computer Speech and Language'' '''15'''; 287–333. [[Digital object identifier|doi]]:[https://dx.doi.org/10.1006/csla.2001.0169 10.1006/csla.2001.0169].</ref> For example:
 
* "$200" would be pronounced as "two hundred dollars" in English, but as "lua selau tālā" in Samoan.<ref>{{cite web
Line 20:
| accessdate = October 2, 2012
| url = http://mylanguages.org/samoan_numbers.php}}</ref>
* "vi" could be pronounced as "[[Violet (name)|vie]]," "[[Vi (text editor)|vee]]," or "[[Roman numerals|the sixth]]" depending on the surrounding words.<ref name="msdn">{{cite web
| title = Text-to-Speech Engines Text Normalization
| work = MSDN
Line 33:
 
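The "$200" example under Applications can be illustrated with a minimal sketch in Python. It is a toy rule for English only, not how any particular speech synthesizer works: the function names <code>number_to_words</code> and <code>normalize_dollars</code> are hypothetical, only whole-dollar amounts from 0 to 999 are handled, and anything else (such as "$2000" or "$200.50") is deliberately left unchanged.

```python
import re

# Assumption: English-only vocabulary for 0-999; names are illustrative.
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out an integer in the range 0-999 in American English."""
    if not 0 <= n < 1000:
        raise ValueError("this sketch only handles 0-999")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    words = ONES[hundreds] + " hundred"
    return words + (" " + number_to_words(rest) if rest else "")


# "$" followed by 1-3 digits, not continued by another digit or by a
# decimal/thousands separator plus a digit (those cases are skipped).
_DOLLARS = re.compile(r"\$(\d{1,3})(?!\d)(?![.,]\d)")


def normalize_dollars(text: str) -> str:
    """Replace simple whole-dollar amounts with their spoken form."""
    def spoken(match: re.Match) -> str:
        amount = int(match.group(1))
        unit = "dollar" if amount == 1 else "dollars"
        return f"{number_to_words(amount)} {unit}"
    return _DOLLARS.sub(spoken, text)


print(normalize_dollars("The ticket costs $200."))
```

A rule of this kind is language-specific: producing the Samoan reading of the same token would need a different vocabulary and word order, and ambiguous tokens such as "vi" cannot be resolved by pattern matching alone, because the correct reading depends on the surrounding words.
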
==Textual scholarship==
In the field of [[textual scholarship]] and the editing of historic texts, the term "normalization" implies a degree of modernization and standardization – for example in the extension of [[scribal abbreviation]]s and the transliteration of the archaic [[glyph]]s typically found in manuscript and early printed sources. A ''normalized edition'' is therefore distinguished from a ''[[Diplomatics#Diplomatic editions and transcription|diplomatic edition]]'' (or ''semi-diplomatic edition''), in which some attempt is made to preserve these features. The aim is to strike an appropriate balance between, on the one hand, rigorous fidelity to the source text (including, for example, the preservation of enigmatic and ambiguous elements); and, on the other, producing a new text that will be comprehensible and accessible to the modern reader. The extent of normalization is therefore at the discretion of the editor, and will vary. Some editors, for example, choose to modernize archaic spellings and punctuation, but others do not.<ref>{{cite book |first=P. D. A. |last=Harvey |title=Editing Historical Records |publisher=British Library |place=London |year=2001 |isbn=0-7123-4684-8 |pages=40–46 }}</ref>
 
== See also ==
* {{annotated link|Automated paraphrasing}}
* {{annotated link|Canonicalization}}
* {{annotated link|Text simplification}}
* {{annotated link|Unicode equivalence}}
 
== References ==