Text normalization: Difference between revisions


Revision as of 17:59, 15 April 2021 by GrindtXX (→Techniques: added section on textual scholarship)
Latest revision as of 14:00, 14 November 2024 by Kku (→See also: annlink)
(7 intermediate revisions by 7 users not shown)
Line 1:
{{Short description|Process of transforming text into a single canonical form}}
{{Use American English|date=March 2021}}
{{Use mdy dates|date=March 2021}}

Line 13:
== Applications==
Text normalization is frequently used when converting [[speech synthesis|text to speech]]. [[Number]]s, [[Calendar date|date]]s, [[acronym]]s, and [[abbreviation]]s are non-standard "words" that need to be pronounced differently depending on context.<ref name="sproate">Sproat, R.; Black, A.; Chen, S.; Kumar, S.; Ostendorf, M.; Richards, C. (2001). "Normalization of non-standard words." ''Computer Speech and Language'' '''15'''; 287–333. [[Digital object identifier|doi]]:[https://dx.doi.org/10.1006/csla.2001.0169 10.1006/csla.2001.0169].</ref> For example:
* "$200" would be pronounced as "two hundred dollars" in English, but as "lua selau tālā" in Samoan.<ref>{{cite web

Line 20:
| accessdate = October 2, 2012
| url = http://mylanguages.org/samoan_numbers.php}}</ref>
* "vi" could be pronounced as "[[Violet (name)|vie]]," "[[Vi (text editor)|vee]]," or "[[Roman numerals|the sixth]]" depending on the surrounding words.<ref name="msdn">{{cite web
| title = Text-to-Speech Engines Text Normalization
| work = MSDN

Line 33:
==Textual scholarship==
In the field of [[textual scholarship]] and the editing of historic texts, the term "normalization" implies a degree of modernization and standardization – for example in the extension of [[scribal abbreviation]]s and the transliteration of the archaic [[glyph]]s typically found in manuscript and early printed sources. A ''normalized edition'' is therefore distinguished from a ''[[Diplomatics#Diplomatic editions and transcription|diplomatic edition]]'' (or ''semi-diplomatic edition''), in which some attempt is made to preserve these features. The aim is to strike an appropriate balance between, on the one hand, rigorous fidelity to the source text (including, for example, the preservation of enigmatic and ambiguous elements); and, on the other, producing a new text that will be comprehensible and accessible to the modern reader. The extent of normalization is therefore at the discretion of the editor, and will vary. Some editors, for example, choose to modernize archaic spellings and punctuation, but others do not.<ref>{{cite book |first=P. D. A. |last=Harvey |title=Editing Historical Records |publisher=British Library |place=London |year=2001 |isbn=0-7123-4684-8 |pages=40–46 }}</ref>

== See also ==
* {{annotated link|Automated paraphrasing}}
* {{annotated link|Canonicalization}}
* {{annotated link|Text simplification}}
* {{annotated link|Unicode equivalence}}

== References ==