Text normalization: Difference between revisions

Revision as of 17:15, 23 June 2007 by Tsca.bot (m robot Adding: pl:Normalizacja tekstu); latest revision as of 14:00, 14 November 2024 by Kku (→See also: annlink)
(48 intermediate revisions by 40 users not shown)
{{Short description|Process of transforming text into a single canonical form}}
{{Use American English|date=March 2021}}
{{Use mdy dates|date=March 2021}}
{{Distinguish|word normalization|Unicode normalization}}
'''Text normalization''' is the process of transforming [[writing|text]] into a single [[canonical form]] that it might not have had before. Normalizing text before storing or processing it allows for [[separation of concerns]], since input is guaranteed to be consistent before operations are performed on it. Text normalization requires being aware of what type of text is to be normalized and how it is to be processed afterwards; there is no all-purpose normalization procedure.<ref name="cs506">{{cite web | title = CS506/606: Txt Nrmlztn | author = [[Richard Sproat]] and Steven Bedrick | date = September 2011 | accessdate = October 2, 2012 | url = http://www.csee.ogi.edu/~sproatr/Courses/TextNorm/}}</ref>

Examples of text normalization include:
* [[Unicode normalization]]
* converting all letters to lower or upper case
* removing punctuation
* removing accent marks and other diacritics from letters
* expanding abbreviations

== Applications ==
Text normalization is frequently used when converting [[speech synthesis|text to speech]]. [[Number]]s, [[Calendar date|date]]s, [[acronym]]s, and [[abbreviation]]s are non-standard "words" that need to be pronounced differently depending on context.<ref name="sproate">Sproat, R.; Black, A.; Chen, S.; Kumar, S.; Ostendorf, M.; Richards, C. (2001). "Normalization of non-standard words."
''Computer Speech and Language'' '''15'''; 287–333. [[Digital object identifier|doi]]:[https://dx.doi.org/10.1006/csla.2001.0169 10.1006/csla.2001.0169].</ref> For example:
* "$200" would be pronounced as "two hundred dollars" in English, but as "lua selau tālā" in Samoan.<ref>{{cite web | title = Samoan Numbers | work = MyLanguages.org | accessdate = October 2, 2012 | url = http://mylanguages.org/samoan_numbers.php}}</ref>
* "vi" could be pronounced as "[[Violet (name)|vie]]," "[[Vi (text editor)|vee]]," or "[[Roman numerals|the sixth]]" depending on the surrounding words.<ref name="msdn">{{cite web | title = Text-to-Speech Engines Text Normalization | work = MSDN | accessdate = October 2, 2012 | url = http://msdn.microsoft.com/en-us/library/ms699266(v=vs.85).aspx}}</ref>

Text can also be normalized for storing and searching in a database. For instance, if a search for "resume" is to match the word "résumé," then the text would be normalized by removing [[diacritical marks]]; and if "john" is to match "John," the text would be converted to a single [[letter case|case]]. To prepare text for searching, it might also be [[stemming|stemmed]] (e.g. converting "flew" and "flying" both into "fly"), [[Canonicalization|canonicalized]] (e.g. consistently using [[American and British English spelling differences|American or British English spelling]]), or have [[stop word]]s removed.

== Techniques ==
For simple, context-independent normalization, such as removing non-[[alphanumeric]] characters or [[diacritical marks]], [[regular expressions]] would suffice.
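
The context-independent steps mentioned so far (removing diacritical marks, converting to a single case, and removing non-alphanumeric characters) can be sketched in Python using only the standard library. This is a minimal illustration rather than a reference implementation: the function names <code>strip_diacritics</code> and <code>normalize_for_search</code> are hypothetical, and the choices of Unicode decomposition (NFD), case folding, and whitespace collapsing are assumptions about what a search application might want.

```python
import re
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns a non-zero class for combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_for_search(text: str) -> str:
    """Hypothetical helper: reduce text to a simple form for matching."""
    text = strip_diacritics(text)
    text = text.casefold()                    # convert to a single case
    text = re.sub(r"[^\w\s]", "", text)       # drop punctuation (\w keeps letters, digits, "_")
    text = re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace
    return text


if __name__ == "__main__":
    print(normalize_for_search("Résumé"))
    print(normalize_for_search("  Hello,\t\nWorld!  "))
```

With this sketch, a search key built from "Résumé" equals one built from "resume", and "John" matches "john". Decomposition only removes marks that Unicode encodes as combining characters; letters such as "ø" or "ł" have no canonical decomposition and would need an explicit mapping.
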
For example, the [[sed]] script <code>sed -E "s/\s+/ /g" ''inputfile''</code> (using extended regular expressions; <code>\s</code> is a GNU extension) would normalize runs of [[whitespace character]]s into a single space. More complex normalization requires correspondingly complicated algorithms and [[domain knowledge]] of the language and vocabulary being normalized. Among other approaches, text normalization has been modeled as a problem of tokenizing and tagging streams of text<ref name="tagging">Zhu, C.; Tang, J.; Li, H.; Ng, H.; Zhao, T. (2007). "A Unified Tagging Approach to Text Normalization." ''Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics''; 688–695. [[Digital object identifier|doi]]:[http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.72.8138 10.1.1.72.8138].</ref> and as a special case of machine translation.<ref name="mt">Filip, G.; Krzysztof, J.; Agnieszka, W.; Mikołaj, W. (2006). [https://annals-csis.org/proceedings/2006/pliks/202.pdf "Text Normalization as a Special Case of Machine Translation."] ''Proceedings of the International Multiconference on Computer Science and Information Technology'' '''1'''; 51–56.</ref><ref name="sm">Mosquera, A.; Lloret, E.; Moreda, P. (2012). [http://lrec.elra.info/proceedings/lrec2012/workshops/25.NLP4ITA-Proceedings.pdf#page=14 "Towards Facilitating the Accessibility of Web 2.0 Texts through Text Normalisation."] ''Proceedings of the LREC workshop: Natural Language Processing for Improving Textual Accessibility (NLP4ITA)''; 9–14.</ref>

== Textual scholarship ==
In the field of [[textual scholarship]] and the editing of historic texts, the term "normalization" implies a degree of modernization and standardization – for example in the extension of [[scribal abbreviation]]s and the transliteration of the archaic [[glyph]]s typically found in manuscript and early printed sources.
A ''normalized edition'' is therefore distinguished from a ''[[Diplomatics#Diplomatic editions and transcription|diplomatic edition]]'' (or ''semi-diplomatic edition''), in which some attempt is made to preserve these features. The aim is to strike an appropriate balance between, on the one hand, rigorous fidelity to the source text (including, for example, the preservation of enigmatic and ambiguous elements); and, on the other, producing a new text that will be comprehensible and accessible to the modern reader. The extent of normalization is therefore at the discretion of the editor, and will vary. Some editors, for example, choose to modernize archaic spellings and punctuation, but others do not.<ref>{{cite book |first=P. D. A. |last=Harvey |title=Editing Historical Records |publisher=British Library |place=London |year=2001 |isbn=0-7123-4684-8 |pages=40–46 }}</ref>

== See also ==
* {{annotated link|Automated paraphrasing}}
* {{annotated link|Canonicalization}}
* {{annotated link|Text simplification}}
* {{annotated link|Unicode equivalence}}

== References ==
{{Reflist}}

[[Category:Natural language processing]]