Revision as of 13:35, 15 March 2021 by Ivan Humphrey (short description)
Revision as of 17:59, 15 April 2021 by GrindtXX (→Techniques: added section on textual scholarship)
== Techniques ==
For simple, context-independent normalization, such as removing non-[[alphanumeric]] characters or [[diacritical marks]], [[regular expressions]] suffice. For example, the [[sed]] script <code>sed -E "s/\s+/ /g" ''inputfile''</code> (GNU sed, extended regular expression syntax) collapses each run of [[whitespace character]]s into a single space. More complex normalization requires correspondingly complicated algorithms, including [[domain knowledge]] of the language and vocabulary being normalized. Among other approaches, text normalization has been modeled as a problem of tokenizing and tagging streams of text<ref name="tagging">Zhu, C.; Tang, J.; Li, H.; Ng, H.; Zhao, T. (2007). "A Unified Tagging Approach to Text Normalization." ''Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics''; 688–695. [[Digital object identifier|doi]]:[http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.72.8138 10.1.1.72.8138].</ref> and as a special case of machine translation.<ref name="mt">Filip, G.; Krzysztof, J.; Agnieszka, W.; Mikołaj, W. (2006). [https://annals-csis.org/proceedings/2006/pliks/202.pdf "Text Normalization as a Special Case of Machine Translation."] ''Proceedings of the International Multiconference on Computer Science and Information Technology'' '''1'''; 51–56.</ref><ref name="sm">Mosquera, A.; Lloret, E.; Moreda, P. (2012). [http://lrec.elra.info/proceedings/lrec2012/workshops/25.NLP4ITA-Proceedings.pdf#page=14 "Towards Facilitating the Accessibility of Web 2.0 Texts through Text Normalisation."] ''Proceedings of the LREC workshop: Natural Language Processing for Improving Textual Accessibility (NLP4ITA)''; 9–14.</ref>

== Textual scholarship ==
In the field of [[textual scholarship]] and the editing of historic texts, the term "normalization" implies a degree of modernization and standardization – for example in the extension of [[scribal abbreviation]]s and the transliteration of the archaic [[glyph]]s typically found in manuscript and early printed sources. A ''normalized edition'' is therefore distinguished from a [[Diplomatics#Diplomatic editions and transcription|''diplomatic'' (or ''semi-diplomatic'') ''edition'']], in which some attempt is made to preserve these features. However, the extent of normalization is at the discretion of the editor and varies: some editors, for example, choose to modernize archaic spellings and punctuation, while others do not.<ref>{{cite book |first=P. D. A. |last=Harvey |title=Editing Historical Records |publisher=British Library |place=London |year=2001 |isbn=0-7123-4684-8 |pages=40–46 }}</ref>

== See also ==
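
The simple, context-independent normalization described under Techniques (collapsing whitespace, removing diacritical marks, dropping non-alphanumeric characters) can be sketched in Python using only the standard library. This is an illustrative sketch, not a method taken from the cited sources: the function names are hypothetical, and the choice to strip diacritics through Unicode NFD decomposition is an assumption that works for characters with a canonical decomposition (such as "é") but not for letters like "ø" or "ł".

```python
import re
import unicodedata


def collapse_whitespace(text: str) -> str:
    """Replace each run of whitespace with a single space (like the sed example)."""
    return re.sub(r"\s+", " ", text)


def strip_diacritics(text: str) -> str:
    """Remove combining marks after canonical decomposition (NFD).

    Assumption: only characters with a canonical decomposition are affected.
    """
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


def remove_non_alphanumeric(text: str) -> str:
    """Drop every character that is neither alphanumeric nor whitespace."""
    return "".join(ch for ch in text if ch.isalnum() or ch.isspace())


def normalize(text: str) -> str:
    """Apply the three steps in order, then trim leading and trailing spaces."""
    text = strip_diacritics(text)
    text = remove_non_alphanumeric(text)
    return collapse_whitespace(text).strip()


if __name__ == "__main__":
    print(normalize("Café   déjà-vu!\n"))
```

The final `strip()` call is an extra step beyond what the sed one-liner does; it is included here only so that the result has no leading or trailing space.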

Text normalization: Difference between revisions