short description |
→Techniques: added section on textual scholarship |
Line 30:
== Techniques ==
For simple, context-independent normalization, such as removing non-[[alphanumeric]] characters or [[diacritical marks]], [[regular expressions]] usually suffice. For example, the [[sed]] command <code>sed -E "s/\s+/ /g" ''inputfile''</code> collapses each run of [[whitespace character]]s into a single space; the <code>-E</code> option selects extended regular expression syntax, in which <code>+</code> means "one or more". More complex normalization requires correspondingly complicated algorithms, including [[domain knowledge]] of the language and vocabulary being normalized. Among other approaches, text normalization has been modeled as a problem of tokenizing and tagging streams of text<ref name="tagging">Zhu, C.; Tang, J.; Li, H.; Ng
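The context-independent steps described above (removing diacritical marks, removing non-alphanumeric characters, and collapsing whitespace) can be illustrated with a minimal Python sketch. The function name <code>normalize_text</code> and the choice to replace punctuation with a space rather than delete it are assumptions made for this example, not part of any standard; it uses only the standard-library <code>re</code> and <code>unicodedata</code> modules.

```python
import re
import unicodedata


def normalize_text(text: str) -> str:
    """Illustrative, context-independent normalization (hypothetical helper)."""
    # Decompose characters (NFD) so each diacritic becomes a separate combining mark.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (Unicode category "Mn", nonspacing mark).
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    # Replace every character that is neither alphanumeric nor whitespace
    # with a space (an assumption of this sketch; deletion is also possible).
    alnum_only = re.sub(r"[^\w\s]|_", " ", stripped)
    # Collapse runs of whitespace into one space and trim the ends.
    return re.sub(r"\s+", " ", alnum_only).strip()


print(normalize_text("Café  au\tlait!"))
```

Because punctuation is replaced by a space, a contraction such as "don't" becomes two tokens; handling such cases correctly is an example of the language-specific knowledge that more complex normalization requires.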
== Textual scholarship ==
In the field of [[textual scholarship]] and the editing of historic texts, the term "normalization" implies a degree of modernization and standardization – for example in the extension of [[scribal abbreviation]]s and the transliteration of the archaic [[glyph]]s typically found in manuscript and early printed sources. A ''normalized edition'' is therefore distinguished from a [[Diplomatics#Diplomatic editions and transcription|''diplomatic'' (or ''semi-diplomatic'') ''edition'']], in which some attempt is made to preserve these features. However, the extent of normalization is at the discretion of the editor, and will vary: some editors, for example, choose to modernize archaic spellings and punctuation, but others do not.<ref>{{cite book |first=P. D. A. |last=Harvey |title=Editing Historical Records |publisher=British Library |place=London |year=2001 |isbn=0-7123-4684-8 |pages=40–46 }}</ref>
== See also ==