Text normalization


Text normalization is the process of transforming text into a single consistent form that it may not have had before. It is often performed before the text is processed further, for example before speech synthesis, automated language translation, storage in a database, or comparison.

Examples of text normalization:

  • Unicode normalization
  • converting all letters to lower or upper case
  • removing punctuation
  • removing accent marks and other diacritics from letters
  • expanding abbreviations

Although text normalization may be done manually, and usually is for ad hoc and personal documents, many programming languages provide mechanisms for performing it automatically.
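As an illustration, the kinds of normalization listed above can be sketched in Python using only the standard library. This is a minimal sketch rather than a definitive implementation: the function names, the order of the steps, and the small abbreviation table are assumptions made for the example, and real applications usually choose steps suited to their own data.

```python
import string
import unicodedata

# Hypothetical abbreviation table, for illustration only.
ABBREVIATIONS = {"dr.": "doctor", "st.": "street"}


def strip_diacritics(text):
    """Remove accent marks by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize(text):
    """Apply several normalization steps in one (assumed) order."""
    # Unicode normalization: use one composed representation for each character.
    text = unicodedata.normalize("NFC", text)
    # Case conversion.
    text = text.lower()
    # Abbreviation expansion, done before punctuation is removed
    # so that the trailing period is still available for matching.
    text = " ".join(ABBREVIATIONS.get(word, word) for word in text.split())
    # Diacritic removal.
    text = strip_diacritics(text)
    # Punctuation removal (ASCII punctuation only).
    return text.translate(str.maketrans("", "", string.punctuation))


print(normalize("Dr. Müller lives on Café St."))
```

The order of the steps matters: expanding abbreviations after punctuation had been stripped would require a table keyed on "dr" and "st", which would also match ordinary words spelled that way.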