m E.g. "Roman numeral", with latter lower case(?) but always former upper. |
→Typographic conventions: Mention error conversion as it is a form of normalization too |
Line 44:
===Typographic conventions===
Unicode provides code points for some characters or groups of characters which are modified only for aesthetic reasons (such as [[Typographic ligature|ligatures]], the half-width [[katakana]] characters, or the full-width Latin letters for use in Japanese texts), or to add new semantics without losing the original meaning (such as digits in [[subscript]] or [[superscript]] positions, or the circled digits, for example "①", inherited from some Japanese fonts). Such a sequence is considered compatible with the sequence of original (individual and unmodified) characters, for the benefit of applications where the appearance and added semantics are not relevant. However, the two sequences are not declared canonically equivalent, since the distinction has some semantic value and affects the rendering of the text.
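The difference between compatibility and canonical equivalence can be illustrated with a minimal sketch using Python's standard <code>unicodedata</code> module. The sample characters and the helper name <code>compare_forms</code> are chosen here for illustration only. NFC applies only canonical equivalences, whereas NFKC also folds compatibility characters into their plain counterparts.

```python
import unicodedata

# Illustrative sample characters (chosen for this sketch, not exhaustive).
SAMPLES = {
    "ligature fi": "\ufb01",
    "half-width katakana ka": "\uff76",
    "full-width Latin A": "\uff21",
    "superscript two": "\u00b2",
    "circled digit one": "\u2460",
}


def compare_forms(text):
    """Return the (NFC, NFKC) normalizations of text.

    NFC uses canonical equivalence only; NFKC additionally replaces
    compatibility characters with their compatible sequences.
    """
    return (
        unicodedata.normalize("NFC", text),
        unicodedata.normalize("NFKC", text),
    )


for name, char in SAMPLES.items():
    nfc, nfkc = compare_forms(char)
    # NFC leaves each sample unchanged; NFKC yields the plain character(s).
    print(f"{name}: NFC={nfc!r} NFKC={nfkc!r}")
```

In this sketch the NFC result equals the input for every sample, while NFKC discards the typographic or added-semantic distinction, which is why compatibility normalization is unsuitable where that distinction matters.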
===Encoding errors===
[[UTF-8]] and [[UTF-16]] (and also some other Unicode encodings) do not allow all possible sequences of [[code unit]]s. Different software converts invalid sequences into Unicode characters using different rules, some of which are very lossy (for example, turning all invalid sequences into the same character). This can be considered a form of normalization and can lead to the same difficulties as other forms.
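As a rough illustration of how such rules differ, the following sketch decodes two distinct invalid UTF-8 byte strings with two of the error handlers built into Python's decoder. The helper names <code>decode_lossy</code> and <code>decode_preserving</code> and the sample bytes are assumptions made for this example; other software may apply quite different rules.

```python
# Two different byte strings, each ending in a byte that is never valid in UTF-8.
BAD_A = b"caf\xff"
BAD_B = b"caf\xfe"


def decode_lossy(data):
    """Decode as UTF-8, replacing each invalid sequence with U+FFFD."""
    return data.decode("utf-8", errors="replace")


def decode_preserving(data):
    """Decode as UTF-8, mapping each invalid byte to a distinct lone surrogate."""
    return data.decode("utf-8", errors="surrogateescape")


# The lossy rule maps both inputs to the same string, so they can no
# longer be told apart after decoding.
print(decode_lossy(BAD_A) == decode_lossy(BAD_B))

# The preserving rule keeps the inputs distinct and can be reversed.
text = decode_preserving(BAD_A)
print(text.encode("utf-8", errors="surrogateescape") == BAD_A)
```

Under the lossy rule, two originally different inputs compare equal after conversion, which is the same effect that normalization has on distinct but equivalent sequences.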
==Normalization==