Unicode equivalence

{{Short description|Aspect of the Unicode standard}}
{{Refimprove|date=November 2014}}
'''Unicode equivalence''' is the specification by the [[Unicode]] [[character (computing)|character]] encoding standard that some sequences of [[code point]]s represent essentially the same character. This feature was introduced in the standard to allow compatibility with pre-existing standard [[character set]]s, which often included similar or identical characters.
 
[[Unicode]] provides two such notions, [[canonical form|canonical]] equivalence and compatibility. [[Code point]] sequences that are defined as '''canonically equivalent''' are assumed to have the same appearance and meaning when printed or displayed. For example, the code point {{unichar|006E|Latin small letter n|nlink=N}} followed by {{unichar|0303|Combining tilde|cwith=◌|nlink=combining character}} is defined by Unicode to be canonically equivalent to the single code point {{unichar|00F1|LATIN SMALL LETTER N WITH TILDE}} of the [[Spanish alphabet]]. Therefore, those sequences should be displayed in the same manner, should be treated in the same way by applications such as [[alphabetical order|alphabetizing]] names or [[string searching|searching]], and may be substituted for each other. Similarly, each [[Hangul]] syllable block that is encoded as a single character may be equivalently encoded as a combination of a leading conjoining jamo, a vowel conjoining jamo, and, if appropriate, a trailing conjoining jamo.
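These canonical equivalences can be observed with, for example, Python's standard <code>unicodedata</code> module, whose <code>normalize()</code> function converts a string to a given normal form:

```python
import unicodedata

# U+006E (n) followed by U+0303 (combining tilde) composes to U+00F1 (ñ)
assert unicodedata.normalize("NFC", "n\u0303") == "\u00f1"
# and U+00F1 decomposes back into the two-code-point sequence
assert unicodedata.normalize("NFD", "\u00f1") == "n\u0303"

# A Hangul syllable block such as 한 (U+D55C) decomposes into its
# conjoining jamo: leading ᄒ (U+1112), vowel ᅡ (U+1161), trailing ᆫ (U+11AB)
assert unicodedata.normalize("NFD", "\ud55c") == "\u1112\u1161\u11ab"
```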
 
Sequences that are defined as '''compatible''' are assumed to have possibly distinct appearances, but the same meaning in some contexts. Thus, for example, the code point U+FB00 (the [[typographic ligature]] "ff") is defined to be compatible—but not canonically equivalent—to the sequence U+0066 U+0066 (two Latin "f" letters). Compatible sequences may be treated the same way in some applications (such as [[sorting]] and [[index (database)|index]]ing), but not in others; and may be substituted for each other in some situations, but not in others. Sequences that are canonically equivalent are also compatible, but the opposite is not necessarily true.
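The difference between the two notions can likewise be illustrated with Python's standard <code>unicodedata</code> module: canonical normalization leaves the ligature alone, while compatibility normalization decomposes it.

```python
import unicodedata

# U+FB00 (ff) is compatibility-equivalent, but not canonically
# equivalent, to the two-letter sequence "ff"
assert unicodedata.normalize("NFD", "\ufb00") == "\ufb00"   # unchanged
assert unicodedata.normalize("NFKD", "\ufb00") == "ff"      # decomposed
```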
 
The standard also defines a [[text normalization]] procedure, called '''Unicode normalization''', that replaces equivalent sequences of characters so that any two texts that are equivalent will be reduced to the same sequence of code points, called the '''normalization form''' or '''normal form''' of the original text. For each of the two equivalence notions, Unicode defines two normal forms, one '''fully composed''' (where multiple code points are replaced by single points whenever possible), and one '''fully decomposed''' (where single points are split into multiple ones).
 
==Sources of equivalence==
 
===Character duplication===
{{Main|Duplicate characters in Unicode}}
For compatibility or other reasons, Unicode sometimes assigns two different code points to entities that are essentially the same character. For example, the letter "A with a [[ring diacritic]] above" is encoded as {{unichar|00C5}} (a letter of the [[alphabet]] in [[Swedish language|Swedish]] and several other [[language]]s) or as {{unichar|212B}}. Yet the symbol for [[angstrom]] is defined to be that Swedish letter, and most other symbols that are letters (such as {{angbr|V}} for [[volt]]) do not have a separate code point for each usage. In general, the code points of truly identical characters are defined to be canonically equivalent.
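This duplication can be demonstrated with Python's standard <code>unicodedata</code> module: normalization always replaces the angstrom sign with the letter, so the mapping cannot be reversed.

```python
import unicodedata

# U+212B (angstrom sign) and U+00C5 (Å) are canonically equivalent
assert unicodedata.normalize("NFC", "\u212b") == "\u00c5"
# Both decompose to A followed by U+030A (combining ring above)
assert unicodedata.normalize("NFD", "\u212b") == "A\u030a"
```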
 
===Combining and precomposed characters===
For consistency with some older standards, Unicode provides single code points for many characters that could be viewed as modified forms of other characters (such as U+00F1 for "ñ" or U+00C5 for "Å") or as combinations of two or more characters (such as U+FB00 for the ligature "ff" or U+0132 for the [[Dutch alphabet|Dutch letter]] "[[IJ (digraph)|ij]]").
 
For consistency with other standards, and for greater flexibility, Unicode also provides codes for many elements that are not used on their own, but are meant instead to modify or combine with a preceding [[base character]]. Examples of these [[combining character]]s are {{unichar|0303|cwith=◌|nlink=}} and the [[Japanese script|Japanese]] diacritic [[dakuten]] ({{unichar|3099|cwith=◌|use=lang|use2=ja}}).
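The dakuten behaves like any other combining character under normalization, as a short check with Python's standard <code>unicodedata</code> module shows:

```python
import unicodedata

# か (U+304B) followed by the combining dakuten (U+3099)
# composes to the precomposed が (U+304C)
assert unicodedata.normalize("NFC", "\u304b\u3099") == "\u304c"
assert unicodedata.normalize("NFD", "\u304c") == "\u304b\u3099"
```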
 
In the context of Unicode, '''character composition''' is the process of replacing the code points of a base letter followed by one or more combining characters into a single [[precomposed character]]; and '''character decomposition''' is the opposite process.
 
In general, precomposed characters are defined to be canonically equivalent to the sequence of their base letter and subsequent combining diacritic marks, in whatever order these may occur.
 
====Example====
{| class="wikitable" align="center" width="50%" style="text-align: center;"
|+ ''Amélie'' with its two canonically equivalent [[Unicode]] forms ([[#Normal forms|NFC and NFD]])
|- style="background-color:#ffeaea"
! style="width: 10em;" | NFC character
| A || m || colspan="2" | é || l || i || e
|- style="background-color:#ffc6c6"
! NFC code point
| 0041 ||006d || colspan="2" |00e9 ||006c ||0069 ||0065
|- style="background-color:#c6efff"
! NFD code point
| 0041 ||006d ||0065 ||0301 ||006c ||0069 ||0065
|- style="background-color:#eaf9ff"
! NFD character
| A || m || e || ◌́ || l || i ||e
|}
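The table above can be verified with Python's standard <code>unicodedata</code> module:

```python
import unicodedata

nfc = "Am\u00e9lie"                       # é as the single code point U+00E9
nfd = unicodedata.normalize("NFD", nfc)
assert nfd == "Ame\u0301lie"              # e followed by U+0301 (combining acute)
assert unicodedata.normalize("NFC", nfd) == nfc
assert nfc != nfd                         # equivalent, but not equal as raw strings
```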
 
===Typographical non-interaction===
Some scripts regularly use multiple combining marks that do not, in general, interact typographically, and do not have precomposed characters for the combinations. Pairs of such non-interacting marks can be stored in either order. These alternative sequences are, in general, canonically equivalent. The rules that define their sequencing in the canonical form also define whether they are considered to interact.
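For example, the ogonek (combining class 202) and the acute accent (combining class 230) do not interact typographically, so both input orders normalize to the same sequence, as this Python sketch shows:

```python
import unicodedata

s1 = "a\u0328\u0301"   # a + combining ogonek + combining acute
s2 = "a\u0301\u0328"   # same marks, opposite order
# The two orders are canonically equivalent
assert unicodedata.normalize("NFD", s1) == unicodedata.normalize("NFD", s2)
```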
 
===Typographic conventions===
Unicode provides code points for some characters or groups of characters which are modified only for aesthetic reasons (such as [[Typographic ligature|ligatures]], the [[half-width katakana]] characters, or the [[full-width]] Latin letters for use in Japanese texts), or to add new semantics without losing the original one (such as digits in [[subscript]] or [[superscript]] positions, or the circled digits (such as "①") inherited from some Japanese fonts). Such a sequence is considered compatible with the sequence of original (individual and unmodified) characters, for the benefit of applications where the appearance and added semantics are not relevant. However, the two sequences are not declared canonically equivalent, since the distinction has some semantic value and affects the rendering of the text.
 
===Encoding errors===
 
[[UTF-8]] and [[UTF-16]] (and also some other Unicode encodings) do not allow all possible sequences of [[code unit]]s. Different software will convert invalid sequences into Unicode characters using varying rules, some of which are very lossy (e.g., turning all invalid sequences into the same character). This can be considered a form of normalization and can lead to the same difficulties as others.
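One common (and lossy) policy, illustrated here with Python's built-in UTF-8 decoder, replaces each invalid byte with U+FFFD, the replacement character:

```python
# \xff and \xfe can never appear in well-formed UTF-8
data = b"caf\xff\xfe"
assert data.decode("utf-8", errors="replace") == "caf\ufffd\ufffd"
```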
 
==Normalization==
Text processing software implementing the Unicode string search and comparison functionality must take into account the presence of equivalent code points. In the absence of this feature, users searching for a particular code point sequence would be unable to find other visually indistinguishable glyphs that have a different, but canonically equivalent, code point representation.
 
Unicode provides standard normalization algorithms that produce a unique (normal) code point sequence for all sequences that are equivalent; the equivalence criteria can be either canonical (NF) or compatibility (NFK). Since one can arbitrarily choose the [[representative (mathematics)|representative]] element of an [[equivalence class]], multiple canonical forms are possible for each equivalence criterion. Unicode provides two normal forms that are semantically meaningful for each of the two compatibility criteria: the composed forms NFC and NFKC, and the decomposed forms NFD and NFKD. Both the composed and decomposed forms impose a '''canonical ordering''' on the code point sequence, which is necessary for the normal forms to be unique.
 
In order to compare or search Unicode strings, software can use either composed or decomposed forms; this choice does not matter as long as it is the same for all strings involved in a search, comparison, etc. On the other hand, the choice of equivalence criteria can affect search results. For instance, some [[typographic ligature]]s like U+FB03 ({{char|ﬃ}}), [[Roman numerals]] like U+2168 ({{char|Ⅸ}}) and even [[Unicode subscripts and superscripts|subscripts and superscripts]], e.g. U+2075 ({{char|⁵}}), have their own Unicode code points. Canonical normalization (NF) does not affect any of these, but compatibility normalization (NFK) will decompose the ffi ligature into the constituent letters, so a search for U+0066 ({{char|f}}) as substring would succeed in an NFKC normalization of U+FB03 but not in an NFC normalization of U+FB03. Likewise when searching for the Latin letter {{char|I}} (U+0049) in the precomposed Roman numeral {{char|Ⅸ}} (U+2168). Similarly, the superscript {{char|⁵}} (U+2075) is transformed to {{char|5}} (U+0035) by compatibility mapping.
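These search effects can be reproduced with Python's standard <code>unicodedata</code> module:

```python
import unicodedata

ligature = "\ufb03"   # ﬃ
# Canonical normalization keeps the ligature as one code point,
# so a substring search for "f" fails:
assert "f" not in unicodedata.normalize("NFC", ligature)
# Compatibility normalization decomposes it into "ffi":
assert "f" in unicodedata.normalize("NFKC", ligature)
# Likewise for the Roman numeral Ⅸ and the superscript ⁵:
assert unicodedata.normalize("NFKC", "\u2168") == "IX"
assert unicodedata.normalize("NFKC", "\u2075") == "5"
```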
 
Transforming superscripts into baseline equivalents may not be appropriate, however, for [[rich text]] software, because the superscript information is lost in the process. To allow for this distinction, the Unicode character database contains '''compatibility formatting tags''' that provide additional details on the compatibility transformation.<ref>{{cite web|url=https://www.unicode.org/reports/tr44/|title=UAX #44: Unicode Character Database|publisher=Unicode.org|access-date=20 November 2014}}</ref> In the case of typographic ligatures, this tag is simply <code><compat></code>, while for the superscript it is <code><super></code>. Rich text standards like [[HTML]] take into account the compatibility tags. For instance, HTML uses its own markup to position a U+0035 in a superscript position.<ref>{{cite web|url=http://unicode.org/reports/tr20/tr20-2.html#Compatibility|title=Unicode in XML and other Markup Languages|publisher=Unicode.org|access-date=20 November 2014}}</ref>
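Python's standard <code>unicodedata.decomposition()</code> function exposes these tags directly: compatibility mappings carry a formatting tag in angle brackets, while canonical mappings do not.

```python
import unicodedata

assert unicodedata.decomposition("\ufb03") == "<compat> 0066 0066 0069"  # ﬃ
assert unicodedata.decomposition("\u2075") == "<super> 0035"             # ⁵
assert unicodedata.decomposition("\u00e9") == "0065 0301"                # é: canonical, no tag
```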
 
===Normal forms===
The four Unicode normalization forms and the algorithms (transformations) for obtaining them are listed in the table below.
 
{| class="wikitable"
|'''NFD'''<br>''Normalization Form Canonical Decomposition''
|Characters are decomposed by canonical equivalence, and multiple combining characters are arranged in a specific order.
|-
|'''NFC'''<br>''Normalization Form Canonical Composition''
|Characters are decomposed and then recomposed by canonical equivalence. It is possible for the result to be a different sequence of characters than the original, in the case of '''singletons''' (see the example below the table).
|-
|'''NFKD'''<br>''Normalization Form Compatibility Decomposition''
|Characters are decomposed by compatibility equivalence, and multiple combining characters are arranged in a specific order.
|-
|'''NFKC'''<br>''Normalization Form Compatibility Composition''
|Characters are decomposed by compatibility equivalence, then recomposed by canonical equivalence.
|}
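The four forms can be compared on a single string with Python's standard <code>unicodedata</code> module; the sample mixes a compatibility character (the ﬁ ligature) with a canonical combining sequence:

```python
import unicodedata

s = "\ufb01n\u0303"   # ﬁ (U+FB01) followed by n and a combining tilde
assert unicodedata.normalize("NFD",  s) == "\ufb01n\u0303"   # ligature kept, ñ decomposed
assert unicodedata.normalize("NFC",  s) == "\ufb01\u00f1"    # ligature kept, ñ composed
assert unicodedata.normalize("NFKD", s) == "fin\u0303"       # ligature split, ñ decomposed
assert unicodedata.normalize("NFKC", s) == "fi\u00f1"        # ligature split, ñ composed
```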
 
All these algorithms are [[idempotent]] transformations, meaning that a string that is already in one of these normalized forms will not be modified if processed again by the same algorithm.
 
The normal forms are not [[closure (mathematics)|closed]] under string [[concatenation]].<ref>Per [http://www.unicode.org/faq/normalization.html#5 What should be done about concatenation]</ref> For defective Unicode strings starting with a Hangul vowel or trailing [[Hangul Jamo (Unicode block)|conjoining jamo]], concatenation can break composition.
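A minimal demonstration in Python, using both a combining mark and conjoining jamo:

```python
import unicodedata

# "e" and a lone U+0301 are each already in NFC...
assert unicodedata.normalize("NFC", "e") == "e"
assert unicodedata.normalize("NFC", "\u0301") == "\u0301"
# ...but their concatenation is not: it composes to é (U+00E9)
assert unicodedata.normalize("NFC", "e" + "\u0301") == "\u00e9"

# The same happens with conjoining jamo: ᄑ (U+1111) + ᅱ (U+1171)
# composes to the single syllable block 퓌 (U+D4CC)
assert unicodedata.normalize("NFC", "\u1111" + "\u1171") == "\ud4cc"
```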
 
However, they are not [[injective function|injective]] (they map different original glyphs and sequences to the same normalized sequence) and thus also not [[bijection|bijective]] (cannot be restored). For example, the distinct Unicode strings "U+212B" (the angstrom sign "Å") and "U+00C5" (the Swedish letter "Å") are both expanded by NFD (or NFKD) into the sequence "U+0041 U+030A" (Latin letter "A" and combining [[ring above]] "°") which is then reduced by NFC (or NFKC) to "U+00C5" (the Swedish letter "Å").
 
A single character (other than a Hangul syllable block) that will get replaced by another under normalization can be identified in the Unicode tables for having a non-empty compatibility field but lacking a compatibility tag.
 
===Canonical ordering===
The canonical ordering is mainly concerned with the ordering of a sequence of combining characters. For the examples in this section we assume these characters to be [[diacritics]], even though in general some diacritics are not combining characters, and some combining characters are not diacritics.
 
Unicode assigns each character a '''combining class''', which is identified by a numerical value. Non-combining characters have class number 0, while combining characters have a positive combining class value. To obtain the canonical ordering, every substring of characters having non-zero combining class value must be sorted by the combining class value using a [[sorting algorithm#Stability|stable sorting]] algorithm. Stable sorting is required because combining characters with the same class value are assumed to interact typographically, thus the two possible orders are ''not'' considered equivalent.
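Combining classes can be queried, and the reordering observed, with Python's standard <code>unicodedata</code> module:

```python
import unicodedata

assert unicodedata.combining("n") == 0          # non-combining
assert unicodedata.combining("\u0328") == 202   # ogonek, attaches below
assert unicodedata.combining("\u0301") == 230   # acute, renders above
# Marks with different classes are stable-sorted into canonical order...
assert unicodedata.normalize("NFD", "a\u0301\u0328") == "a\u0328\u0301"
# ...but marks of the same class are never swapped (their order matters)
assert unicodedata.normalize("NFD", "a\u0301\u0302") == "a\u0301\u0302"
```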
 
For example, the character U+1EBF (ế), used in [[Vietnamese alphabet|Vietnamese]], has both an acute and a circumflex accent. Its canonical decomposition is the three-character sequence U+0065 (e) U+0302 (circumflex accent) U+0301 (acute accent). The combining classes for the two accents are both 230, thus U+1EBF is not canonically equivalent to U+0065 U+0301 U+0302.
 
Since not all combining sequences have a precomposed equivalent (the last one in the previous example can only be reduced to U+00E9 U+0302), even the normal form NFC is affected by combining characters' behavior.
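The Vietnamese example can be checked with Python's standard <code>unicodedata</code> module:

```python
import unicodedata

# ế (U+1EBF) decomposes to e + circumflex + acute, in that order
assert unicodedata.normalize("NFD", "\u1ebf") == "e\u0302\u0301"
# The reversed accent order is NOT equivalent: NFC can only
# recompose e + acute into é (U+00E9), leaving the circumflex behind
assert unicodedata.normalize("NFC", "e\u0301\u0302") == "\u00e9\u0302"
```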
 
==Errors due to normalization differences==
When two applications share Unicode data, but normalize them differently, errors and data loss can result. In one specific instance, [[OS X]] normalized Unicode filenames sent from the [[Netatalk]] and [[Samba (software)|Samba]] file- and printer-sharing software. Netatalk and Samba did not recognize the altered filenames as equivalent to the original, leading to data loss.<ref>{{cite web|url=https://sourceforge.net/tracker/?func=detail&aid=2727174&group_id=8642&atid=108642|title=netatalk / Bugs / #349 volcharset:UTF8 doesn't work from Mac|website=[[SourceForge]]|access-date=20 November 2014}}</ref><ref>{{cite web |url=http://forums.macosxhints.com/archive/index.php/t-99344.html |title=rsync, samba, UTF8, international characters, oh my! |archive-url=https://web.archive.org/web/20100109162824/http://forums.macosxhints.com/archive/index.php/t-99344.html |year=<!--03-01-2009-->2009 |archive-date=January 9, 2010}}</ref> Resolving such an issue is non-trivial, as normalization is not losslessly invertible.
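The mismatch can be sketched in Python: a filename stored in decomposed form compares unequal, byte for byte, to its composed counterpart unless both sides normalize consistently.

```python
import unicodedata

stored = unicodedata.normalize("NFD", "Am\u00e9lie.txt")   # decomposed, as some filesystems store it
requested = "Am\u00e9lie.txt"                              # composed (NFC), as typically typed
assert stored != requested                                 # naive comparison fails
assert unicodedata.normalize("NFC", stored) == requested   # normalized comparison succeeds
```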
 
==See also==
* [[Complex text layout]]
* [[Diacritic]]
* [[IDN homograph attack]]
* [[ISO/IEC 14651]]
* [[Ligature (typography)]]
* [[Precomposed character]]
* [[Unicode]]
* [[Unicode compatibility characters]]
* The [[uconv]] tool can convert to and from NFC and NFD Unicode normalization forms.
 
==Notes==
{{reflist}}
 
==References==
* [http://unicode.org/reports/tr15/ Unicode Standard Annex #15: Unicode Normalization Forms]
 
==External links==
* [https://www.unicode.org/faq/normalization.html Unicode.org FAQ - Normalization]
* [http://www.w3.org/International/charlint/ Charlint - a character normalization tool] written in Perl
{{Unicode navigation}}
 
[[Category:Unicode algorithms|Normalization]]
 