Unicode equivalence: Difference between revisions

Content deleted Content added
Amikake3
m +ja:
Bender the Bot
 
(217 intermediate revisions by more than 100 users not shown)
Line 1:
{{Short description|Aspect of the Unicode standard}}
[[Unicode]] contains numerous characters to maintain compatibility with existing standards, some of which are functionally equivalent to other characters or sequences of characters. Because of this, Unicode defines some as equivalent. For example, the n character followed by the combining ~ character is equivalent to the single Unicode ñ character. Unicode maintains two standards for defining equivalence.
{{Refimprove|date=November 2014}}
'''Unicode equivalence''' is the specification by the [[Unicode]] [[character (computing)|character]] encoding standard that some sequences of [[code point]]s represent essentially the same character. This feature was introduced in the standard to allow compatibility with pre-existing standard [[character set]]s, which often included similar or identical characters.
 
[[Unicode]] provides two such notions, [[canonical form|canonical]] equivalence and compatibility. [[Code point]] sequences that are defined as '''canonically equivalent''' are assumed to have the same appearance and meaning when printed or displayed. For example, the code point {{unichar|006E|Latin small letter n|nlink=N}} followed by {{unichar|0303|Combining tilde|cwith=◌|nlink=combining character}} is defined by Unicode to be canonically equivalent to the single code point {{unichar|00F1|LATIN SMALL LETTER N WITH TILDE}} of the [[Spanish alphabet]]. Therefore, those sequences should be displayed in the same manner, should be treated in the same way by applications such as [[alphabetical order|alphabetizing]] names or [[string searching|searching]], and may be substituted for each other. Similarly, each [[Hangul]] syllable block that is encoded as a single character may be equivalently encoded as a combination of a leading conjoining jamo, a vowel conjoining jamo, and, if appropriate, a trailing conjoining jamo.
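
The canonical equivalence of these sequences can be checked with a short sketch. It assumes Python and its standard <code>unicodedata</code> module, neither of which is prescribed by the Unicode standard; the variable names are illustrative only.

```python
import unicodedata

decomposed = "n\u0303"   # U+006E followed by U+0303 COMBINING TILDE
precomposed = "\u00f1"   # U+00F1 LATIN SMALL LETTER N WITH TILDE

# The raw code point sequences differ...
raw_equal = decomposed == precomposed

# ...but canonical normalization reduces both to the same sequence.
nfc_equal = unicodedata.normalize("NFC", decomposed) == precomposed
nfd_equal = unicodedata.normalize("NFD", precomposed) == decomposed

# A Hangul syllable block decomposes into its conjoining jamo.
syllable = "\ud55c"  # U+D55C HANGUL SYLLABLE HAN
jamo = unicodedata.normalize("NFD", syllable)
jamo_code_points = [f"U+{ord(ch):04X}" for ch in jamo]
```

Here <code>jamo_code_points</code> lists a leading, a vowel and a trailing conjoining jamo, and normalizing <code>jamo</code> back with NFC yields the original syllable.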
==Canonical Equivalence==
 
Sequences that are defined as '''compatible''' are assumed to have possibly distinct appearances, but the same meaning in some contexts. Thus, for example, the code point U+FB00 (the [[typographic ligature]] "ff") is defined to be compatible—but not canonically equivalent—to the sequence U+0066 U+0066 (two Latin "f" letters). Compatible sequences may be treated the same way in some applications (such as [[sorting]] and [[index (database)|index]]ing), but not in others; and may be substituted for each other in some situations, but not in others. Sequences that are canonically equivalent are also compatible, but the opposite is not necessarily true.
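
As an illustration of the difference between the two notions, the following sketch (assuming Python's standard <code>unicodedata</code> module) normalizes the "ff" ligature under a canonical form and under a compatibility form.

```python
import unicodedata

ligature = "\ufb00"  # U+FB00 LATIN SMALL LIGATURE FF

# Canonical normalization leaves the ligature untouched, because it is
# not canonically equivalent to any other sequence.
canonical = unicodedata.normalize("NFC", ligature)

# Compatibility normalization replaces it with two Latin "f" letters.
compatibility = unicodedata.normalize("NFKC", ligature)
```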
Canonical equivalence is the narrower form of equivalence and relates characters and sequences that are both visually and functionally equivalent. For example, precomposed letters with diacritics are canonically equivalent to their decomposed sequence of base letter and combining diacritical marks. In other words, the precomposed character ‘ü’ is canonically equivalent to the sequence of ‘u’ followed by ‘¨’, a combining diaeresis. Similarly, Unicode unifies several Greek diacritics and punctuation characters with other diacritics that have the same appearance.
 
The standard also defines a [[text normalization]] procedure, called '''Unicode normalization''', that replaces equivalent sequences of characters so that any two texts that are equivalent will be reduced to the same sequence of code points, called the '''normalization form''' or '''normal form''' of the original text. For each of the two equivalence notions, Unicode defines two normal forms, one '''fully composed''' (where multiple code points are replaced by single points whenever possible), and one '''fully decomposed''' (where single points are split into multiple ones).
==Compatibility Equivalence==
 
==Sources of equivalence==
Compatibility equivalence is broader than canonical equivalence. Anything that is canonically equivalent is also compatibility equivalent, but the opposite is not necessarily true. The compatibility characters that are not canonically equivalent are treated as equivalent at the plain text level even though their forms are visually, and therefore potentially semantically, distinct. For example, superscript and subscript numerals are compatibility equivalent to their core decimal digit counterparts. The subscript and superscript forms, through their visually distinct presentation, typically also convey distinct meaning. However, this distinct meaning could be better handled in a more open-ended way through the use of rich text protocols beyond Unicode. For example, although the character set includes subscript digits 0 through 9, other characters can only be made subscript through the use of rich text protocols. Therefore Unicode considers such visual and semantic variations a task for rich text and not plain text. Full-width and half-width katakana characters are also compatibility equivalent, as are ligatures and their component letter sequences. For these latter examples, there is usually only a visual and not a semantic distinction. In other words, an author does not typically declare the presence of ligatures or vertical text as meaning one thing and non-ligatures and horizontal text as meaning something entirely different. Rather, these are strictly visual typographic design choices.
 
===Character duplication===
== Visual ambiguity ==
{{Main|Duplicate characters in Unicode}}
The presence of either canonical or non-canonical equivalent characters can lead to visual ambiguity and confusion for users of text processing software. For example, software should typically render canonically equivalent characters as indistinguishable from one another. If a user searches for one character, it may not be clear why the software does not highlight an identical-looking character. For the non-canonical equivalent characters, visual ambiguity can arise when, for example, a superscript digit character appears alongside a standard digit with rich text superscript formatting. To handle such situations, Unicode recommends text processing algorithms such as normalization that treat these characters and character sequences as identical in certain circumstances.
For compatibility or other reasons, Unicode sometimes assigns two different code points to entities that are essentially the same character. For example, the letter "A with a [[ring diacritic]] above" is encoded as {{unichar|00C5}} (a letter of the [[alphabet]] in [[Swedish language|Swedish]] and several other [[language]]s) or as {{unichar|212B}}. Yet the symbol for [[angstrom]] is defined to be that Swedish letter, and most other symbols that are letters (such as {{angbr|V}} for [[volt]]) do not have a separate code point for each usage. In general, the code points of truly identical characters are defined to be canonically equivalent.
 
===Combining and precomposed characters===
For consistency with some older standards, Unicode provides single code points for many characters that could be viewed as modified forms of other characters (such as U+00F1 for "ñ" or U+00C5 for "Å") or as combinations of two or more characters (such as U+FB00 for the ligature "ff" or U+0132 for the [[Dutch alphabet|Dutch letter]] "[[IJ (digraph)|ij]]").
 
For consistency with other standards, and for greater flexibility, Unicode also provides codes for many elements that are not used on their own, but are meant instead to modify or combine with a preceding [[base character]]. Examples of these [[combining character]]s are {{unichar|0303|cwith=◌|nlink=}} and the [[Japanese script|Japanese]] diacritic [[dakuten]] ({{unichar|3099|cwith=◌|use=lang|use2=ja}}).
 
In the context of Unicode, '''character composition''' is the process of replacing the code points of a base letter followed by one or more combining characters with a single [[precomposed character]]; and '''character decomposition''' is the opposite process.
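
A minimal sketch of these notions, assuming Python's standard <code>unicodedata</code> module: <code>unicodedata.combining()</code> reports whether a character is a combining mark, and the NFC and NFD forms perform composition and decomposition respectively. The hiragana example is chosen here for illustration.

```python
import unicodedata

tilde = "\u0303"    # U+0303 COMBINING TILDE
dakuten = "\u3099"  # U+3099 COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK

# combining() returns the canonical combining class; 0 means the
# character is not a combining mark.
base_class = unicodedata.combining("n")
tilde_class = unicodedata.combining(tilde)
dakuten_class = unicodedata.combining(dakuten)

# Composition: base character + combining mark -> precomposed character.
ka = "\u304b"  # U+304B HIRAGANA LETTER KA
ga = unicodedata.normalize("NFC", ka + dakuten)

# Decomposition is the opposite process.
ga_decomposed = unicodedata.normalize("NFD", ga)
```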
 
In general, precomposed characters are defined to be canonically equivalent to the sequence of their base letter and subsequent combining diacritic marks, in whatever order these may occur.
 
====Example====
{| class="wikitable" align="center" width="50%" style="text-align: center;"
|+ |''Amélie'' with its two canonically equivalent [[Unicode]] forms ([[#Normal_forms|NFC and NFD]])
|- style="background-color:#ffeaea"
! style="width: 10em;" | NFC character
| | A || m || colspan="2" | é || l || i || e
|- style="background-color:#ffc6c6"
! NFC code point
| 0041 ||006d || colspan="2" |00e9 ||006c ||0069 ||0065
|- style="background-color:#c6efff"
! NFD code point
| 0041 ||006d ||0065 ||0301 ||006c ||0069 ||0065
|- style="background-color:#eaf9ff"
! NFD character
| A || m || e || ◌́ || l || i ||e
|}
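
The code point rows of the table above can be reproduced with a short sketch. It assumes Python's standard <code>unicodedata</code> module; the helper <code>code_points</code> is a name introduced here for illustration.

```python
import unicodedata

def code_points(text):
    """Return the code points of text as four-digit lowercase hex strings."""
    return [f"{ord(ch):04x}" for ch in text]

word = "Am\u00e9lie"
nfc = code_points(unicodedata.normalize("NFC", word))
nfd = code_points(unicodedata.normalize("NFD", word))
```

The NFC list has six entries and the NFD list seven, because the single code point for "é" is split into "e" and the combining acute accent.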
 
===Typographical non-interaction===
Some scripts regularly use multiple combining marks that do not, in general, interact typographically, and do not have precomposed characters for the combinations. Pairs of such non-interacting marks can be stored in either order. These alternative sequences are, in general, canonically equivalent. The rules that define their sequencing in the canonical form also define whether they are considered to interact.
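
The following sketch illustrates this with two marks that attach at different positions, a dot below and an acute accent above. It assumes Python's standard <code>unicodedata</code> module; the base letter "q" is chosen only because no precomposed character exists for the combination.

```python
import unicodedata

dot_below = "\u0323"  # U+0323 COMBINING DOT BELOW
acute = "\u0301"      # U+0301 COMBINING ACUTE ACCENT

order_a = "q" + dot_below + acute
order_b = "q" + acute + dot_below

# The stored sequences differ, but normalization puts the marks into one
# fixed order, so the two strings compare equal afterwards.
same_raw = order_a == order_b
same_normalized = (unicodedata.normalize("NFD", order_a)
                   == unicodedata.normalize("NFD", order_b))
```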
 
===Typographic conventions===
Unicode provides code points for some characters or groups of characters which are modified only for aesthetic reasons (such as [[Typographic ligature|ligatures]], the [[half-width katakana]] characters, or the [[full-width]] Latin letters for use in Japanese texts), or to add new semantics without losing the original one (such as digits in [[subscript]] or [[superscript]] positions, or the circled digits (such as "①") inherited from some Japanese fonts). Such a sequence is considered compatible with the sequence of original (individual and unmodified) characters, for the benefit of applications where the appearance and added semantics are not relevant. However, the two sequences are not declared canonically equivalent, since the distinction has some semantic value and affects the rendering of the text.
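
A sketch of how such characters behave under the two kinds of normalization, assuming Python's standard <code>unicodedata</code> module; the sample characters and labels are chosen here for illustration.

```python
import unicodedata

samples = {
    "ligature ff": "\ufb00",
    "half-width katakana ka": "\uff76",
    "full-width Latin A": "\uff21",
    "superscript five": "\u2075",
    "circled digit one": "\u2460",
}

# Canonical normalization keeps each character as it is.
canonical = {label: unicodedata.normalize("NFC", ch)
             for label, ch in samples.items()}

# Compatibility normalization substitutes the original, unmodified
# characters, discarding the appearance and the added semantics.
compatible = {label: unicodedata.normalize("NFKC", ch)
              for label, ch in samples.items()}
```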
 
===Encoding errors===
 
[[UTF-8]] and [[UTF-16]] (and also some other Unicode encodings) do not allow all possible sequences of [[code unit]]s. Different software will convert invalid sequences into Unicode characters using varying rules, some of which are very lossy (e.g., turning all invalid sequences into the same character). This can be considered a form of normalization and can lead to the same difficulties as others.
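
One such lossy rule can be sketched with Python's built-in decoder, used here only as an example of a converter. With <code>errors="replace"</code>, every undecodable byte becomes U+FFFD, so two different invalid inputs collapse into the same string.

```python
# Two different byte strings, each ending in a byte that is never valid
# in UTF-8.
bad_a = b"caf\xff"
bad_b = b"caf\xfe"

# errors="replace" substitutes U+FFFD REPLACEMENT CHARACTER for each
# undecodable byte, so the distinction between the inputs is lost.
text_a = bad_a.decode("utf-8", errors="replace")
text_b = bad_b.decode("utf-8", errors="replace")

# The default, errors="strict", rejects the input instead of converting it.
try:
    bad_a.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```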
 
==Normalization==
Text processing software that implements Unicode string search and comparison must take into account the presence of equivalent code points. In the absence of this feature, users searching for a particular code point sequence would be unable to find other visually indistinguishable glyphs that have a different, but canonically equivalent, code point representation.
 
=== Algorithms ===
Unicode provides standard normalization algorithms that produce a unique (normal) code point sequence for all sequences that are equivalent; the equivalence criteria can be either canonical (NF) or compatibility (NFK). Since one can arbitrarily choose the [[representative (mathematics)|representative]] element of an [[equivalence class]], multiple canonical forms are possible for each equivalence criterion. Unicode provides two normal forms that are semantically meaningful for each of the two equivalence criteria: the composed forms NFC and NFKC, and the decomposed forms NFD and NFKD. Both the composed and decomposed forms impose a '''canonical ordering''' on the code point sequence, which is necessary for the normal forms to be unique.
 
In order to compare or search Unicode strings, software can use either composed or decomposed forms; this choice does not matter as long as it is the same for all strings involved in a search, comparison, etc. On the other hand, the choice of equivalence criteria can affect search results. For instance, some [[typographic ligature]]s like U+FB03 ({{char|ffi}}), [[Roman numerals]] like U+2168 ({{char|Ⅸ}}) and even [[Unicode subscripts and superscripts|subscripts and superscripts]], e.g. U+2075 ({{char|⁵}}) have their own Unicode code points. Canonical normalization (NF) does not affect any of these, but compatibility normalization (NFK) will decompose the ffi ligature into the constituent letters, so a search for U+0066 ({{char|f}}) as substring would succeed in an NFKC normalization of U+FB03 but not in NFC normalization of U+FB03. The same applies when searching for the Latin letter {{char|I}} (U+0049) in the precomposed Roman numeral {{char|Ⅸ}} (U+2168). Similarly, the superscript {{char|⁵}} (U+2075) is transformed to {{char|5}} (U+0035) by compatibility mapping.
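
The effect of the equivalence criterion on searching can be sketched as follows. The example assumes Python's standard <code>unicodedata</code> module; the helper <code>contains</code> is a hypothetical name, not part of any library.

```python
import unicodedata

def contains(haystack, needle, form):
    """Substring test after normalizing both strings to the same form."""
    return (unicodedata.normalize(form, needle)
            in unicodedata.normalize(form, haystack))

ffi = "\ufb03"               # U+FB03 LATIN SMALL LIGATURE FFI
roman_nine = "\u2168"        # U+2168 ROMAN NUMERAL NINE
superscript_five = "\u2075"  # U+2075 SUPERSCRIPT FIVE

# Canonical normalization does not split the ligature or the numeral.
f_found_nfc = contains(ffi, "f", "NFC")
i_found_nfc = contains(roman_nine, "I", "NFC")

# Compatibility normalization does.
f_found_nfkc = contains(ffi, "f", "NFKC")
i_found_nfkc = contains(roman_nine, "I", "NFKC")
five = unicodedata.normalize("NFKC", superscript_five)

# Canonically equivalent spellings match under either canonical form.
tilde_found = contains("jalape\u00f1o", "n\u0303", "NFC")
```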
 
Transforming superscripts into baseline equivalents may not be appropriate, however, for [[rich text]] software, because the superscript information is lost in the process. To allow for this distinction, the Unicode character database contains '''compatibility formatting tags''' that provide additional details on the compatibility transformation.<ref>{{cite web|url=https://www.unicode.org/reports/tr44/#Character_Decomposition_Mappings|title=UAX #44: Unicode Character Database|publisher=Unicode.org|access-date=20 November 2014}}</ref> In the case of typographic ligatures, this tag is simply <code><compat></code>, while for the superscript it is <code><super></code>. Rich text standards like [[HTML]] take into account the compatibility tags. For instance, HTML uses its own markup to position a U+0035 in a superscript position.<ref>{{cite web|url=http://unicode.org/reports/tr20/tr20-2.html#Compatibility|title=Unicode in XML and other Markup Languages|publisher=Unicode.org|access-date=20 November 2014}}</ref>
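
The tags can be inspected with Python's standard <code>unicodedata</code> module, whose <code>decomposition()</code> function returns the decomposition mapping of a character as a string. The helper <code>decomposition_tag</code> below is a name introduced for this sketch.

```python
import unicodedata

def decomposition_tag(ch):
    """Return the compatibility formatting tag of ch, or None if absent."""
    field = unicodedata.decomposition(ch)
    if field.startswith("<"):
        # The tag is the first space-separated token, e.g. "<super>".
        return field.split(" ", 1)[0]
    return None

ligature_tag = decomposition_tag("\ufb00")     # ligature "ff"
superscript_tag = decomposition_tag("\u2075")  # superscript five
canonical_tag = decomposition_tag("\u00f1")    # canonical mapping, no tag
plain_tag = decomposition_tag("n")             # no decomposition at all
```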
 
===Normal forms===
The four Unicode normalization forms and the algorithms (transformations) for obtaining them are listed in the table below.
 
{| class="wikitable"
|'''NFD'''<br>''Normalization Form Canonical Decomposition''
|Characters are decomposed by canonical equivalence, and multiple combining characters are arranged in a specific order.
|-
|'''NFC'''<br>''Normalization Form Canonical Composition''
|Characters are decomposed and then recomposed by canonical equivalence.
|-
|'''NFKD'''<br>''Normalization Form Compatibility Decomposition''
|Characters are decomposed by compatibility, and multiple combining characters are arranged in a specific order.
|-
|'''NFKC'''<br>''Normalization Form Compatibility Composition''
|Characters are decomposed by compatibility, then recomposed by canonical equivalence.
|}
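
The four forms can be compared on one sample string. This is a sketch assuming Python's standard <code>unicodedata</code> module; the sample, a "fi" ligature followed by a precomposed "é", is chosen so that every form gives a different result.

```python
import unicodedata

sample = "\ufb01\u00e9"  # U+FB01 LATIN SMALL LIGATURE FI, U+00E9 (é)

forms = {name: unicodedata.normalize(name, sample)
         for name in ("NFD", "NFC", "NFKD", "NFKC")}

# Number of code points in each normal form of the sample.
lengths = {name: len(text) for name, text in forms.items()}
```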
 
All these algorithms are [[idempotent]] transformations, meaning that a string that is already in one of these normalized forms will not be modified if processed again by the same algorithm.
 
The normal forms are not [[closure (mathematics)|closed]] under string [[concatenation]].<ref> Per [http://www.unicode.org/faq/normalization.html#5 What should be done about concatenation]</ref> For defective Unicode strings starting with a Hangul vowel or trailing [[Hangul Jamo (Unicode block)|conjoining jamo]], concatenation can break composition.
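
Both properties can be observed in a short sketch, assuming Python's standard <code>unicodedata</code> module; the wrapper <code>nfc</code> is a name introduced here.

```python
import unicodedata

def nfc(text):
    return unicodedata.normalize("NFC", text)

# Idempotence: normalizing an already normalized string changes nothing.
once = nfc("e\u0301")
twice = nfc(once)

# Concatenation: each part is in NFC on its own, but the joined string
# is not, because the accent now follows a base letter.
left, right = "e", "\u0301"
parts_normalized = nfc(left) == left and nfc(right) == right
joined = left + right
joined_normalized = nfc(joined) == joined

# The same happens when a string starting with a Hangul vowel jamo is
# appended to a leading consonant jamo.
hangul_joined = "\u1100" + "\u1161"
hangul_composed = nfc(hangul_joined)
```

A cautious approach, under these assumptions, is to normalize again after concatenating.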
 
The normalization transformations are not [[injective function|injective]] (they map different original code point sequences to the same normalized sequence) and thus also not [[bijection|bijective]] (the original text cannot be restored). For example, the distinct Unicode strings "U+212B" (the angstrom sign "Å") and "U+00C5" (the Swedish letter "Å") are both expanded by NFD (or NFKD) into the sequence "U+0041 U+030A" (Latin letter "A" and combining [[ring above]] "°") which is then reduced by NFC (or NFKC) to "U+00C5" (the Swedish letter "Å").
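
This example can be reproduced with Python's standard <code>unicodedata</code> module, used here as an assumed implementation of the normalization forms.

```python
import unicodedata

angstrom_sign = "\u212b"   # U+212B ANGSTROM SIGN
swedish_letter = "\u00c5"  # U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE

# Both inputs decompose to the same two code points...
nfd_sign = unicodedata.normalize("NFD", angstrom_sign)
nfd_letter = unicodedata.normalize("NFD", swedish_letter)

# ...and both compose to U+00C5, so after normalization there is no way
# to tell which of the two characters the original text contained.
nfc_sign = unicodedata.normalize("NFC", angstrom_sign)
nfc_letter = unicodedata.normalize("NFC", swedish_letter)
```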
 
A single character (other than a Hangul syllable block) that will get replaced by another under normalization can be identified in the Unicode tables by having a non-empty decomposition field that lacks a compatibility tag.
 
===Canonical ordering===
The canonical ordering is mainly concerned with the ordering of a sequence of combining characters. For the examples in this section we assume these characters to be [[diacritics]], even though in general some diacritics are not combining characters, and some combining characters are not diacritics.
 
Unicode assigns each character a '''combining class''', which is identified by a numerical value. Non-combining characters have class number 0, while combining characters have a positive combining class value. To obtain the canonical ordering, every substring of characters having non-zero combining class value must be sorted by the combining class value using a [[Sorting algorithm#Stability|stable sorting]] algorithm. Stable sorting is required because combining characters with the same class value are assumed to interact typographically, thus the two possible orders are ''not'' considered equivalent.
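
The reordering step can be sketched as follows. This is a simplified illustration, not the full normalization algorithm: it assumes its input is already fully decomposed, it uses Python's standard <code>unicodedata.combining()</code> to look up the combining class, and it relies on Python's <code>sorted()</code> being stable. The function name <code>canonical_order</code> is introduced here.

```python
import unicodedata

def canonical_order(text):
    """Sort each run of combining marks by combining class (stable)."""
    result = []
    run = []  # pending characters with a non-zero combining class
    for ch in text:
        if unicodedata.combining(ch) == 0:
            # A class-0 character ends the current run of marks.
            result.extend(sorted(run, key=unicodedata.combining))
            run = []
            result.append(ch)
        else:
            run.append(ch)
    result.extend(sorted(run, key=unicodedata.combining))
    return "".join(result)

# Marks of different classes (acute above, dot below) are reordered.
reordered = canonical_order("q\u0301\u0323")

# Marks of the same class (acute, circumflex) keep their given order.
unchanged = canonical_order("e\u0301\u0302")
```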
 
For example, the character U+1EBF (ế), used in [[Vietnamese alphabet|Vietnamese]], has both an acute and a circumflex accent. Its canonical decomposition is the three-character sequence U+0065 (e) U+0302 (circumflex accent) U+0301 (acute accent). The combining classes for the two accents are both 230, thus U+1EBF is not equivalent to U+0065 U+0301 U+0302.
 
Since not all combining sequences have a precomposed equivalent (the last one in the previous example can only be reduced to U+00E9 U+0302), even the normal form NFC is affected by combining characters' behavior.
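
The Vietnamese example can be checked with Python's standard <code>unicodedata</code> module, assumed here as the normalization implementation.

```python
import unicodedata

precomposed = "\u1ebf"       # U+1EBF, e with circumflex and acute
swapped = "e\u0301\u0302"    # e, acute, circumflex (accents swapped)

# Canonical decomposition of U+1EBF: e, circumflex, acute.
decomposed = unicodedata.normalize("NFD", precomposed)

# The swapped sequence is not reordered, because both accents have the
# same combining class, so it stays distinct from the decomposition.
swapped_nfd = unicodedata.normalize("NFD", swapped)

# Under NFC the swapped sequence only composes partially, to é followed
# by a combining circumflex.
swapped_nfc = unicodedata.normalize("NFC", swapped)
```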
 
==Errors due to normalization differences==
When two applications share Unicode data, but normalize them differently, errors and data loss can result. In one specific instance, [[OS X]] normalized Unicode filenames sent from the [[Netatalk]] and [[Samba (software)|Samba]] file- and printer-sharing software. Netatalk and Samba did not recognize the altered filenames as equivalent to the original, leading to data loss.<ref>{{cite web|url=https://sourceforge.net/tracker/?func=detail&aid=2727174&group_id=8642&atid=108642|title=netatalk / Bugs / #349 volcharset:UTF8 doesn't work from Mac|website=[[SourceForge]]|access-date=20 November 2014}}</ref><ref>{{cite web |url=http://forums.macosxhints.com/archive/index.php/t-99344.html |title=rsync, samba, UTF8, international characters, oh my! |archive-url=https://web.archive.org/web/20100109162824/http://forums.macosxhints.com/archive/index.php/t-99344.html |year=<!--03-01-2009-->2009 |archive-date=January 9, 2010}}</ref> Resolving such an issue is non-trivial, as normalization is not losslessly invertible.
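
The kind of mismatch described above can be sketched in a few lines. The example is hypothetical and does not reproduce the behavior of any of the named programs: a dictionary stands in for a file store, the file name is invented, and Python's standard <code>unicodedata</code> module provides the normalization.

```python
import unicodedata

# Hypothetical store that keeps names exactly as first received (NFC here).
stored = {"r\u00e9sum\u00e9.txt": b"contents"}

# Another program sends back the same name in decomposed form (NFD).
requested = unicodedata.normalize("NFD", "r\u00e9sum\u00e9.txt")

# A byte-for-byte lookup misses the file.
naive_hit = requested in stored

def lookup(table, name):
    """Find an entry whose name is canonically equivalent to name."""
    key = unicodedata.normalize("NFC", name)
    for stored_name, value in table.items():
        if unicodedata.normalize("NFC", stored_name) == key:
            return value
    return None

found = lookup(stored, requested)
```

Comparing normalized names avoids the miss, but it does not recover which spelling the sender originally used.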
 
==See also==
* [[Complex text layout]]
* [[Diacritic]]
* [[IDN homograph attack]]
* [[ISO/IEC 14651]]
* [[Ligature (typography)]]
* [[Precomposed character]]
* The [[uconv]] tool can convert to and from NFC and NFD Unicode normalization forms.
* [[Unicode]]
* [[Unicode compatibility characters]]
* [[Unicode normalization]]
 
==Notes==
{{reflist}}
 
==References==
* [http://unicode.org/reports/tr15/ Unicode Standard Annex #15: Unicode Normalization Forms]
 
==External links==
* [https://www.unicode.org/faq/normalization.html Unicode.org FAQ - Normalization]
* [http://www.w3.org/International/charlint/ Charlint - a character normalization tool] written in Perl
{{Unicode navigation}}
 
[[Category:Unicode]]
[[Category:Unicode algorithms|Equivalence]]

[[ja:Unicodeの等価性]]