Unicode equivalence

{{Short description|Aspect of the Unicode standard}}
{{Refimprove|date=November 2014}}
'''Unicode equivalence''' is the specification by the [[Unicode]] [[character (computing)|character]] encoding standard that some sequences of [[code point]]s represent essentially the same character. This feature was introduced in the standard to allow compatibility with pre-existing standard [[character set]]s, which often included similar or identical characters.
 
[[Unicode]] provides two such notions, [[canonical form|canonical]] equivalence and compatibility. [[Code point]] sequences that are defined as '''canonically equivalent''' are assumed to have the same appearance and meaning when printed or displayed. For example, the code point {{unichar|006E|Latin small letter n|nlink=N}} followed by {{unichar|0303|Combining tilde|cwith=◌|nlink=combining character}} is defined by Unicode to be canonically equivalent to the single code point {{unichar|00F1|LATIN SMALL LETTER N WITH TILDE}} of the [[Spanish alphabet]]. Therefore, those sequences should be displayed in the same manner, should be treated in the same way by applications such as [[alphabetical order|alphabetizing]] names or [[string searching|searching]], and may be substituted for each other. Similarly, each [[Hangul]] syllable block that is encoded as a single character may be equivalently encoded as a combination of a leading conjoining jamo, a vowel conjoining jamo, and, if appropriate, a trailing conjoining jamo.
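
The canonical equivalences described above can be checked with a short sketch. It assumes Python's standard <code>unicodedata</code> module; the variable names are illustrative only, and the Hangul syllable U+D55C was chosen here as an example rather than taken from the text.

```python
import unicodedata

decomposed = "n\u0303"   # U+006E followed by U+0303 COMBINING TILDE
precomposed = "\u00f1"   # U+00F1 LATIN SMALL LETTER N WITH TILDE

# The raw code point sequences differ ...
raw_equal = decomposed == precomposed

# ... but they agree once both are brought to the same normal form.
nfc_equal = unicodedata.normalize("NFC", decomposed) == precomposed
nfd_equal = unicodedata.normalize("NFD", precomposed) == decomposed

# A precomposed Hangul syllable block decomposes into conjoining jamo:
# leading consonant, vowel and (here) a trailing consonant.
syllable = "\ud55c"
jamo = unicodedata.normalize("NFD", syllable)
recomposed = unicodedata.normalize("NFC", jamo)

print(raw_equal, nfc_equal, nfd_equal)
print([f"U+{ord(ch):04X}" for ch in jamo])
```

A plain <code>==</code> comparison sees two different strings, which is why applications that sort or search text would normally normalize both sides first.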
 
===Typographical non-interaction===
Some scripts regularly use multiple combining marks that do not, in general, interact typographically, and do not have precomposed characters for the combinations. Pairs of such non-interacting marks can be stored in either order. These alternative sequences are, in general, canonically equivalent. The rules that define their sequencing in the canonical form also define whether they are considered to interact.
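
As a rough illustration of non-interacting marks, the sketch below uses Python's <code>unicodedata</code> module and the base letter "q", picked here only because it has no precomposed forms with these marks. Marks with different canonical combining classes are reordered into one canonical sequence, while marks that share a class are treated as interacting and keep their order.

```python
import unicodedata

DOT_ABOVE = "\u0307"  # COMBINING DOT ABOVE, combining class 230
DOT_BELOW = "\u0323"  # COMBINING DOT BELOW, combining class 220
ACUTE = "\u0301"      # COMBINING ACUTE ACCENT, combining class 230

classes = {
    "dot above": unicodedata.combining(DOT_ABOVE),
    "dot below": unicodedata.combining(DOT_BELOW),
    "acute": unicodedata.combining(ACUTE),
}

# Different classes: either storage order normalizes to the same sequence.
above_first = unicodedata.normalize("NFD", "q" + DOT_ABOVE + DOT_BELOW)
below_first = unicodedata.normalize("NFD", "q" + DOT_BELOW + DOT_ABOVE)
non_interacting_equal = above_first == below_first

# Same class: the two orders are not canonically equivalent.
acute_first = unicodedata.normalize("NFD", "q" + ACUTE + DOT_ABOVE)
dot_first = unicodedata.normalize("NFD", "q" + DOT_ABOVE + ACUTE)
interacting_equal = acute_first == dot_first

print(classes, non_interacting_equal, interacting_equal)
```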
 
===Typographic conventions===
Unicode provides code points for some characters or groups of characters which are modified only for aesthetic reasons (such as [[Typographic ligature|ligatures]], the [[half-width katakana]] characters, or the [[full-width]] Latin letters for use in Japanese texts), or to add new semantics without losing the original one (such as digits in [[subscript]] or [[superscript]] positions, or the circled digits (such as "①") inherited from some Japanese fonts). Such a sequence is considered compatible with the sequence of original (individual and unmodified) characters, for the benefit of applications where the appearance and added semantics are not relevant. However, the two sequences are not declared canonically equivalent, since the distinction has some semantic value and affects the rendering of the text.
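
The difference between compatibility and canonical equivalence for such characters can be sketched as follows, again assuming Python's <code>unicodedata</code> module. The sample characters are illustrative choices for each category named above, not a list taken from the standard.

```python
import unicodedata

# Each key is a "styled" character; each value is the plain sequence
# it is expected to be compatible with.
samples = {
    "\ufb01": "fi",      # LATIN SMALL LIGATURE FI
    "\uff76": "\u30ab",  # HALFWIDTH KATAKANA LETTER KA -> KATAKANA LETTER KA
    "\uff21": "A",       # FULLWIDTH LATIN CAPITAL LETTER A
    "\u2082": "2",       # SUBSCRIPT TWO
    "\u2460": "1",       # CIRCLED DIGIT ONE
}

# Canonical normalization leaves the styled characters untouched ...
unchanged_by_nfc = all(
    unicodedata.normalize("NFC", styled) == styled for styled in samples
)

# ... while compatibility normalization maps them to the plain sequence.
mapped_by_nfkc = all(
    unicodedata.normalize("NFKC", styled) == plain
    for styled, plain in samples.items()
)

print(unchanged_by_nfc, mapped_by_nfkc)
```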
 
===Encoding errors===
Unicode provides standard normalization algorithms that produce a unique (normal) code point sequence for all sequences that are equivalent; the equivalence criteria can be either canonical (NF) or compatibility (NFK). Since one can arbitrarily choose the [[representative (mathematics)|representative]] element of an [[equivalence class]], multiple canonical forms are possible for each equivalence criterion. Unicode provides two normal forms that are semantically meaningful for each of the two equivalence criteria: the composed forms NFC and NFKC, and the decomposed forms NFD and NFKD. Both the composed and decomposed forms impose a '''canonical ordering''' on the code point sequence, which is necessary for the normal forms to be unique.
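
The four normal forms can be compared on a single input. The sketch below assumes Python's <code>unicodedata</code> module; the input sequence U+1E9B U+0323 and the helper <code>code_points</code> are choices made for this illustration.

```python
import unicodedata

# LATIN SMALL LETTER LONG S WITH DOT ABOVE followed by COMBINING DOT BELOW
source = "\u1e9b\u0323"

def code_points(text):
    """Render a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)

forms = {
    name: code_points(unicodedata.normalize(name, source))
    for name in ("NFC", "NFD", "NFKC", "NFKD")
}

for name, sequence in forms.items():
    print(f"{name:4} {sequence}")
# NFC  U+1E9B U+0323
# NFD  U+017F U+0323 U+0307
# NFKC U+1E69
# NFKD U+0073 U+0323 U+0307
```

In the decomposed forms the dot below (class 220) is placed before the dot above (class 230), which shows the canonical ordering at work.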
 
To compare or search Unicode strings, software can use either composed or decomposed forms; the choice does not matter as long as it is the same for all strings involved in the search or comparison. The choice of equivalence criterion, on the other hand, can affect search results. For instance, some [[typographic ligature]]s like U+FB03 ({{char|ﬃ}}), [[Roman numerals]] like U+2168 ({{char|Ⅸ}}) and even [[Unicode subscripts and superscripts|subscripts and superscripts]], e.g. U+2075 ({{char|⁵}}), have their own Unicode code points. Canonical normalization (NF) does not affect any of these, but compatibility normalization (NFK) decomposes the ffi ligature into its constituent letters, so a search for U+0066 ({{char|f}}) as a substring succeeds in the NFKC normalization of U+FB03 but not in its NFC normalization. The same holds when searching for the Latin letter {{char|I}} (U+0049) in the precomposed Roman numeral {{char|Ⅸ}} (U+2168). Similarly, the superscript {{char|⁵}} (U+2075) is transformed to {{char|5}} (U+0035) by compatibility mapping.
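
The effect of the equivalence criterion on substring search can be reproduced with a minimal sketch. It assumes Python's <code>unicodedata</code> module, and the helper <code>found</code> is a hypothetical name introduced for this example.

```python
import unicodedata

def found(needle, haystack, form):
    """Search for needle in haystack after normalizing both to one form."""
    return unicodedata.normalize(form, needle) in unicodedata.normalize(form, haystack)

ligature = "\ufb03"       # LATIN SMALL LIGATURE FFI
roman_nine = "\u2168"     # ROMAN NUMERAL NINE
superscript5 = "\u2075"   # SUPERSCRIPT FIVE

results = {
    "f in ligature, NFC": found("f", ligature, "NFC"),
    "f in ligature, NFKC": found("f", ligature, "NFKC"),
    "I in numeral, NFC": found("I", roman_nine, "NFC"),
    "I in numeral, NFKC": found("I", roman_nine, "NFKC"),
}
superscript_mapped = unicodedata.normalize("NFKC", superscript5)

for label, hit in results.items():
    print(label, hit)
print(superscript_mapped)
```

The helper normalizes the needle and the haystack with the same form, in line with the requirement that the choice be consistent across all strings in a comparison.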
 
Transforming superscripts into baseline equivalents may not be appropriate, however, for [[rich text]] software, because the superscript information is lost in the process. To allow for this distinction, the Unicode character database contains '''compatibility formatting tags''' that provide additional details on the compatibility transformation.<ref>{{cite web|url=https://www.unicode.org/reports/tr44/#Character_Decomposition_Mappings|title=UAX #44: Unicode Character Database|publisher=Unicode.org|access-date=20 November 2014}}</ref> In the case of typographic ligatures, this tag is simply <code><compat></code>, while for the superscript it is <code><super></code>. Rich text standards like [[HTML]] take into account the compatibility tags. For instance, HTML uses its own markup to position a U+0035 in a superscript position.<ref>{{cite web|url=http://unicode.org/reports/tr20/tr20-2.html#Compatibility|title=Unicode in XML and other Markup Languages|publisher=Unicode.org|access-date=20 November 2014}}</ref>
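
The formatting tags can be inspected directly. The sketch below assumes Python's <code>unicodedata.decomposition</code>, which returns the decomposition mapping from the character database as a string; <code>compatibility_tag</code> is a hypothetical helper written for this example.

```python
import unicodedata

def compatibility_tag(ch):
    """Return the formatting tag of a compatibility mapping, or None.

    Canonical mappings carry no tag, and characters without any
    decomposition yield an empty mapping string.
    """
    mapping = unicodedata.decomposition(ch)
    if mapping.startswith("<"):
        return mapping.split(">", 1)[0][1:]
    return None

tags = {
    "ligature": compatibility_tag("\ufb03"),     # LATIN SMALL LIGATURE FFI
    "superscript": compatibility_tag("\u2075"),  # SUPERSCRIPT FIVE
    "n with tilde": compatibility_tag("\u00f1"), # canonical mapping, no tag
    "plain f": compatibility_tag("f"),           # no decomposition at all
}

print(tags)
print(unicodedata.decomposition("\u2075"))  # <super> 0035
```

A rich text converter could use the tag to decide, for example, whether to emit superscript markup around the mapped character instead of discarding the distinction.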
 
===Normal forms===