Unicode equivalence: Difference between revisions

Content deleted Content added
Normal forms: added some clarification about resons for idempotent and non-closure of normalization algorithms
Line 23:
The implementation of Unicode string searches and comparisons in text processing software must take into account the presence of equivalent code points. In the absence of this feature, users searching for a particular code point sequence would be unable to find other visually indistinguishable glyphs that have a different, but canonically equivalent, code point representation.
 
Unicode provides standard normalization algorithms that produce a unique (normal) code point sequence for all sequences that are equivalent; the equivalence criteria can be either canonical (NF) or compatibility (NFK). Since one can arbitrarily choose the representative element of an [[equivalence class]], multiple canonical forms are possible for each equivalence criteriacriterion. Unicode provides two normal forms that are semantically meaningful for each of the two compatibility criteria: the composed forms NFC and NFKC, and the decomposed forms NFD and NFKD. Both the composed and decomposed forms impose a '''canonical ordering''' on the code point sequence, which is necessary for the normal forms to be unique.
 
In order to compare or search Unicode strings, software can use either composed or decomposed forms; this choice does not matter as long as it is the same for all strings involved in a search, comparison, etc. On the other hand, the choice of equivalence criteria can affect search results. For instance some [[typographic ligatures]] like U+FB03 (ffi), [[roman numerals]] like U+2168 (Ⅸ) and even [[Unicode_subscripts_and_superscripts | subscripts and superscripts]], e.g. U+2075 (⁵) have their own Unicode code points. Canonical normalization (NF) does not affect any of these, but compatibility normalization (NFK) will decompose the ffi ligature in the constituent letters, so a search for U+0066 (f) as substring would succeed in an NFKC normalization of U+FB03 but not in NFC normalization of U+FB03. Likewise when searching for the Latin letter I (U+0049) in the precomposed Roman Numeral Ⅸ (U+2168). Similarly the superscript “⁵” (U+2075) is transformed to “5” (U+0035) by compatibility mapping.