:''it was desirable that two different strings in an existing encoding would translate to two different strings when translated to Unicode, therefore if any popular encoding had two ways of encoding the same character, Unicode needed to as well.''
AFAIK, this is only part of the story. The main problem (duplicated chars and composed/decomposed ambiguity) was not inherited from any single prior standard, but arose from the merging of multiple standards with overlapping character sets.<br/>One reason was the desire to incorporate several preexisting character sets while preserving their encoding as much as possible, to simplify migration to Unicode. Thus, for example, the ISO-Latin-1 set is included exactly in the first 256 code positions, and several other national standards (Russian, Greek, Arabic, etc.) were included as well. Some attempt was made to eliminate duplication; so, for example, European punctuation is encoded only once (mostly in the Latin-1 segment). Still, some duplicates remained, such as the ANGSTROM SIGN (originating from a set of miscellaneous symbols) and the LETTER A WITH RING ABOVE (from Latin-1). Another reason was the necessary inclusion of combining diacritics: first, to allow for all possibly useful letter-accent combinations (such as the umlaut-n used by a certain rock band) without wasting an astronomical number of code points, and, second, because several preexisting standards used the decomposed form to represent accented letters. Yet another reason was to preserve traditional encoding distinctions between typographic forms of certain letters, for example the superscript and subscript digits of Latin-1, the ligatures of PostScript, Arabic, and other typographically oriented sets, and the circled digits, half-width katakana and double-width Latin letters which had their own codes in standard Japanese charsets.<br/>All these features meant that Unicode would allow multiple encodings for identical or very similar characters, to a much greater degree than any previous standard, thus negating the main advantage of a standard and making text search a nightmare. Hence the need for the standard normal forms. Canonical equivalence was introduced to cope with the first two sources of ambiguity above, while compatibility equivalence was meant to address the last one. [[User:Jorge Stolfi|Jorge Stolfi]] ([[User talk:Jorge Stolfi|talk]]) 14:49, 16 June 2011 (UTC)
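:To illustrate the distinction described above, here is a minimal sketch using Python's standard <code>unicodedata</code> module (the particular code points are just examples): NFC resolves the canonical duplicates and the composed/decomposed ambiguity, while NFKC additionally folds compatibility variants such as superscript digits and half-width katakana.
<syntaxhighlight lang="python">
import unicodedata

# Two canonical duplicates and a decomposed spelling of the same letter:
angstrom   = "\u212B"   # ANGSTROM SIGN (from a miscellaneous-symbols set)
a_ring     = "\u00C5"   # LATIN CAPITAL LETTER A WITH RING ABOVE (Latin-1)
decomposed = "A\u030A"  # LATIN CAPITAL LETTER A + COMBINING RING ABOVE

# NFC (canonical composition) maps all three to the single code point U+00C5.
print([unicodedata.normalize("NFC", s) for s in (angstrom, a_ring, decomposed)])
# ['Å', 'Å', 'Å']

# Compatibility variants keep their identity under NFC but fold under NFKC:
print(unicodedata.normalize("NFC",  "\u00B2"))  # SUPERSCRIPT TWO stays '²'
print(unicodedata.normalize("NFKC", "\u00B2"))  # ...but becomes '2' under NFKC
print(unicodedata.normalize("NFKC", "\uFF76"))  # HALFWIDTH KATAKANA KA becomes 'カ'
</syntaxhighlight>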
:I agree it would be nice to find a source that states the exact reasons. There are better quotes in some other Unicode articles on Wikipedia. However, except for the precomposed characters, all your reasons amount to the same thing: "an existing character set had N ways of encoding this character and thus Unicode needed N ways".
:Precomposed characters were certainly driven mostly by the need to make it easy to convert existing encodings, and to make it easy to render readable output from most Unicode text. There may have been existing character sets with both precomposed and combining diacritics; if so, this would fall under the first explanation. But I doubt that would have led to the vast number of precomposed characters in Unicode. [[User:Spitzak|Spitzak]] ([[User talk:Spitzak|talk]]) 18:58, 16 June 2011 (UTC)