Talk:Unicode: Difference between revisions

m (Archiving 1 discussion(s) to Talk:Unicode/Archive 7) (bot)
|leading_zeros=0
|indexhere=yes}}
 
== Lead is simply wrong. ==
 
The offending sentence is: "The Unicode standard defines three and several other encodings exist, all in practice [[Variable-width encoding|variable-length encodings]]." Sure, you could strain to interpret that to mean "all but UTF-32", but let's keep it clear: as written, it implies that all encodings are variable length. Wikipedia's own article on UTF-32 says it is fixed length. (Because it only needs 21 of the 32 bits for Unicode code points, it is very inefficient, and rarely used, afaik.) But rarely used is not the same as "doesn't exist", and "all are variable" clearly implies it doesn't exist. I'd have to look again: are there really three variable-length Unicode encodings? I can only think of UTF-8 and UTF-16 (plus some others that afaik are not "defined" in the Unicode standard, like GB18030, or that are obsolete, like UTF-7). Replace "all" with "all common encodings" or something similar, and mention UTF-32. [[Special:Contributions/174.130.71.156|174.130.71.156]] ([[User talk:174.130.71.156|talk]]) 11:43, 15 December 2022 (UTC)
:I think the intended meaning of this was that even if ''code points'' are fixed-size, modern Unicode is effectively variable-width, because what the user thinks of as a "character" sometimes needs multiple code points. [[User:Spitzak|Spitzak]] ([[User talk:Spitzak|talk]]) 16:40, 15 December 2022 (UTC)
::Yes, Unicode includes both [[combining character]]s and [[precomposed character]]s, e.g., <{{U+|0061}} "a" latin small letter a> <{{U+|0308}} "¨" combining diaeresis> is equivalent to <{{U+|00E4}} "ä" latin small letter a with diaeresis>. Further, some glyphs exist at multiple code points for historical reasons. There is a discussion of canonical forms in the Unicode standard. --[[User:Chatul|Shmuel (Seymour J.) Metz Username:Chatul]] ([[User talk:Chatul|talk]]) 21:57, 15 December 2022 (UTC)
::It seems odd to me to describe code points as "fixed size". They're just an abstract number. It's when you ''encode'' (or store) the code points that you get variable lengths, at least for UTF-8, UTF-EBCDIC, and UTF-16 as described in the article. I think combining characters are a red herring for this discussion. [[User:Drmccreedy|DRMcCreedy]] ([[User talk:Drmccreedy|talk]]) 23:10, 15 December 2022 (UTC)
:::The Unicode standard does restrict the number of code points, so describing them as fixed-length 21-bit or 32-bit data is reasonable. [[user:Spitzak|Spitzak]] is referring to characters, which indeed are variable length, a separate issue from the length of an encoded code point that does deserve mention. --[[User:Chatul|Shmuel (Seymour J.) Metz Username:Chatul]] ([[User talk:Chatul|talk]]) 17:14, 16 December 2022 (UTC)
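For illustration, the two separate points in this thread can be checked with a rough Python sketch that uses only the standard library. The helper name <code>encoded_sizes</code> and the sample characters are made up for this example and are not taken from the article. The first part shows how many bytes one code point occupies in UTF-8, UTF-16, and UTF-32; the second part shows that a user-perceived character can be one or two code points regardless of encoding.

```python
import unicodedata

def encoded_sizes(ch):
    """Return the byte length of a single code point in three encoding forms.

    The "-le" codec variants are used so that no byte order mark is counted.
    """
    return {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}

# Point 1: length of an encoded code point.
# UTF-8 and UTF-16 vary with the code point; UTF-32 is always 4 bytes.
for ch in ("a", "\u00e4", "\u20ac", "\U0001F600"):
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))

# Point 2: combining vs. precomposed characters.
decomposed = "a\u0308"   # U+0061 followed by U+0308 COMBINING DIAERESIS
precomposed = "\u00e4"   # U+00E4 LATIN SMALL LETTER A WITH DIAERESIS

# The two strings differ as code point sequences...
print(len(decomposed), len(precomposed), decomposed == precomposed)

# ...but normalization maps each form onto the other.
print(unicodedata.normalize("NFC", decomposed) == precomposed)
print(unicodedata.normalize("NFD", precomposed) == decomposed)
```

Under these assumptions, only the second part is about what a reader perceives as a "character"; the first part is purely about encoding forms, which is the sense the lead sentence needs to get right for UTF-32.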
 
== Inline mentioning ==