Talk:Unicode/Archive 6

: <code>U+200B</code> ZERO WIDTH SPACE has the [[Unicode character property]] <code>WSpace=no</code> (not a [[whitespace character]]). <small>[[Wikipedia:WikiLove|Love]]</small>&nbsp;—[[User:LiliCharlie|LiliCharlie]]&nbsp;<small>([[User talk:LiliCharlie|talk]])</small> 22:17, 6 June 2018 (UTC)
::The cited <u>[[Unicode character property]]</u> article supports ''my'' point, including {{code|U+200B}} among the "whitespace characters without Unicode character property 'WSpace=Y'". [[User:Peter M. Brown|Peter Brown]] ([[User talk:Peter M. Brown|talk]]) 18:36, 7 June 2018 (UTC)
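The property disagreement above is easy to check mechanically. As a rough illustration (a sketch, not part of either editor's argument): Python's <code>unicodedata</code> module exposes the General_Category, and <code>str.isspace()</code> approximates whitespace behaviour; U+200B comes out as a format character (Cf), not a space separator (Zs).

```python
import unicodedata

# U+200B ZERO WIDTH SPACE: General_Category is Cf (format),
# not Zs (space separator), and str.isspace() is False.
zwsp = "\u200b"
print(unicodedata.category(zwsp))  # Cf
print(zwsp.isspace())              # False

# Compare an ordinary U+0020 SPACE:
print(unicodedata.category(" "))   # Zs
print(" ".isspace())               # True
```

(Note that <code>str.isspace()</code> is defined in terms of category and bidi class, so it is a close proxy for, not an exact implementation of, the <code>WSpace</code> property.)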
 
== Suggestion for changing the lede ==
 
I have a couple of problems with the last paragraph (as of Mar 3, 2016) of the lede (lead). First, it continues to talk about UCS-2. UCS-2 IS OBSOLETE and it says so. So, why is it used as an example?
It is poor pedagogy to explain an obsolete system and then compare an active system to it. Currently, the paragraph reads:
"Unicode can be implemented by different character encodings. The most commonly used encodings are UTF-8, UTF-16 and the now-obsolete UCS-2. UTF-8 uses one byte for any ASCII character, all of which have the same code values in both UTF-8 and ASCII encoding, and up to four bytes for other characters. UCS-2 uses a 16-bit code unit (two 8-bit bytes) for each character but cannot encode every character in the current Unicode standard. UTF-16 extends UCS-2, using one 16-bit unit for the characters that were representable in UCS-2 and two 16-bit units (4 × 8 bit) to handle each of the additional characters."
The text "Unicode can be implemented" is a hyperlink to the article "Comparison of Unicode encodings".
The hyperlink should be removed and a reference used, probably "[see Comparison of Unicode encodings]". This first sentence is terrible. It is not true that Unicode can be implemented by different encodings, in the sense that an encoding is NOT an implementation. Also: I don't think Unicode 8 is fully implemented by ANY program, anywhere. Unicode's codepoints ARE (not "can be") commonly encoded using UTF-8 and UTF-16. I suggest the following: "Unicode's codepoints are commonly encoded using UTF-8 and UTF-16. Other encodings, such as the now-obsolete UCS-2 or the Anglo-centric ASCII may also be encountered (ASCII defines 95 printable characters, UCS-2 allows up to 65,536 code points). Both UTF-8 and UTF-16 use a variable number of bytes for the codepoint they represent: UTF-8 uses between 1 and 4 bytes and UTF-16 uses either 2 or 4 bytes. Since 2007, when it surpassed ASCII, UTF-8 has been the dominant encoding of the World Wide Web with an estimated 86% of all web pages using it as of January 2016." [[User:Abitslow|Abitslow]] ([[User talk:Abitslow|talk]]) 22:47, 3 March 2016 (UTC)
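The byte counts discussed in the quoted paragraph can be demonstrated directly. A minimal sketch (Python standard library only; the sample characters are illustrative choices, not from the article text):

```python
# Byte lengths of one codepoint under UTF-8 and UTF-16,
# for codepoints drawn from different ranges.
samples = [
    ("A",            0x41),     # ASCII: 1 byte in UTF-8
    ("\u00e9",       0xE9),     # Latin-1 range: 2 bytes in UTF-8
    ("\u20ac",       0x20AC),   # BMP: 3 bytes in UTF-8, 2 in UTF-16
    ("\U0001D11E",   0x1D11E),  # beyond the BMP: 4 bytes in both
]
for ch, cp in samples:
    u8 = ch.encode("utf-8")
    u16 = ch.encode("utf-16-be")  # big-endian, no BOM, so lengths are exact
    print(f"U+{cp:05X}: UTF-8 {len(u8)} bytes, UTF-16 {len(u16)} bytes")
```

This shows UTF-8 ranging over 1–4 bytes and UTF-16 over 2 or 4, as the proposed wording states.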
 
I have never seen any Unicode other than UTF-8 (servers) and UTF-32 (JavaScript, and Python "unicode" objects). Shouldn't those two be listed as the two most popular forms? Basically you use UTF-8 unless you want to index individual characters; then you use UTF-32 in those special cases. Isn't that pretty much the whole story right now? And then UTF-16 is of historical interest for Windows NT.
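The indexing point above can be made concrete. A hypothetical sketch (Python, standard library only): with a fixed-width form such as UTF-32, the n-th character is a fixed-offset slice; with UTF-8 bytes, byte index and character index diverge as soon as any multi-byte character appears.

```python
text = "na\u00efve \U0001D11E clef"  # 12 codepoints, more bytes in UTF-8

# Python 3 str indexes by codepoint, so text[6] is the musical clef:
print(text[6] == "\U0001D11E")        # True

# Raw UTF-8 is variable-width: 12 codepoints but 16 bytes here,
# so byte offsets no longer line up with character positions.
print(len(text), len(text.encode("utf-8")))

# UTF-32 is fixed-width: character i lives at byte offset 4*i.
u32 = text.encode("utf-32-be")
i = 6
print(u32[4 * i : 4 * i + 4].decode("utf-32-be") == "\U0001D11E")  # True
```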
 
: Java is firmly 16-bit for characters, and every version of Windows since XP has been Windows NT, even if they don't call it that. C# and .NET use UTF-16, as well. What's most frequent is hard to tell, and depends on what you're measuring.--[[User:Prosfilaes|Prosfilaes]] ([[User talk:Prosfilaes|talk]]) 08:15, 28 September 2017 (UTC)
 
:16-bit code units are used plenty on Windows: the system API uses them, filenames in the newer filesystems use them, and many text files are written this way (though that is rapidly becoming rarer). Note there is a lot of confusion about whether Windows supports UTF-16 or UCS-2. Some software is "unaware" of UTF-16, but this does not mean it won't "work" with it. This is exactly the same reason that code designed for ASCII "works" with UTF-8. If all the unknown sequences are copied unchanged from input to output then it "works" by any practical definition. Unfortunately a lot of people think that unless the program contains code to actively parse multi-code-unit characters, or even goes so far as to apply some special meaning to a subset of those characters, it is somehow "broken" for that encoding and "does not support it", but that is a totally useless definition as it has nothing to do with whether it will actually fail. Therefore I think it is fine to clearly say "Windows uses UTF-16". [[User:Spitzak|Spitzak]] ([[User talk:Spitzak|talk]]) 19:30, 28 September 2017 (UTC)
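The pass-through argument above can be sketched concretely. A supplementary-plane character is two 16-bit code units (a surrogate pair) in UTF-16; code that merely copies code units, without parsing pairs, still round-trips it intact. An illustrative sketch (Python; the copy loop stands in for hypothetical "UCS-2-aware" software):

```python
# U+10400 DESERET CAPITAL LETTER LONG I lies beyond the BMP,
# so UTF-16 encodes it as a surrogate pair.
ch = "\U00010400"
units = ch.encode("utf-16-be")
high = int.from_bytes(units[0:2], "big")
low = int.from_bytes(units[2:4], "big")
print(f"{high:04X} {low:04X}")  # D801 DC00

# Code that blindly copies 16-bit units (no surrogate handling at all)
# still preserves the character end to end:
copied = b"".join(units[i : i + 2] for i in range(0, len(units), 2))
print(copied.decode("utf-16-be") == ch)  # True
```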
 
::It strikes me that UCS-2 is not an encoding of the entire Unicode code space, but only of a subset (likewise for ASCII). As encodings of subsets, both are special in that they match their Unicode subset not only in order but in the numerical value of each character code. While the subset of Unicode covered by UCS-2 matches that of Unicode 1.1 in size, the incompatible change in Hangul encoding in Unicode 2.0 means that UCS-2, if understood as matching the post-2.0 layout up to U+FFFF, is not a complete encoding of any version of Unicode. It seems to me that this distinction should be the basis for a reformulation that prioritizes encodings covering all of Unicode. [[User:Ablaut490|Ablaut490]] ([[User talk:Ablaut490|talk]]) 00:25, 24 December 2018 (UTC)