Talk:Unicode/Archive 6: Difference between revisions

Revision as of 00:09, 10 April 2018 edit Lowercase sigmabot III (talk \| contribs) Bots, Page movers 2,449,026 edits Archiving 26 discussion(s) from Talk:Unicode) (bot		Latest revision as of 14:43, 4 March 2023 edit undo Jonesey95 (talk \| contribs) Autopatrolled, Extended confirmed users, Page movers, Mass message senders, Template editors 410,716 edits Fix Linter errors.
(16 intermediate revisions by 5 users not shown)
Line 42:
:Yeah, somewhere, just not on Wikipedia. See [[WP:ELNO]]. [[User:BabelStone|BabelStone]] ([[User talk:BabelStone|talk]]) 10:06, 11 March 2010 (UTC)

==Formatting References==
I've taken to formatting some of the bare URLs here, using the templates from [[WP:CT]]. <i>[[User:Omirocksthisworld|<b style="color:#32B430;">Omirocksthisworld</b>]]</i>([[User talk:Omirocksthisworld|<span style="color:#1A74E2;">Drop a line</span>]]) 21:25, 16 March 2010 (UTC)

== Unicode block names capitalization (Rename and Move) ==

Line 153:
: If you find reliable sources for criticism or even discussions for/against Unicode, feel free to add the material. However, criticism sections are not mandatory. There is none in the [[Oxygen]] article for example. --[[User:Mlewan|Mlewan]] ([[User talk:Mlewan|talk]]) 18:11, 26 September 2013 (UTC)
:: You are obviously joking. There must be sources, as files in Unicode format take twice as much size as ANSI ones, and you cannot use simple table lookup algorithms anymore. This information is just waiting for someone speaking English to make it public. [[Special:Contributions/178.49.18.203|178.49.18.203]] ([[User talk:178.49.18.203|talk]]) 11:38, 27 September 2013 (UTC)
:::I'm sorry, but you are mistaken. You are confusing scalar values with encodings. In Unicode, these are completely different entities. The UTF-8 byte value of {{#invoke:Unicode convert|getUTF8|10A05}} is identical to UTF-16 {{#invoke:Unicode convert|getUTF16|10A05}}, which are both encodings of U+10A05. When you get down to things like Z - U+005A, the UTF-8 ends up as a single byte: {{#invoke:Unicode convert|getUTF8|005A}}, taking up exactly as much disk space as its ANSI encoding. The fact that it has a four digit scalar value is irrelevant to how much room it takes on disk.
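As a rough illustration of the point about scalar values versus encoded size, the following Python sketch counts the bytes one code point occupies in UTF-8 and UTF-16. It assumes that "ANSI" here means Windows-1252 (the thread does not say which code page is meant), and the helper name `encoded_lengths` is made up for this example.

```python
def encoded_lengths(code_point, legacy_codec=None):
    """Return how many bytes a single code point needs in several encodings."""
    ch = chr(code_point)
    sizes = {
        "utf-8": len(ch.encode("utf-8")),
        # The "-be" variant is used so that no byte-order mark is counted.
        "utf-16": len(ch.encode("utf-16-be")),
    }
    if legacy_codec is not None:
        # Assumption: the caller names a legacy code page that contains the character.
        sizes[legacy_codec] = len(ch.encode(legacy_codec))
    return sizes


print(encoded_lengths(0x005A, "cp1252"))  # {'utf-8': 1, 'utf-16': 2, 'cp1252': 1}
print(encoded_lengths(0x10A05))           # {'utf-8': 4, 'utf-16': 4}
```

The number of hex digits in the scalar value plays no role: U+005A is one byte in both UTF-8 and Windows-1252, while U+10A05 takes four bytes in UTF-8 and four in UTF-16 (a surrogate pair).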
Stateful encodings like BOCU and SCSU can bring this efficiency in data storage to every script, and multi-script documents can actually end up with smaller file sizes than in legacy encodings. [[User:Vanisaac\|Van]][[User talk:Vanisaac\|Isaac]]<sub><small>[[WP:WikiProject Writing systems\|WS]] [[WP:WikiProject Heraldry and vexillology\|Vex]]</small></sub><sup style="margin-left:-7.0ex">[[Special:Contributions/Vanisaac\|contribs]]</sup> 13:34, 27 September 2013 (UTC) :::: Stateful encodings are not generally useful. On the other hand, the requirement to represent, let's say, letter А as 1040 instead of some sane value like 192, and implement complex algorithms to make the lookup over 2M characters' size tables possible. And the requirement to use complex algorithms for needs of obscure scripts. It is clearly a demarch to undermine software development in 2nd/3rd world countries, as 1st world ones can simply roundtrip that Unicode hassle with trivial solutions. For the first world, 1 character is always 1 byte, like it always was. [[Special:Contributions/178.49.18.203\|178.49.18.203]] ([[User talk:178.49.18.203\|talk]]) 11:55, 28 September 2013 (UTC) Line 310: এরা আপনার হৃদয়কে সারাজীবন আলোড়ীত করবে। তাই এদের সঙ্গ কখরো ত্যাগ করবেন না। খারাপ বন্ধু তা যতোই কাছের হোক না কেন, ত্যাগ করুন। নাহলে খারাপ চিন্তা আপনাকে আক্রান্ত করবে। মনে রাখবেন, ভাল চিন্তার চেয়ে খারাপ চিন্তাই মানুষকে বেশি আকর্ষন করে। <small><span class="autosigned">—Preceding [[Wikipedia:Signatures\|unsigned]] comment added by [[User:Monitobd\|Monitobd]] ([[User talk:Monitobd\|talk]] • [[Special:Contributions/Monitobd\|contribs]]) 12:34, 13 January 2010 (UTC)</span></small><!-- Template:Unsigned --> <!--Autosigned by SineBot--> :This isn't [[Devanagari]] ([[Hindi]]). I used script recognition software to find out what language this is, and apparently it's "[[Bishnupriya Manipuri]]". Can anyone read it? I searched everywhere, and there's not a single online translator. Should I just ignore it... 
[[User:Indigochild777|'''<span style="font-family:Vivaldi; font-size:large;"><span style="color:#000000;">Ind</span><span style="color:#770000;">igo</span><span style="color:#BB0000;">child </span></span>''']] 01:42, 12 April 2010 (UTC)

It is Bengali (Bangla, India). <span style="font-size: smaller;" class="autosigned">— Preceding [[Wikipedia:Signatures|unsigned]] comment added by [[Special:Contributions/112.133.214.254|112.133.214.254]] ([[User talk:112.133.214.254|talk]]) 06:02, 2 January 2013 (UTC)</span><!-- Template:Unsigned IP --> <!--Autosigned by SineBot-->

Line 330:
:Maybe the topic of different semantics despite ±identical representations deserves an entire section of its own, as such characters have led to severe security concerns. For example, a fake URL [https://Μісrоsоft.com https://Μісrоsоft.com] with a mix of Latin, Greek and Cyrillic letters has to be prevented from being registered, as it might be visually indistinguishable from the all-Latin [https://Microsoft.com https://Microsoft.com]. <small>[[Wikipedia:WikiLove|Love]]</small> —[[:commons:User:LiliCharlie|LiliCharlie]] <small>([[User talk:LiliCharlie|talk]])</small> 19:40, 9 March 2016 (UTC)
{{Clear}}

== vandalism? ==
the edit on 20:52, 21 May 2010 by 188.249.3.139 shouldn't be reverted? <span style="font-size: smaller;" class="autosigned">—Preceding [[Wikipedia:Signatures|unsigned]] comment added by [[Special:Contributions/193.226.6.227|193.226.6.227]] ([[User talk:193.226.6.227|talk]]) </span><!-- Template:UnsignedIP --> <!--Autosigned by SineBot-->

== Recent changes list ==
{{Recent changes in Unicode}}
For your userpage: {{tlx|Recent changes in Unicode}}
{{clear}}
: -[[User:DePiep|DePiep]] ([[User talk:DePiep|talk]]) 22:05, 2 December 2014 (UTC)
* '''Updated'''. We could use a template that marks Unicode articles.
-[[User:DePiep\|DePiep]] ([[User talk:DePiep\|talk]]) 23:43, 25 May 2016 (UTC) ==Writing Systems still unable to viewed properly in Unicode== As of September 2016, however, Unicode is unable to properly display the fonts by default for the following unicode writing systems on most browsers (namely, [[Microsoft Edge]], [[Internet Explorer]], [[Google Chrome]] and [[Mozilla Firefox]]): {{largediv\| [[Balinese alphabet]] (ᬅᬓ᭄ᬱᬭᬩᬮᬶ) [[Batak alphabet]] (ᯘᯮᯮᯒᯖ᯲ ᯅᯖᯂ᯲, also used for the Karo, Simalungun, Pakpak and Angkola-Mandailing languages) [[Baybayin script]] (ᜊᜌ᜔ᜊᜌᜒᜈ᜔) [[Chakma script]] (𑄇𑄳𑄡𑄈𑄳𑄡 𑄉𑄳𑄡) [[Hanunó'o alphabet]] (ᜱᜨᜳᜨᜳᜢ) [[Limbu script]] (ᤔᤠᤱᤜᤢᤵ) [[Pollard script]] (𖼀𖼁𖼂𖼃𖼄𖼅𖼆𖼇) [[Saurashtra script]] (ꢱꣃꢬꢯ꣄ꢡ꣄ꢬ) [[Sharada script]] (𑆐𑆑𑆒𑆓𑆔𑆕𑆖𑆗𑆘) [[Sundanese script]] (ᮃᮊ᮪ᮞᮛ ᮞᮥᮔ᮪ᮓ) [[Sylheti Nagari]] (ꠍꠤꠟꠐꠤ ꠘꠣꠉꠞꠤ) [[Tai Tham alphabet]] (ᨲ᩠ᩅᩫᨾᩮᩥᩬᨦ) }} Prior to Windows 7, scripts such Burmese (မြန်မာဘာသာ), Khmer (ភាសាខ្មែរ), Lontara (ᨒᨚᨈᨑ), Cherokee (ᎠᏂᏴᏫᏯ), Coptic (ϯⲙⲉⲧⲣⲉⲙⲛ̀ⲭⲏⲙⲓ), Glagolitic (Ⰳⰾⰰⰳⱁⰾⰻⱌⰰ), Gothic (𐌲𐌿𐍄𐌹𐍃𐌺), Cunneiform (𐎨𐎡𐏁𐎱𐎡𐏁), Phags-pa (ꡖꡍꡂꡛ ꡌ), Traditional Mongolian (ᠮᠣᠨᠭᠭᠣᠯ ), Tibetan (ལྷ་སའི་སྐད་), Odia alphabet (ଓଡ଼ିଆ ) also had this font display issue but have since been resolved (ie. can now be 'seen' on most browsers). Could someone also enable these fonts to be visible on Wikipedia browsers? --[[User:Sechlainn\|Sechlainn]] ([[User talk:Sechlainn\|talk]]) 02:23, 29 September 2016 (UTC) : I don't know what you mean. Unicode is the underlying standard that makes it possible to use those scripts at all. Properly showing the texts is a matter of operating system, fonts and web browser. Even just OS and browser isn't good enough; what language packs and fonts are installed are important. There's nothing that anyone can in general do here.--[[User:Prosfilaes\|Prosfilaes]] ([[User talk:Prosfilaes\|talk]]) 02:49, 29 September 2016 (UTC) :: {{ping\|Sechlainn}} 1. Please [[Wikipedia:No original research\|'''do not engage in original research''']]. — 2. 
Unicode is not intended to “display the fonts.” — 3. These are Unicode scripts, not writing systems. — 4. I can view all of the above except Sharada on my Firefox. — 5. There is no such thing as “Wikipedia browsers.” <small>[[Wikipedia:WikiLove\|Love]]</small> —[[:commons:User:LiliCharlie\|LiliCharlie]] <small>([[User talk:LiliCharlie\|talk]])</small> 03:02, 29 September 2016 (UTC) == Unicode 10.0 == This version has just been released today, can you add information for this into the article? Proof from Emojipedia [[Special:Contributions/86.22.8.235\|86.22.8.235]] ([[User talk:86.22.8.235\|talk]]) 12:03, 20 June 2017 (UTC) :I haven't seen anything on the Unicode site (http://www.unicode.org/) but will keep an eye out for an official announcement that 10.0 has been released. [[User:Drmccreedy\|DRMcCreedy]] ([[User talk:Drmccreedy\|talk]]) 18:01, 20 June 2017 (UTC) ::Version 10.0 now shows up as the latest version at http://www.unicode.org/standard/standard.html [[User:Drmccreedy\|DRMcCreedy]] ([[User talk:Drmccreedy\|talk]]) 18:44, 20 June 2017 (UTC) :::And the [http://unicode.org/Public/UNIDATA/ data files] have been updated, so I think we can start updating Wikipedia now. [[User:BabelStone\|BabelStone]] ([[User talk:BabelStone\|talk]]) 19:23, 20 June 2017 (UTC) == "Presentation forms" == Can someone explain to me what a "presentation form" is? I can't find an answer anywhere. [[User:Pariah24\|Pariah24]] ([[User talk:Pariah24\|talk]]) 11:19, 10 September 2017 (UTC) :Nevermind; I found [http://unicode.org/faq/ligature_digraph.html this] [[User:Pariah24\|Pariah24]] ([[User talk:Pariah24\|talk]]) 11:23, 10 September 2017 (UTC) == Is there a unicode symbol for "still mode"? == I mean this symbol: https://www.iso.org/obp/ui#iec:grs:60417:5554 [[User:Seelentau\|Seelentau]] ([[User talk:Seelentau\|talk]]) 18:16, 12 January 2018 (UTC) :It seems not. 
[[User:BabelStone|BabelStone]] ([[User talk:BabelStone|talk]]) 19:03, 12 January 2018 (UTC)

== Two things this STILL does poorly ==
First it still reads like a technical manual written by experts for experts. It still refuses to explain, upfront, what a codepoint is. The related concepts of character, glyph, as well as the fonts involved all need to be discussed, imho. It should be made clear in the lead that Unicode has numerous failures: it is unable to correct past mistakes, and is (and will almost certainly continue to be) limited by political pressure (including by sovereign states such as China and N. Korea). Some of what is in the Unicode standard is there due to political concession, and of course all of it is there due to decisions made by committee(s). That's one thing. The other is the article's virtually complete failure to tackle the Windows operating system, which is far-and-away the dominant OS in the world. Windows does not handle Unicode. In order for an application, be it a web browser or a spell-checker or a chat app, to handle Unicode, it has to work around the Windows character tables. (Of course, if the article doesn't explain what the difference is between a codepoint and a character (or "wide-character"), then you've failed before you begin.) I think, and propose, that at the LEAST, a section under "Issues" should be created to simply state that, despite Microsoft's continued deceptive and misleading claims about its support for Unicode, it and its Windows OS do not directly support Unicode.
(Microsoft's Word has impressive support, but still contains large omissions of the 136,000 codepoints.)[[Special:Contributions/75.90.36.201\|75.90.36.201]] ([[User talk:75.90.36.201\|talk]]) 20:22, 9 April 2018 (UTC) :Bizarre and totally incorrect statement about Microsoft Windows not directly supporting Unicode. Of course Windows (excluding obsolete W95, W98 and ME) directly and natively supports Unicode, and no Unicode-aware application running on Windows needs to "work around the Windows character tables". [[User:BabelStone\|BabelStone]] ([[User talk:BabelStone\|talk]]) 10:26, 10 April 2018 (UTC) ::Windows does not support UTF-8 or any other coverage of Unicode in the 8-bit api, which means standard functions to open or list files do not work for filenames with Unicode in them. This makes it impossible to write portable software using the standard functions that works with Unicode filenames, therefore Windows does not support Unicode.[[User:Spitzak\|Spitzak]] ([[User talk:Spitzak\|talk]]) 21:34, 10 April 2018 (UTC) ::: Does C# even support that "8-bit API"? What do you mean by "portable software"? I would note the POSIX standard doesn't support Unicode in file names either; only A-Za-z0-9, hyphen, period and underscore can be used in portable POSIX filenames. And non-POSIX MacOS/Plan 9/BeOS programs aren't portable, so I believe it is impossible to write portable software using Unicode.--[[User:Prosfilaes\|Prosfilaes]] ([[User talk:Prosfilaes\|talk]]) 23:18, 10 April 2018 (UTC) ::::By "portable" I mean "source code that works on more than one platform", stop trying to redefine it as "every computer ever invented in history". Modern C/C++ compilers will preserve the 8-bit values in quoted strings and thus preserve UTF-8. Only VC++ is broken here, though you can outwit it by claiming that the source code is not Unicode (???!). POSIX allows all byte values other than '/' and null in a filename and thus allows UTF-8. 
POSIX does go way off course when discussing shell quoting syntax and you are right it disallows some byte values.[[User:Spitzak\|Spitzak]] ([[User talk:Spitzak\|talk]]) 01:32, 11 April 2018 (UTC) ::::Oh and OS/X works exactly as I have stated, it in fact has some of the best Unicode support, though their insistence on normalizing the filenames rather than just preserving the byte sequence is a bit problematic. But at least all the software knows the filenames are UTF-8.[[User:Spitzak\|Spitzak]] ([[User talk:Spitzak\|talk]]) 01:34, 11 April 2018 (UTC) ::::: What do you mean by more than one platform? I have no reason to believe that C# doesn't support Unicode filenames, and thus every version of Windows NT since 4.0 (and thus every version of Windows since XP) supports portable code using Unicode filenames. If any non-POSIX MacOS X program is "portable", then so is a C# program targeting NET 1.0. ::::: POSIX does not allow "all byte values other than '/' and null in a filename"; to quote David Wheeler here, https://www.dwheeler.com/essays/fixing-unix-linux-filenames.html says ::::::: For a filename to be portable across implementations conforming to POSIX.1-2008, it shall consist only of the portable filename character set as defined in Portable Filename Character Set. Portable filenames shall not have the <hyphen> character as the first character since this may cause problems when filenames are passed as command line arguments. :::::: I then examined the Portable Filename Character Set, defined in 3.276 (“Portable Filename Character Set”); this turns out to be just A-Z, a-z, 0-9, <period>, <underscore>, and <hyphen> (aka the dash character). So it’s perfectly okay for a POSIX system to reject a non-portable filename due to it having “odd” characters or a leading hyphen. ::::: If strictly following IEEE 1003, the only major operating system standard, is important to you, filenames shall come only from that set of 65 characters. 
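The portable filename rule quoted above is small enough to check mechanically. What follows is a minimal Python sketch, not anything taken from POSIX itself: the function name `is_portable_filename` is hypothetical, and it tests only the 65-character set and the leading-hyphen rule, ignoring length limits and any other constraints.

```python
import string

# A-Z, a-z, 0-9, period, underscore and hyphen: 26 + 26 + 10 + 3 = 65 characters.
PORTABLE_CHARS = frozenset(string.ascii_letters + string.digits + "._-")


def is_portable_filename(name):
    """Check a single filename (not a path) against the portable character set."""
    if not name or name.startswith("-"):
        return False
    return all(ch in PORTABLE_CHARS for ch in name)


print(len(PORTABLE_CHARS))                              # 65
print(is_portable_filename("report_2018.txt"))          # True
print(is_portable_filename("-rf"))                      # False
print(is_portable_filename("stringWithUnicodeInIt.æ"))  # False
```

Under this strict reading, any filename with a non-ASCII character is non-portable, regardless of the operating system or encoding involved.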
In practice it's better, but a program strictly conforming to the standard is so limited.
::::: So as far as I can tell, Windows is in the same boat as everyone else.--[[User:Prosfilaes|Prosfilaes]] ([[User talk:Prosfilaes|talk]]) 22:30, 11 April 2018 (UTC)
::::::You are continuing to insist that "portable" means "it works exactly the same on every single computer ever made", while I am going by the more popular definition of "it works on more than one computer". If you insist on such silly impossible requirements it is obvious you are refusing to admit you are wrong.[[User:Spitzak|Spitzak]] ([[User talk:Spitzak|talk]]) 00:55, 12 April 2018 (UTC)
::::::: Portable, as strictly conforming to the POSIX standard, would be nice. Portable, as in running on multiple operating systems, is more realistic. Portable, as in running on multiple versions of the same OS, is barely passable. "It works on more than one computer" is not the "more popular" definition, as short of being tied into specialized one-off hardware like Deep Blue, you can always image the drive and load it into an emulator on another system.--[[User:Prosfilaes|Prosfilaes]] ([[User talk:Prosfilaes|talk]]) 19:08, 12 April 2018 (UTC)
::::::::As I note below, your "POSIX" complaint is actually entirely backwards. It REDUCES the number of filenames possible on some systems, therefore it has no effect on the fact that fopen() on Unix can open all files, but cannot on Windows.[[User:Spitzak|Spitzak]] ([[User talk:Spitzak|talk]]) 17:07, 13 April 2018 (UTC)
: I think you have a point about the way we mention codepoints in the opening.
: As for numerous failures, "Unicode is a computing industry standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems." If you understand what that says, it tells you that this is a committee project that pays a price for backward compatibility and works with the user community.
To compare and contrast, the TRON character encoding doesn't work with sovereign states like China; devoid of such political pressure, it doesn't support Zhuang or Cantonese written in Han characters. Without multinational committees and political pressure, its support of anything that's not Japanese is half-assed and generally copied from Unicode.
: There are computer projects that work on the benevolent dictator standard, like the Linux kernel and Python. But I don't know of any that don't center around one chunk of source code, that involve an abstract standard with multiple equal implementations. If you have a seriously complex project, like encoding all of human writing, and it's going to be core for Microsoft and Apple and Google and Oracle, it's going to be a committee that responds to political needs. And standards are really interesting only if Microsoft and Apple and Google and Oracle care; stuff like Dart and C# may technically be standards, but users use the Google tools for Dart and follow what Google puts out, and likewise for Microsoft and C#. (Or SQL, where there is a standard with multiple implementations, but one still has to learn MSSQL and Oracle Database and MySQL separately. I don't know if that's an under-specified standard, or companies just ignoring it, but certainly the solution is not listening to the companies less.)--[[User:Prosfilaes|Prosfilaes]] ([[User talk:Prosfilaes|talk]]) 23:18, 10 April 2018 (UTC)
:::fopen("stringWithUnicodeInIt.æ") does not do what anybody wants on Windows. On Linux it works.
Therefore support for Unicode is better on Linux than Windows, which is not very impressive for Windows...[[User:Spitzak|Spitzak]] ([[User talk:Spitzak|talk]]) 01:25, 11 April 2018 (UTC)
:::: On Linux it may work; but "filenames" in Linux aren't names, they aren't strings, they're arbitrary byte-sequences that don't include 00h or 2Fh, and the most reasonable interpretation of the filename as a string may require choosing a character set on a per-filename basis; in worst-case scenarios, say a user in locale zh-TW mass renamed a bunch of files to start with 檔案 (archive), without paying attention to the fact they were named by a user in locale fr-FR.ISO8859-1 ("archivé"), you can end up with a byte string that makes no sense under any single character set.
:::: To put it shorter, that may work in Linux, but it may also fail to open a file with a user-visible name of "stringWithUnicodeInIt.æ", depending on locale settings.
:::::No, that æ is the UTF-8 byte sequence for that character and it is unaffected by the "locale" and therefore it always works.[[User:Spitzak|Spitzak]] ([[User talk:Spitzak|talk]]) 17:07, 13 April 2018 (UTC)
:::: Not to mention that judging Windows by C alone is unfair and silly; why not C# or Python or other languages?--[[User:Prosfilaes|Prosfilaes]] ([[User talk:Prosfilaes|talk]]) 22:30, 11 April 2018 (UTC)
:::::Because C# api is a Microsoft development and they wrote the Linux version, and thus any failings are their fault.[[User:Spitzak|Spitzak]] ([[User talk:Spitzak|talk]]) 17:07, 13 April 2018 (UTC)
:::::Oddly enough while you insist that C code work on EVERY SINGLE COMPUTER EVER MADE, you seem to think "limit programming languages to C#" is A-OK. You are weird. And that file will open on Linux no matter what the "locale" is set to, that is the point. The filename is a string of bytes, just like you describe, and never ever should be dependent on "locale". Linux gets this right, Windows does not.
Sorry.[[User:Spitzak|Spitzak]] ([[User talk:Spitzak|talk]]) 00:55, 12 April 2018 (UTC)
:::::: You chose the example. "stringWithUnicodeInIt.æ" is not a string of bytes; it is a string of characters. Said characters map to bytes in various ways, and provided (and this is far from guaranteed) the compiler and the system and the user that created the file all agree on UTF-8, fopen will work. If one of them is thinking in Latin-1, and the charset of the system, filename, and compiler can all be changed independently, then it will not work.--[[User:Prosfilaes|Prosfilaes]] ([[User talk:Prosfilaes|talk]]) 19:08, 12 April 2018 (UTC)
:::::::No, that string has 2 bytes at the end that are the UTF-8 encoding. Any system (such as C# I guess) that turns it into a different byte string is broken.[[User:Spitzak|Spitzak]] ([[User talk:Spitzak|talk]]) 17:07, 13 April 2018 (UTC)

The problem with Windows is not being understood correctly. On Windows filesystems such as NTFS, filenames are arbitrary sequences of 16-bit words (technically UTF-16 but they allow invalid UTF-16 with unpaired surrogates and they disallow some valid UTF-16). The problem is that the equivalent of open(char* filename) on Windows, which is used by virtually every portable library including the C and C++ libraries written by Microsoft, cannot open every possible file, as there are patterns of 16-bit words that cannot be achieved by any 8-bit string. This makes it impossible, for instance, to write a piece of software that opens an arbitrary file chosen by the user. An obvious fix is to make open(char*) use a translator that can produce all valid sequences of 16-bit words, and have readdir and similar functions do the opposite translation. And Windows already provides a way to change this translation, yet it refuses to allow a setting that will work (a correct setting would be UTF-8 but also allow unpaired surrogates).
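For what it's worth, the kind of translation described above (UTF-8 extended so that unpaired surrogates survive) can be sketched in Python with the standard `surrogatepass` error handler. This is only an illustration of the idea, not a description of how Windows or any C runtime behaves, and both function names are invented for the example.

```python
def units_to_bytes(units):
    """Turn a sequence of 16-bit code units into an 8-bit string.

    Valid surrogate pairs become ordinary 4-byte UTF-8; a lone surrogate
    is passed through as a 3-byte sequence instead of being rejected.
    """
    raw = b"".join(u.to_bytes(2, "little") for u in units)
    text = raw.decode("utf-16-le", "surrogatepass")
    return text.encode("utf-8", "surrogatepass")


def bytes_to_units(data):
    """Inverse direction: 8-bit string back to 16-bit code units."""
    text = data.decode("utf-8", "surrogatepass")
    raw = text.encode("utf-16-le", "surrogatepass")
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]


name = [0x0061, 0xD800, 0x00E6]         # "a", an unpaired high surrogate, "æ"
encoded = units_to_bytes(name)
assert bytes_to_units(encoded) == name  # the odd filename round-trips

try:
    "a\ud800".encode("utf-8")            # strict UTF-8 refuses the same text
except UnicodeEncodeError:
    print("strict UTF-8 cannot represent an unpaired surrogate")
```

In this sketch every sequence of 16-bit units has some 8-bit spelling, which is the property the comment says the Windows narrow-character file functions lack.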
The end result, which is quite obvious to anybody working with large multi-platform setups, is that you are restricted to ASCII-only filenames everywhere. This is a convincing argument for many that Windows does not support Unicode.

On Unix filenames are sequences of 8-bit bytes, and it is a POSIX requirement that the 8-bit sequence passed to open(char*) be used unchanged to match the filename, so you can in fact name all possible files. POSIX requirements that a certain subset of ASCII is required to work in filenames only reduces the set of filenames if a system chooses to disallow bytes outside that set, you can still name all possible files and quite a few disallowed files using open(char*), so mentioning that is a red herring.

C# on Unix is almost certainly using UTF-16 strings in their open() api, and are using a brain-dead converter to 8-bit strings. Their converter likely is using the "locale" to convert some unpredictable 256-code-point subset to bytes and ignoring the others. As they are in charge of the reverse converter to UTF-16 there is no reason at all to do this, they should use some fixed loss-less variation of UTF-8 in both directions. Python-3 and Qt do this which makes it work much better (but far from perfect as they botch up filenames containing invalid UTF-8, these complications are why use of UTF-16 is strongly discouraged by many). Spitzak (talk) 18:35, 12 April 2018 (UTC)
: Unix disallows some valid UTF-8 strings, like any including '/' or '\0'. So what? I gave you chapter and verse above where it is not a POSIX requirement to support any 8-bit sequence passed to open; that in fact the only sequences that POSIX requires a system to handle come from a 65-character subset of ASCII (and even then, no hyphens at the start of filenames).
::Okay, I am going to try to make this clear, as your convoluted arguments got me confused as well as you. Let's say there is a system that does not allow 'Z' in a filename.
Does it somehow mean that fopen() cannot open all files? NO!!!! You can still send a 'Z' to fopen and it will cause an error. You can also still send all the valid strings that don't contain 'Z'. Now lets say that there is an fopen() call that removes any 'Z' from the string, despite the fact that 'Z' is allowed in a filename. Now you can no longer open all possible files, a very serious problem! Your complaint about POSIX is the first thing I describe. The problem with Windows is the second one.[[User:Spitzak\|Spitzak]] ([[User talk:Spitzak\|talk]]) 17:07, 13 April 2018 (UTC) : "there are patterns of 16-bit words that cannot be achieved by any 8-bit string." I have no idea what you're getting at here. Taken literally, that's false; a 16-bit word is two 8-bit bytes. If you were talking about null-terminated strings, then you couldn't use ASCII at all. I know of glitches in NTFS and NTFS support where you can create filenames that can't be handled by normal programs, but that's not really a Unicode support issue. What you're talking about is unclear. ::Obviously it is technically possible to make a mapping from 8-bit to 16-bit strings that can produce all possible 16-bit strings. DUH! The problem is that the set of translators Windows provides for the fopen() call does not include one that can do it, despite an obvious candidate (UTF-8 with support for unpaired surrogates).[[User:Spitzak\|Spitzak]] ([[User talk:Spitzak\|talk]]) 17:07, 13 April 2018 (UTC) : "Python-3 and Qt do this" ... and Python 3 rejects some valid filenames on Linux that can't be treated as UTF-8. ::Again, holy crap. Here is a direct quote from my text: " (but far from perfect as they botch up filenames containing invalid UTF-8, these complications are why use of UTF-16 is strongly discouraged by many)." Did you even read before typing?[[User:Spitzak\|Spitzak]] ([[User talk:Spitzak\|talk]]) 17:07, 13 April 2018 (UTC) : Millions of people can and do use Unicode filenames on Windows everyday. 
If you really want to avoid all compatibility issues over multiple systems, differences in case sensitivity and normalization are going to bite you faster than any problem with Unicode names on Windows.--[[User:Prosfilaes|Prosfilaes]] ([[User talk:Prosfilaes|talk]]) 19:32, 12 April 2018 (UTC)
::And they are using software that was not written with portable api's. If you worked in an industry that uses source code from many sources you would know that we have to give up on any filenames that are not ASCII. A single program that uses a C++ library that takes a filename as a string (rather than an open file descriptor) will force your entire operation to ASCII-only filenames instantly. This is not a joke and it is a real problem.[[User:Spitzak|Spitzak]] ([[User talk:Spitzak|talk]]) 17:07, 13 April 2018 (UTC)
: {{ping|Spitzak}} I have no idea what you're talking about. "The problem is that the set of translators Windows provides for the fopen() call does not include one that can do it, despite an obvious candidate (UTF-8 with support for unpaired surrogates)" goes right into my "axe-grinding developer; not a real problem" pile. What's the problem here?
::They provide an API that allows some variable-width encodings, but refuse to support the one encoding everyone needs.[[User:Spitzak|Spitzak]] ([[User talk:Spitzak|talk]]) 18:17, 16 April 2018 (UTC)
: "No, that string has 2 bytes at the end that are the UTF-8 encoding. Any system (such as C# I guess) that turns it into a different byte string is broken." If it's text, then you don't know and shouldn't care how it's encoded, whether it's UTF-1, SCSU, UTF-9, or UTF-32. It's not a byte string; [https://www.mediawiki.org/wiki/Unicode_normalization_considerations MediaWiki normalizes] and thus turns anything you write here potentially into other byte strings.
::I am assuming the source code is in UTF-8.
If you really insist, write the string so it ends with "\xc3\xa6" which will produce the correct bytes even in brain-dead compilers that think the "locale" is more important than the actual literal encoding of the source file.[[User:Spitzak|Spitzak]] ([[User talk:Spitzak|talk]]) 18:17, 16 April 2018 (UTC)
: You say "they are using software that was not written with portable api's"; according to your definition of "it works on more than one computer", those APIs merely have to work on Windows 10 in the French in France locale to be portable. Which they do, because otherwise the French would be up in arms. Again, judging an operating system solely by languages developed at Bell Labs for Unix seems a bit ... parochial. And antiquated.
::This is a Microsoft-written library that is explicitly advertised as supporting an international standard api. And what they have will fail even if you want to "port" between Windows set to the French and the Russian locale, in that you will be unable to open the same set of files in those two locales.[[User:Spitzak|Spitzak]] ([[User talk:Spitzak|talk]]) 18:17, 16 April 2018 (UTC)
: I think I've finally figured out what you're going on about; Unix C/C++ uses char to support Unicode, whereas Windows expects wchar_t if you want to handle Unicode strings (including filenames)[https://msdn.microsoft.com/en-us/library/windows/desktop/dd317748(v=vs.85).aspx][https://msdn.microsoft.com/en-us/library/windows/desktop/dd374131(v=vs.85).aspx]. It might be frustrating that they have chosen this design feature, but it's hardly relevant here.--[[User:Prosfilaes|Prosfilaes]] ([[User talk:Prosfilaes|talk]]) 20:53, 13 April 2018 (UTC)
::I expect an api defined as being industry-standard to be able to open all files. I don't care how the system stores filenames internally as long as the translation from 8-bit byte strings is obvious. And there is a blindingly obvious method to convert the industry-standard api to these internal filenames.
The converse problem of translating 16-bit strings to 8-bit on Unix is much worse as there is not a good consensus (which is why, as you noticed, Python and Qt botch it often). So the fact is Microsoft has the really trivial easy job to fix this and they have not done so. Or are you really going to say that because internally it uses 16-bit units, that we should NEVER use 8-bit encodings? Really???[[User:Spitzak\|Spitzak]] ([[User talk:Spitzak\|talk]]) 18:17, 16 April 2018 (UTC)
::Here is a typical example of the thousands and thousands of patches that have been applied to "portable" source code to get it to work on Windows: https://cgit.freedesktop.org/cairo/commit/?id=84fc0ce91d1a57d20500f710abc0e17de82c67df This crap should NOT be necessary![[User:Spitzak\|Spitzak]] ([[User talk:Spitzak\|talk]]) 18:17, 16 April 2018 (UTC)
I don't see how Spitzak's arguments touch [https://www.unicode.org/versions/Unicode10.0.0/ch03.pdf#page=2 conformance as defined in the standard]. <small>[[Wikipedia:WikiLove\|Love]]</small> —[[:commons:User:LiliCharlie\|LiliCharlie]] <small>([[User talk:LiliCharlie\|talk]])</small> 21:50, 13 April 2018 (UTC)
:It has nothing to do with conformance. There was a simple sentence about the FACT that you cannot open Unicode-named files using the api that Microsoft uses that most or all C and C++ libraries use. Somebody up above tried to contradict it and it went down from there.[[User:Spitzak\|Spitzak]] ([[User talk:Spitzak\|talk]]) 18:17, 16 April 2018 (UTC)
:: You wrote "This makes it impossible to write portable software using the standard functions that works with Unicode filenames, therefore Windows does not support Unicode." In fact, it is impossible to use one API to access Unicode-named files on Windows, but you can use portable software in languages like Java and C# on Windows that works with Unicode filenames just fine.
A system can support Unicode without supporting C/C++ in any way, or in any sane way.--[[User:Prosfilaes\|Prosfilaes]] ([[User talk:Prosfilaes\|talk]]) 02:04, 17 April 2018 (UTC)
== "code-point" vs. "character" ==
How is the term "character" defined in Unicode and how does it differ from "codepoint"? I miss that information in the article. --[[Special:Contributions/62.224.160.232\|62.224.160.232]] ([[User talk:62.224.160.232\|talk]]) 14:17, 17 August 2016 (UTC)
:Unicode Standard sections [http://www.unicode.org/versions/Unicode9.0.0/ch02.pdf#G25564 2.4 Code Points and Characters] and [http://www.unicode.org/versions/Unicode9.0.0/ch03.pdf#G2212 3.4 Characters and Encoding] define the terms code point and abstract character. [[User:Drmccreedy\|DRMcCreedy]] ([[User talk:Drmccreedy\|talk]]) 18:20, 17 August 2016 (UTC)
::The [[code point]] article also covers this information. [[User:Drmccreedy\|DRMcCreedy]] ([[User talk:Drmccreedy\|talk]]) 18:26, 17 August 2016 (UTC)
== Use template:code? ==
Should we use the template like <nowiki>{{code\|U+012F}}</nowiki> for {{code\|U+012F}} to express Unicode text? To me it looks sound. -[[User:DePiep\|DePiep]] ([[User talk:DePiep\|talk]]) 01:25, 6 May 2010 (UTC)
:Yes, I think that is a good idea. [[User:BabelStone\|BabelStone]] ([[User talk:BabelStone\|talk]]) 14:05, 17 July 2010 (UTC)
::{{done}} Somewhat differently. See {{tl\|unichar}} -[[User:DePiep\|DePiep]] ([[User talk:DePiep\|talk]]) 22:02, 19 November 2010 (UTC) <br>
:::Violation of [[MOS:HEX]]. [[Special:Contributions/108.71.123.44\|108.71.123.44]] ([[User talk:108.71.123.44\|talk]]) 18:32, 8 October 2016 (UTC)
== Unicode 10.0 ==
This version has just been released today, can you add information for this into the article?
Proof from Emojipedia [[Special:Contributions/86.22.8.235\|86.22.8.235]] ([[User talk:86.22.8.235\|talk]]) 12:03, 20 June 2017 (UTC)
:I haven't seen anything on the Unicode site (http://www.unicode.org/) but will keep an eye out for an official announcement that 10.0 has been released. [[User:Drmccreedy\|DRMcCreedy]] ([[User talk:Drmccreedy\|talk]]) 18:01, 20 June 2017 (UTC)
::Version 10.0 now shows up as the latest version at http://www.unicode.org/standard/standard.html [[User:Drmccreedy\|DRMcCreedy]] ([[User talk:Drmccreedy\|talk]]) 18:44, 20 June 2017 (UTC)
:::And the [http://unicode.org/Public/UNIDATA/ data files] have been updated, so I think we can start updating Wikipedia now. [[User:BabelStone\|BabelStone]] ([[User talk:BabelStone\|talk]]) 19:23, 20 June 2017 (UTC)
== "Presentation forms" ==
Can someone explain to me what a "presentation form" is? I can't find an answer anywhere. [[User:Pariah24\|Pariah24]] ([[User talk:Pariah24\|talk]]) 11:19, 10 September 2017 (UTC)
:Nevermind; I found [http://unicode.org/faq/ligature_digraph.html this] [[User:Pariah24\|Pariah24]] ([[User talk:Pariah24\|talk]]) 11:23, 10 September 2017 (UTC)
== Is there a unicode symbol for "still mode"? ==
I mean this symbol: https://www.iso.org/obp/ui#iec:grs:60417:5554 [[User:Seelentau\|Seelentau]] ([[User talk:Seelentau\|talk]]) 18:16, 12 January 2018 (UTC)
:It seems not. [[User:BabelStone\|BabelStone]] ([[User talk:BabelStone\|talk]]) 19:03, 12 January 2018 (UTC)
== Censorship of recent thread on talk page ==
Contrary to [https://en.wikipedia.org/w/index.php?title=Talk:Unicode&diff=844563671&oldid=844563249 this edit's] edit summary, the discussion ''did'' have criticisms of, and suggestions for changes to, the content of this article, and discussed more implementations than Windows. Further, the article already discusses implementation specific issues. First I thought it was simply deleted by {{u\|Roeschter}} but at least he/she/they placed it in the archives.
Is this censorship of the talk page justified (perhaps on the unstated grounds that it violated NOTFORUM)? I don't believe so by its edit summary. [[User:DIYeditor\|—DIYeditor]] ([[User talk:DIYeditor\|talk]]) 18:52, 5 June 2018 (UTC)
== The zero-width space is a space ==
[[Special:Contributions/75.90.36.201\|75.90.36.201]] says that the "[[Zero width space\|zero-width space]]" is not a space. What is a space, though? It is a character that contains no points with an RGB color other than FFFFFF (or whatever the background color is). The zero-width space contains no such points and is therefore a space. (75.90.36.201 does admit that it is a character.) Of course, if there were a term for characters containing no points with a color other than FE3EE7, that term would also apply to the zero-width space.[[User:Peter M. Brown\|Peter Brown]] ([[User talk:Peter M. Brown\|talk]]) 21:45, 6 June 2018 (UTC)
: <code>U+200B</code> ZERO WIDTH SPACE has the [[Unicode character property]] <code>WSpace=no</code> (not a [[whitespace character]]). <small>[[Wikipedia:WikiLove\|Love]]</small> —[[User:LiliCharlie\|LiliCharlie]] <small>([[User talk:LiliCharlie\|talk]])</small> 22:17, 6 June 2018 (UTC)
::The cited <u>[[Unicode character property]]</u> article supports ''my'' point, including {{code\|U+200B}} among the "whitespace characters without Unicode character property 'WSpace=Y'". [[User:Peter M. Brown\|Peter Brown]] ([[User talk:Peter M. Brown\|talk]]) 18:36, 7 June 2018 (UTC)
== Suggestion for changing the lede ==
I have a couple of problems with the last paragraph (as of Mar 3, 2016) of the lede (lead). First, it continues to talk about UCS-2. UCS-2 IS OBSOLETE and it says so. So, why is it used as an example? It is poor pedagogy to explain an obsolete system and then compare an active system to it. Currently, the paragraph reads: "Unicode can be implemented by different character encodings. The most commonly used encodings are UTF-8, UTF-16 and the now-obsolete UCS-2.
UTF-8 uses one byte for any ASCII character, all of which have the same code values in both UTF-8 and ASCII encoding, and up to four bytes for other characters. UCS-2 uses a 16-bit code unit (two 8-bit bytes) for each character but cannot encode every character in the current Unicode standard. UTF-16 extends UCS-2, using one 16-bit unit for the characters that were representable in UCS-2 and two 16-bit units (4 × 8 bit) to handle each of the additional characters." The text "Unicode can be implemented" is a hyperlink to the article "Comparison of Unicode encodings". The hyperlink should be removed and a reference used, probably "[see Comparison of Unicode encodings]". This first sentence is terrible. It is not true that Unicode can be implemented by different encodings, in the sense that an encoding is NOT an implementation. Also: I don't think Unicode 8 is fully implemented by ANY program, anywhere. Unicode's codepoints ARE (not "can be") commonly encoded using UTF-8 and UTF-16. I suggest the following: "Unicode's codepoints are commonly encoded using UTF-8 and UTF-16. Other encodings, such as the now obsolete UCS-2 or the anglo-centric ASCII may also be encountered (ASCII defines 95 characters, UCS-2 allows up to 65 536 code points). Both UTF-8 and UTF-16 use a variable number of bytes for the codepoint they represent: UTF-8 uses between 1 and 4 bytes and UTF-16 uses either 2 or 4 bytes. Since 2007, when it surpassed ASCII, UTF-8 has been the dominant encoding of the World Wide Web with an estimated 86% of all web pages using it as of January 2016."[[User:Abitslow\|Abitslow]] ([[User talk:Abitslow\|talk]]) 22:47, 3 March 2016 (UTC)
I have never seen any Unicode other than UTF-8 (servers) and UTF-32 (JavaScript, and Python "unicode" objects). Shouldn't those two be listed as the two most popular forms? Basically you use UTF-8 unless you want to index individual characters; then you use UTF-32 in those special cases. Isn't that pretty much the whole story right now?
And then UTF-16 is of historical interest for Windows NT.
: Java is firmly 16-bit for characters, and every version of Windows since XP has been Windows NT, even if they don't call it that. C# and .NET use UTF-16, as well. What's most frequent is hard to tell, and depends on what you're measuring.--[[User:Prosfilaes\|Prosfilaes]] ([[User talk:Prosfilaes\|talk]]) 08:15, 28 September 2017 (UTC)
:16-bit code units are used plenty on Windows, all the system api has that, filenames in their newer filesystems use that, and many text files are written this way (that is becoming more rare rapidly however). Note there is a lot of confusion about whether Windows supports UTF-16 or UCS-2. Some software is "unaware" of UTF-16, but this does not mean it won't "work" with it. This is exactly the same reason that code designed for ASCII "works" with UTF-8. If all the unknown sequences are copied unchanged from input to output then it "works" by any practical definition. Unfortunately a lot of people think that unless the program contains code to actively parse multi-code-unit characters, or even to go to the point that the program must apply some special meaning to a subset of those characters, then it somehow is "broken" for that encoding and "does not support it", but that is a totally useless definition as it has nothing to do with whether it will actually fail. Therefore I think it is fine to clearly say "Windows uses UTF-16".[[User:Spitzak\|Spitzak]] ([[User talk:Spitzak\|talk]]) 19:30, 28 September 2017 (UTC)
::It strikes me that UCS-2 is not an encoding for the entire Unicode code space, but only a subset. (Likewise for ASCII). As encodings of subsets, both of them are special in that they match their Unicode subset not only in order, but in numerical value of the character code.
While the subset of Unicode covered by UCS-2 matches that of Unicode 1.1 in magnitude, the incompatible change in Hangul encoding in Unicode 2.0 means that UCS-2, if understood as matching the post 2.0 layout up to U+FFFF, is not a complete encoding of any version of Unicode. It seems to me, that distinction should be the basis for a reformulation that prioritizes encodings that cover all of Unicode. [[User:Ablaut490\|Ablaut490]] ([[User talk:Ablaut490\|talk]]) 00:25, 24 December 2018 (UTC)
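For readers following the thread above, the size claims (UTF-8 uses 1 to 4 bytes, UTF-16 uses one or two 16-bit units, UCS-2 covers only code points up to U+FFFF) can be checked with a minimal Python sketch. The function name <code>code_unit_counts</code> and the key <code>fits_ucs2</code> are hypothetical, chosen only for illustration, and "fits in UCS-2" is simplified here to "code point is at most U+FFFF", which ignores the surrogate range and the Hangul relocation mentioned above.

```python
def code_unit_counts(ch: str) -> dict:
    """Report how much space a single code point needs in each encoding form.

    Illustrative helper only (hypothetical name); expects a one-code-point string.
    """
    cp = ord(ch)
    return {
        "code_point": f"U+{cp:04X}",
        # Number of 8-bit bytes in the UTF-8 encoding (1 to 4).
        "utf8_bytes": len(ch.encode("utf-8")),
        # Number of 16-bit code units in UTF-16 (1, or 2 for a surrogate pair).
        "utf16_units": len(ch.encode("utf-16-le")) // 2,
        # Simplifying assumption: UCS-2 reaches only U+0000..U+FFFF.
        "fits_ucs2": cp <= 0xFFFF,
    }


# Z, U+00E6 (the "\xc3\xa6" example), U+20AC, and U+10A05 from earlier threads.
for sample in ["Z", "\u00e6", "\u20ac", "\U00010A05"]:
    print(code_unit_counts(sample))
```

Under these assumptions, only the last sample needs two UTF-16 units and falls outside what UCS-2 can represent, which is the sense in which UCS-2 encodes a subset rather than all of Unicode.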