:Yeah, somewhere, just not on Wikipedia. See [[WP:ELNO]]. [[User:BabelStone|BabelStone]] ([[User talk:BabelStone|talk]]) 10:06, 11 March 2010 (UTC)
==Formatting References==
I've taken to formatting some of the bare URLs here, using the templates from [[WP:CT]]. [[User:Omirocksthisworld]]
== Unicode block names capitalization (Rename and Move) ==
: If you find reliable sources for criticism or even discussions for/against Unicode, feel free to add the material. However, criticism sections are not mandatory. There is none in the [[Oxygen]] article for example. --[[User:Mlewan|Mlewan]] ([[User talk:Mlewan|talk]]) 18:11, 26 September 2013 (UTC)
:: You are obviously joking. There must be sources, as files in Unicode format take twice the size of ANSI ones, and you cannot use simple table-lookup algorithms anymore. This information is just waiting for someone speaking English to make it public. [[Special:Contributions/178.49.18.203|178.49.18.203]] ([[User talk:178.49.18.203|talk]]) 11:38, 27 September 2013 (UTC)
:::I'm sorry, but you are mistaken. You are confusing scalar values with encodings. In Unicode, these are completely different entities. The UTF-8 byte value of {{
:::: Stateful encodings are not generally useful. On the other hand, there is the requirement to represent, let's say, the letter А as 1040 instead of some sane value like 192, and to implement complex algorithms to make lookups over tables 2M characters in size possible. And there is the requirement to use complex algorithms for the needs of obscure scripts. It is clearly a démarche to undermine software development in 2nd/3rd world countries, as 1st world ones can simply round-trip that Unicode hassle with trivial solutions. For the first world, 1 character is always 1 byte, like it always was. [[Special:Contributions/178.49.18.203|178.49.18.203]] ([[User talk:178.49.18.203|talk]]) 11:55, 28 September 2013 (UTC)
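The scalar-value-versus-encoding distinction discussed in this thread can be checked directly; a minimal sketch in Python (the thread itself names no language):

```python
# Cyrillic А is the scalar value U+0410, i.e. decimal 1040; the *encoding*
# of that scalar value is a separate question, answered per encoding form.
letter = "\u0410"  # CYRILLIC CAPITAL LETTER A

print(ord(letter))             # 1040: the code point (scalar value)
print(letter.encode("utf-8"))  # b'\xd0\x90': two bytes in UTF-8
print("A".encode("utf-8"))     # b'A': ASCII characters remain one byte in UTF-8
```

So UTF-8 does not double the size of ASCII-range text; only characters beyond ASCII cost extra bytes.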
এরা আপনার হৃদয়কে সারাজীবন আলোড়ীত করবে। তাই এদের সঙ্গ কখরো ত্যাগ করবেন না। খারাপ বন্ধু তা যতোই কাছের হোক না কেন, ত্যাগ করুন। নাহলে খারাপ চিন্তা আপনাকে আক্রান্ত করবে। মনে রাখবেন, ভাল চিন্তার চেয়ে খারাপ চিন্তাই মানুষকে বেশি আকর্ষন করে। ''(English: They will stir your heart for a lifetime, so never abandon their company. Give up a bad friend, however close; otherwise bad thoughts will take hold of you. Remember, bad thoughts attract people more than good ones.)'' <small><span class="autosigned">—Preceding [[Wikipedia:Signatures|unsigned]] comment added by [[User:Monitobd|Monitobd]] ([[User talk:Monitobd|talk]] • [[Special:Contributions/Monitobd|contribs]]) 12:34, 13 January 2010 (UTC)</span></small><!-- Template:Unsigned --> <!--Autosigned by SineBot-->
:This isn't [[Devanagari]] ([[Hindi]]). I used script recognition software to find out what language this is, and apparently it's "[[Bishnupriya Manipuri]]". Can anyone read it? I searched everywhere, and there's not a single online translator. Should I just ignore it... [[User:Indigochild777]]
It is Bengali (Bangla, India). <span style="font-size: smaller;" class="autosigned">— Preceding [[Wikipedia:Signatures|unsigned]] comment added by [[Special:Contributions/112.133.214.254|112.133.214.254]] ([[User talk:112.133.214.254|talk]]) 06:02, 2 January 2013 (UTC)</span><!-- Template:Unsigned IP --> <!--Autosigned by SineBot-->
As of September 2016, however, most browsers (namely [[Microsoft Edge]], [[Internet Explorer]], [[Google Chrome]] and [[Mozilla Firefox]]) are unable to properly display the following Unicode writing systems by default:
*[[Balinese alphabet]] (ᬅᬓ᭄ᬱᬭᬩᬮᬶ)
*[[Batak alphabet]] (ᯘᯮᯮᯒᯖ᯲ ᯅᯖᯂ᯲, also used for the Karo, Simalungun, Pakpak and Angkola-Mandailing languages)
: <code>U+200B</code> ZERO WIDTH SPACE has the [[Unicode character property]] <code>WSpace=no</code> (not a [[whitespace character]]). <small>[[Wikipedia:WikiLove|Love]]</small> —[[User:LiliCharlie|LiliCharlie]] <small>([[User talk:LiliCharlie|talk]])</small> 22:17, 6 June 2018 (UTC)
::The cited <u>[[Unicode character property]]</u> article supports ''my'' point, including {{code|U+200B}} among the "whitespace characters without Unicode character property 'WSpace=Y'". [[User:Peter M. Brown|Peter Brown]] ([[User talk:Peter M. Brown|talk]]) 18:36, 7 June 2018 (UTC)
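The property both editors are arguing about is easy to confirm programmatically; a sketch using Python's standard <code>unicodedata</code> module (whose classification follows the Unicode Character Database):

```python
import unicodedata

zwsp = "\u200b"  # ZERO WIDTH SPACE
# U+200B has general category Cf (format character), not Zs (space separator),
# and carries WSpace=No, so it is not whitespace in the property sense.
print(unicodedata.category(zwsp))  # Cf
print(zwsp.isspace())              # False
print(" ".isspace())               # True: U+0020 has WSpace=Yes
```

In other words, the character looks like a space to a layout engine but is not a whitespace character by Unicode property.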
== Suggestion for changing the lede ==
I have a couple of problems with the last paragraph (as of Mar 3, 2016) of the lede (lead). First, it continues to talk about UCS-2. UCS-2 IS OBSOLETE and it says so. So, why is it used as an example? It is poor pedagogy to explain an obsolete system and then compare an active system to it. Currently, the paragraph reads:
"Unicode can be implemented by different character encodings. The most commonly used encodings are UTF-8, UTF-16 and the now-obsolete UCS-2. UTF-8 uses one byte for any ASCII character, all of which have the same code values in both UTF-8 and ASCII encoding, and up to four bytes for other characters. UCS-2 uses a 16-bit code unit (two 8-bit bytes) for each character but cannot encode every character in the current Unicode standard. UTF-16 extends UCS-2, using one 16-bit unit for the characters that were representable in UCS-2 and two 16-bit units (4 × 8 bit) to handle each of the additional characters."
The text "Unicode can be implemented" is a hyperlink to the article "Comparison of Unicode encodings".
The hyperlink should be removed and a reference used, probably "[see Comparison of Unicode encodings]". This first sentence is terrible. It is not true that Unicode can be implemented by different encodings, in the sense that an encoding is NOT an implementation. Also: I don't think Unicode 8 is fully implemented by ANY program, anywhere. Unicode's codepoints ARE (not "can be") commonly encoded using UTF-8 and UTF-16. I suggest the following: "Unicode's codepoints are commonly encoded using UTF-8 and UTF-16. Other encodings, such as the now-obsolete UCS-2 or the Anglo-centric ASCII, may also be encountered (ASCII defines 95 characters; UCS-2 allows up to 65,536 code points). Both UTF-8 and UTF-16 use a variable number of bytes for the codepoint they represent: UTF-8 uses between 1 and 4 bytes and UTF-16 uses either 2 or 4 bytes. Since 2007, when it surpassed ASCII, UTF-8 has been the dominant encoding of the World Wide Web, with an estimated 86% of all web pages using it as of January 2016." [[User:Abitslow|Abitslow]] ([[User talk:Abitslow|talk]]) 22:47, 3 March 2016 (UTC)
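The byte counts in the wording discussed above can be illustrated directly; a sketch in Python, using big-endian UTF-16 so the byte count is not inflated by a BOM:

```python
# UTF-8 uses 1-4 bytes per code point; UTF-16 uses 2 or 4 (one or two 16-bit units).
for ch in ("A", "\u00e9", "\u20ac", "\U0001d11e"):  # A, é, €, 𝄞
    utf8 = ch.encode("utf-8")
    utf16 = ch.encode("utf-16-be")  # "-be" avoids the 2-byte BOM prefix
    print(f"U+{ord(ch):04X}: UTF-8 {len(utf8)} bytes, UTF-16 {len(utf16)} bytes")
# U+0041: UTF-8 1 byte,  UTF-16 2 bytes
# U+00E9: UTF-8 2 bytes, UTF-16 2 bytes
# U+20AC: UTF-8 3 bytes, UTF-16 2 bytes
# U+1D11E: UTF-8 4 bytes, UTF-16 4 bytes
```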
I have never seen any Unicode other than UTF-8 (servers) and UTF-32 (JavaScript, and Python "unicode" objects). Shouldn't those two be listed as the two most popular forms? Basically you use UTF-8 unless you want to index individual characters; then you use UTF-32 in those special cases. Isn't that pretty much the whole story right now? And then UTF-16 is of historical interest for Windows NT.
: Java is firmly 16-bit for characters, and every version of Windows since XP has been Windows NT, even if they don't call it that. C# and .NET use UTF-16, as well. What's most frequent is hard to tell, and depends on what you're measuring.--[[User:Prosfilaes|Prosfilaes]] ([[User talk:Prosfilaes|talk]]) 08:15, 28 September 2017 (UTC)
:16-bit code units are used plenty on Windows: all the system APIs use them, filenames in the newer filesystems use them, and many text files are written this way (though that is rapidly becoming rarer). Note there is a lot of confusion about whether Windows supports UTF-16 or UCS-2. Some software is "unaware" of UTF-16, but this does not mean it won't "work" with it. This is exactly the same reason that code designed for ASCII "works" with UTF-8: if all the unknown sequences are copied unchanged from input to output, then it "works" by any practical definition. Unfortunately a lot of people think that unless the program contains code to actively parse multi-code-unit characters, or even applies some special meaning to a subset of those characters, it is somehow "broken" for that encoding and "does not support it"; but that is a totally useless definition, as it has nothing to do with whether it will actually fail. Therefore I think it is fine to clearly say "Windows uses UTF-16". [[User:Spitzak|Spitzak]] ([[User talk:Spitzak|talk]]) 19:30, 28 September 2017 (UTC)
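To make the UTF-16 point in this thread concrete: characters outside the Basic Multilingual Plane are stored as two 16-bit code units, a surrogate pair. A sketch of the arithmetic in Python:

```python
# Encode U+1D11E (MUSICAL SYMBOL G CLEF) as a UTF-16 surrogate pair by hand.
cp = 0x1D11E
v = cp - 0x10000                # 20-bit value split across the two units
high = 0xD800 + (v >> 10)      # lead (high) surrogate
low = 0xDC00 + (v & 0x3FF)     # trail (low) surrogate
print(hex(high), hex(low))     # 0xd834 0xdd1e

# The hand computation matches what a real encoder produces:
assert chr(cp).encode("utf-16-be") == high.to_bytes(2, "big") + low.to_bytes(2, "big")
```

Code that merely copies these units through unchanged, as described above, never needs to perform this arithmetic.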
::It strikes me that UCS-2 is not an encoding for the entire Unicode code space, but only a subset. (Likewise for ASCII). As encodings of subsets, both of them are special in that they match their Unicode subset not only in order, but in numerical value of the character code. While the subset of Unicode covered by UCS-2 matches that of Unicode 1.1 in magnitude, the incompatible change in Hangul encoding in Unicode 2.0 means that UCS-2, if understood as matching the post 2.0 layout up to U+FFFF, is not a complete encoding of any version of Unicode. It seems to me, that distinction should be the basis for a reformulation that prioritizes encodings that cover all of Unicode. [[User:Ablaut490|Ablaut490]] ([[User talk:Ablaut490|talk]]) 00:25, 24 December 2018 (UTC)
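The numerical-identity observation above is easy to verify: for BMP code points outside the surrogate range, the single UTF-16 code unit *is* the code point, which is exactly the UCS-2 form. A Python sketch:

```python
# For any non-surrogate code point up to U+FFFF, UCS-2 and UTF-16 coincide,
# and the 16-bit code unit equals the scalar value itself.
for cp in (0x0041, 0x0410, 0x20AC, 0xFFFD):
    unit = chr(cp).encode("utf-16-be")
    assert len(unit) == 2                       # exactly one 16-bit unit
    assert int.from_bytes(unit, "big") == cp    # unit value == code point
print("BMP code units equal their code points")
```

ASCII has the same order-and-value property for its 128 code points, which is what makes both of them subset encodings rather than encodings of the full code space.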