Talk:Unicode: Difference between revisions

Revision as of 12:16, 14 July 2024 by Lowercase sigmabot III: m (Archiving 2 discussion(s) to Talk:Unicode/Archive 7) (bot)
Latest revision as of 21:44, 10 August 2025 by SineBot: m (Signing comment by Banovercheckcross - "")
(36 intermediate revisions by 13 users not shown)
Line 22:
\|leading_zeros=0 \|indexhere=yes}}

~~== Version 15 & Wikidata ==~~
~~I am adding new blocks & data to Wikidata now. Assuming no DAB needed here, the pages are:~~
~~{{bulleted list~~
~~\|[[Arabic Extended-C]] — [[Template:Unicode chart Arabic Extended-C]] → WD:{{Q\|Q113956924}}~~
~~\|[[Devanagari Extended-A]] — [[Template:Unicode chart Devanagari Extended-A]] → WD:{{Q\|Q113956904}}~~
~~\|[[Kawi (Unicode block)]] — [[Template:Unicode chart Kawi]] → WD:{{Q\|Q113956944}}~~
~~\|[[Kaktovik Numerals (Unicode block)]] — [[Template:Unicode chart Kaktovik Numerals]] → WD:{{Q\|Q113956957}}~~
~~\|[[Cyrillic Extended-D]] — [[Template:Unicode chart Cyrillic Extended-D]] → WD:{{Q\|Q113956962}}~~
~~\|[[Nag Mundari]] — [[Template:Unicode chart Nag Mundari]] → WD:{{Q\|Q113956955}}~~
~~\|[[CJK Unified Ideographs Extension H]] — [[Template:Unicode chart CJK Unified Ideographs Extension H]] → WD:{{Q\|Q113956966}}~~
}}
~~[[User:DePiep\|DePiep]] ([[User talk:DePiep\|talk]]) 16:10, 13 September 2022 (UTC)~~
~~:QID added -[[User:DePiep\|DePiep]] ([[User talk:DePiep\|talk]]) 16:33, 13 September 2022 (UTC)~~
~~:more listing -[[User:DePiep\|DePiep]] ([[User talk:DePiep\|talk]]) 18:02, 13 September 2022 (UTC)~~
~~::Not much time to complete this list, for me. [[User:DePiep\|DePiep]] ([[User talk:DePiep\|talk]]) 18:12, 13 September 2022 (UTC)~~
* Note that, as far as I can see, only two content articles require the "(Unicode block)" DAB-specifier, because of name overlap. The other "X (Unicode block)" pages should be redirects to their (unambiguously named) content Block article. See also {{tl\|Unicode blocks/overview}}. DePiep.
~~{{Recent changes in Unicode}}~~
By now, most 15.0 changes seem to be processed & updated. See Recent Changes for the current edit history. -[[User:DePiep\|DePiep]] ([[User talk:DePiep\|talk]]) 11:31, 21 September 2022 (UTC)
:As a list of version-15.0-changes needed or done, this list is incomplete.
[[User:DePiep\|DePiep]] ([[User talk:DePiep\|talk]]) 05:36, 24 October 2022 (UTC)

~~== New Taskforce WikiProject Unicode? ==~~
A proposal has been opened at [[Wikipedia_talk:WikiProject_Computing#Taskforce_WikiProject_Unicode_–_proposal\|WP:COMP § Taskforce WP Unicode – proposal]]. Please take a look. [[User:DePiep\|DePiep]] ([[User talk:DePiep\|talk]]) 09:35, 2 October 2022 (UTC)

~~== Code Points ==~~
The lead claims that there are currently 149 186 characters in the Standard. That's confusing! Is that actual characters or does it include unprintable code points? I know what a code point is; my point is that the lead shouldn't confuse code points with characters. (I also argue that a "control character" isn't 'really' a character, not a grapheme, but that's a fight for somewhere else.) Writing about Unicode without an early clear explanation of what a code point is, is -I think- awful pedagogy. In fact, I don't think code point - a fundamental aspect of Unicode - is even defined in the article!!!! Wow, just wow.
I also would like someone to verify that Unicode has characters for color. I believe that's wrong/false/misleading. I am aware that certain emoji can be modified by a code point to change some of its color. As far as I know, this is only true with a very small set of code points, and a very very small set of colors (I don't actually know if the colors are well-defined, I'd expect so, but...). These aren't colors, but are color modifiers for those other code points. [[Special:Contributions/174.130.71.156\|174.130.71.156]] ([[User talk:174.130.71.156\|talk]]) 16:00, 13 December 2022 (UTC)
:There are no color defining codes in Unicode but there are names of characters that specify a color if displayed on a color device. Searching the word color in the article shows some possibly confusing text about color but nothing outright wrong.
:This article leaves a lot to be desired; if you wish to make changes, you should. It's a wiki after all.
[[User:SchmuckyTheCat\|SchmuckyTheCat]] ([[User talk:SchmuckyTheCat\|talk]]) 05:43, 15 December 2022 (UTC)
There are two Variation Selectors (U+FE0E and U+FE0F) which specify whether an Emoji should ideally be displayed in color or black and white, but other than that, there are no color specifications in Unicode. The terms "character" and "code point" are specified in the Unicode Standard, and if you feel that the coverage here is inadequate in conveying the meaning of those terms, I absolutely encourage you to contribute content to better reflect their technical specification. For the record, any code point defined beyond "Not A Character" or "Reserved" is a "character". This means control characters and whitespace are all considered characters in Unicode, just like a letter in an alphabet, a Kanji with On and Kun readings, or a mathematical symbol. [[User:Vanisaac\|Van]][[User talk:Vanisaac\|Isaac]], GHTV<sup> [[Special:Contributions/Vanisaac\|cont]]</sup><sub style="margin-left:-3.5ex"><small>[[WP:WPWR\|WpWS]]</small></sub> 06:18, 15 December 2022 (UTC)

~~== Lead is simply wrong. ==~~
The offending sentence is: "The Unicode standard defines three and several other encodings exist, all in practice [[Variable-width encoding\|variable-length encodings]]." (Sure, you could strain to interpret that to mean "all but UTF-32", but let's keep it clear.) It clearly implies all encodings are variable length. Wikipedia's own article on UTF-32 says it is fixed length. (Because it only needs to use 21 of the 32 bits for Unicode code points, it is very inefficient and, afaik, rarely used.) But rarely used is not the same as "doesn't exist", and "all are variable" clearly implies it doesn't exist. I'd have to look again: are there really 3 variable Unicode encodings? I can only think of UTF-8 and UTF-16 (and some others that afaik are not "defined" in the Unicode standard (like GB18030), or that are obsolete (like UTF-7)).
Replace "all" with "all common encodings" or something similar, and mention UTF-32. [[Special:Contributions/174.130.71.156\|174.130.71.156]] ([[User talk:174.130.71.156\|talk]]) 11:43, 15 December 2022 (UTC)
:I think the intended meaning of this was that even if ''code points'' are fixed-size, modern Unicode is effectively variable-width, as what the user thinks is a "character" sometimes needs multiple code points. [[User:Spitzak\|Spitzak]] ([[User talk:Spitzak\|talk]]) 16:40, 15 December 2022 (UTC)
::Yes, Unicode includes both [[combining character]]s and [[precomposed character]]s, e.g., <{{U+\|0061}} "a" latin small letter a> <{{U+\|0308}} "¨" combining diaeresis> is equivalent to <{{U+\|00E4}} "ä" latin small letter a with diaeresis>. Further, some glyphs exist at multiple code points for historical reasons. There is a discussion of canonical forms in the Unicode standard. --[[User:Chatul\|Shmuel (Seymour J.) Metz Username:Chatul]] ([[User talk:Chatul\|talk]]) 21:57, 15 December 2022 (UTC)
::It seems odd to me to describe code points as "fixed size". They're just an abstract number. It's when you ''encode'' (or store) the code points that you get variable lengths, at least for UTF-8, UTF-EBCDIC, and UTF-16 as described in the article. I think combining characters are a red herring for this discussion. [[User:Drmccreedy\|DRMcCreedy]] ([[User talk:Drmccreedy\|talk]]) 23:10, 15 December 2022 (UTC)
:::The Unicode standard does restrict the number of code points, so describing them as fixed-length 21-bit or 32-bit data is reasonable. [[user:Spitzak\|Spitzak]] is referring to characters, which indeed are variable length, a separate issue from the length of an encoded code point that does deserve mention. --[[User:Chatul\|Shmuel (Seymour J.) 
Metz Username:Chatul]] ([[User talk:Chatul\|talk]]) 17:14, 16 December 2022 (UTC)

~~== Inline mentioning ==~~
I object to the [https://en.wikipedia.org/w/index.php?title=Unicode&diff=prev&oldid=1151049361 reversal] by {{U\|Peter M. Brown}}, citing [[WP:ITALICTITLE]] inappropriately. I'd say that the name, a noun, should not be in italics. ITALICTITLE refers to the name of a ''work'', i.e. the work itself (play, periodical, book). However, the Unicode standard is a ''standard'', not a book etc., not even its publication. The Standard is an abstraction: the set of rules. It is a proper noun, full stop. Key is, the article title notes the subject: the standard, not the book. [[User:DePiep\|DePiep]] ([[User talk:DePiep\|talk]]) 17:04, 21 April 2023 (UTC)
~~:{{ping\|Peter M. Brown}} -[[User:DePiep\|DePiep]] ([[User talk:DePiep\|talk]]) 10:43, 23 April 2023 (UTC)~~

~~== Why no section about missing graphemes? ==~~
I don't know if it would be manageable, but Unicode clearly does not have all commonly used symbols. A simple example is the very commonly used 'slash marks' used to count. Most reading this will be familiar with the sequence /, //, ///, ////, and <s>////</s> with the crossmark (strike-through) diagonal (top left to bottom right) rather than horizontal. (This is typical in the USA; I understand European convention is slightly different.) I request the editors to consider the addition of a list of missing (but documented) symbols. [[Special:Contributions/40.142.183.146\|40.142.183.146]] ([[User talk:40.142.183.146\|talk]]) 11:49, 9 June 2023 (UTC)
:Unicode's non-inclusion of tally marks is covered in {{slink\|Tally marks\|Unicode}}. I don't think it's a good idea to include it also in this article. That would open the door of listing every proposal that has not yet been accepted. [[User:Indefatigable\|Indefatigable]] ([[User talk:Indefatigable\|talk]]) 15:42, 9 June 2023 (UTC)
:I also oppose this idea.
The set of unencoded symbols is open-ended and may exceed the number of encoded symbols. There would also be no way to determine ''which'' unencoded symbols merit mention. [[User:Drmccreedy\|DRMcCreedy]] ([[User talk:Drmccreedy\|talk]]) 16:01, 9 June 2023 (UTC)

~~== Proposed new writing systems to be encoded into Unicode 16 ==~~
~~Unicode 16 is set to release in September 2024. I think the following (con)scripts definitely need to be encoded:~~
Chữ Việt Trí - an alphabet invented by Tôn Thất Chương in 2012 for the Vietnamese language. It's still nicer than Latin-based Quoc Ngu and needs wide recognition as the Shavian and Hangul did.
* Add support for Quikscript.
* Add extra missing runes from Baconsthorpe and Sedgeford and Armanen runes
* Possibly add something more.
~~[[Special:Contributions/94.180.80.9\|94.180.80.9]] ([[User talk:94.180.80.9\|talk]]) 07:31, 9 July 2023 (UTC)~~
:Take a look at Unicode's FAQ for [http://www.unicode.org/faq/char_proposal.html Submitting Successful Character and Script Proposals]. Wikipedia isn't affiliated with The Unicode Consortium so requests here won't be seen or acted upon by the people who can actually add characters/scripts to the Unicode Standard. [[User:Drmccreedy\|DRMcCreedy]] ([[User talk:Drmccreedy\|talk]]) 14:39, 9 July 2023 (UTC)

== Combining macron and acute in text referencing them separately ==
Line 132 ⟶ 64:
:I think I was the one who originally added this. Please do replace it with something more straightforward. [[User:Remsense\|<span style="border-radius:2px 0 0 2px;padding:3px;background:#1E816F;color:#fff">'''Remsense'''</span>]][[User talk:Remsense\|<span lang="zh" style="border:1px solid #1E816F;border-radius:0 2px 2px 0;padding:1px 3px;color:#000">诉</span>]] 23:10, 11 May 2024 (UTC)
::Done, using prose: {{tq\|in the range from 0 to {{val\|1114111}},}}...
[[User:Tarl_N.\|<b style="color:green">Tarl N.</b>]] ([[User talk:Tarl N.#top\|<span style="color:teal">discuss</span>]]) 08:01, 12 May 2024 (UTC)

== 308 characters not mentioned ==
The only detail for Unicode 1.0.1 is about 20902 CJK Unified Ideographs added, but in total 21204 characters were added and 6 were removed. In total, 308 characters were not mentioned at all. Did I miss something while reading the page? What happened to those characters? Can somebody at least explain it to me? Apologies in advance if I wasted your time. [[User:Mucksrunt\|Mucksrunt]] ([[User talk:Mucksrunt\|talk]]) 13:31, 26 August 2024 (UTC)
:The Unicode 1.0.1 changes were messy. They brought Unicode into alignment with [[Universal Coded Character Set\|ISO 10646]] and happened prior to the stability policies in place today. I don't come up with 308 characters but looking through the infoboxes for the various Unicode blocks (which I believe are accurate), I find these changes with Unicode version 1.0.1: [[Alphabetic Presentation Forms]] (+1), [[CJK Compatibility Ideographs]] (+302), [[CJK Symbols and Punctuation]] (+0), [[CJK Unified Ideographs]] (+20,902), [[Combining Diacritical Marks]] (+2), [[Cyrillic]] (-4), [[Enclosed CJK Letters and Months]] (-1), [[Greek and Coptic]] (-9), [[Hebrew]] (-1), [[Lao]] (-5), [[Miscellaneous Technical]] (-2), [[Thai]] (-5)<br/>Additionally, the range for [[Private Use Areas]] was expanded by 768 code points. [[User:Drmccreedy\|DRMcCreedy]] ([[User talk:Drmccreedy\|talk]])

== Input requested on Unicode block template redesign ==
Hey! On a lark, I decided to try a minor redesign of the Unicode block templates while fixing the pressing issue of dark mode support—see [[Wikipedia:Village pump (technical)#Unicode block template]] and tell me any thoughts you have, as I think it's probably worthwhile to at least refresh these templates.
<span style="border-radius:2px;padding:3px;background:#1E816F">[[User:Remsense\|<span style="color:#fff">'''Remsense'''</span>]]<span style="color:#fff"> ‥ </span>[[User talk:Remsense\|<span lang="zh" style="color:#fff">'''论'''</span>]]</span> 15:44, 11 September 2024 (UTC)
:I have two concerns on your proposed redesign: First, the link to the Unicode PDF chart is no longer obvious to the reader as it's now a reference as opposed to being clear in the chart heading. Easy access to the PDF is especially important for not widely supported code ranges. Second, consolidating the notes onto a single line is OK for most of the cases but will be harder to understand for charts with longer notes like [https://en.wikipedia.org/wiki/Template:Unicode_chart_Hangul_Jamo Template:Unicode chart Hangul Jamo]. [[User:Drmccreedy\|DRMcCreedy]] ([[User talk:Drmccreedy\|talk]]) 17:41, 11 September 2024 (UTC)
::Moving to a reference wasn't my idea directly, as I can see it either way. Per your second point, I would actually handle this by adding additional lines for those extra notes. <span style="border-radius:2px;padding:3px;background:#1E816F">[[User:Remsense\|<span style="color:#fff">'''Remsense'''</span>]]<span style="color:#fff"> ‥ </span>[[User talk:Remsense\|<span lang="zh" style="color:#fff">'''论'''</span>]]</span> 17:53, 11 September 2024 (UTC)
:::@[[User:Drmccreedy\|Drmccreedy]] I think I've finished iterating on the design for now in response to feedback here and at the Village Pump—I'm still not totally sure how/whether to display the default footnote and the PDF code chart reference, but other than that I think it's just about ready to consider deploying. Any further thoughts?
<span style="border-radius:2px;padding:3px;background:#1E816F">[[User:Remsense\|<span style="color:#fff">'''Remsense'''</span>]]<span style="color:#fff"> ‥ </span>[[User talk:Remsense\|<span lang="zh" style="color:#fff">'''论'''</span>]]</span> 05:01, 13 September 2024 (UTC)

== Unicode BMP Status ==
According to the Unicode Roadmap, the status is not categorised. I’ve tried to categorise them; here’s the result:
0000-058F Most basic LTR scripts
0590-08FF RTL scripts
0900-109F Most Asian and Indian scripts and languages
10A0-10FF Georgian (unique part)
1100-167F Larger scripts, including UCAS, Ethiopic and Hangul
1680-16FF Historical scripts
1700-1CFF Most Asian scripts, somewhat European
1D00-1FFF Latin and other basic LTR scripts
2000-2BFF Set of symbols, including punctuation and math and currency
2C00-2CFF Latin, Glagolitic (I don't know how to categorize them)
2C80-2E7F African scripts and most LTR scripts
2E80-9FFF CJK scripts, including Japanese, Hangul Jamo and ideographs
A000-A4FF Asian scripts
A500-A7FF Most LTR Scripts including the Medieval, African and Asian scripts
A800-ABFF Most Asian scripts
AC00-D7FF Hangul / Korean
D800-F8FF Surrogates & Private Use
F900-FFFF Mixed scripts, especially alternative or presentation forms
[[User:MarcoToa1\|MarcoToa1]] ([[User talk:MarcoToa1\|talk]]) 01:57, 26 May 2025 (UTC)
:I'm not sure where you are going with this. It looks like original research, which [[WP:NOR\|isn't allowed in Wikipedia articles]]. [[User:Drmccreedy\|DRMcCreedy]] ([[User talk:Drmccreedy\|talk]]) 14:34, 27 May 2025 (UTC)
:There seems to be a table like this at [[Plane_(Unicode)#Basic_Multilingual_Plane\|BMP]] that is where you want to go.
[[User:Spitzak\|Spitzak]] ([[User talk:Spitzak\|talk]]) 15:18, 27 May 2025 (UTC)
<!-- Template:Unsigned --><small class="autosigned">— Preceding [[Wikipedia:Signatures\|unsigned]] comment added by [[User:Banovercheckcross\|Banovercheckcross]] ([[User talk:Banovercheckcross#top\|talk]] • [[Special:Contributions/Banovercheckcross\|contribs]]) </small> <!--Autosigned by SineBot-->
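
For anyone revisiting the "Lead is simply wrong" thread above, here is a minimal Python sketch of the distinctions argued there: abstract code points versus their encoded width, and a combining sequence versus its precomposed equivalent. The helper name `encoded_lengths` is hypothetical, and the sketch only illustrates the discussion with standard-library calls; it is not a statement of how the article should be worded.

```python
import sys
import unicodedata


def encoded_lengths(ch: str) -> dict:
    """Return the byte length of one code point in UTF-8, UTF-16 and UTF-32.

    The "-le" codec variants are used so that no byte order mark is
    prepended, which would otherwise inflate the UTF-16/UTF-32 counts.
    """
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-le")),
        "utf-32": len(ch.encode("utf-32-le")),
    }


# UTF-8 and UTF-16 vary in width per code point; UTF-32 is always 4 bytes.
for ch in ("a", "\u00e4", "\u20ac", "\U0001F600"):
    print(f"U+{ord(ch):04X}", encoded_lengths(ch))

# The highest code point is U+10FFFF, i.e. 1114111 in decimal,
# matching the "0 to 1114111" range quoted in a later thread.
assert sys.maxunicode == 0x10FFFF == 1114111

# <U+0061, U+0308> is two code points, U+00E4 is one, yet NFC
# normalization maps the combining sequence onto the precomposed form.
decomposed = "a\u0308"
precomposed = "\u00e4"
assert len(decomposed) == 2 and len(precomposed) == 1
assert decomposed != precomposed
assert unicodedata.normalize("NFC", decomposed) == precomposed
```

Run as a script, this prints one line per sample code point. The UTF-32 length stays constant while the other two vary, which is the "all common encodings" caveat raised in that thread. The normalization check is the separate, user-perceived-character issue that Spitzak and Chatul describe.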