Talk:Unicode/Archive 7: Difference between revisions

m Archiving 1 discussion(s) from Talk:Unicode) (bot
==Other persisting "anomalies"==
The "combining class" priorities assigned to Hebrew diacritics in the early 1990s are incorrect and semi-worthless, which means that older software displays the diacritics incorrectly, while more recent software has to work around it, but apparently this is also set in stone, and nothing can be done to fix it... [[User:AnonMoos|AnonMoos]] ([[User talk:AnonMoos|talk]]) 03:06, 4 February 2019 (UTC)
 
== Names or [[glyphs]]? Response to [[User:Prosfilaes|Prosfilaes]] ==
 
Prosfilaes has [https://en.wikipedia.org/w/index.php?title=Unicode&curid=31742&diff=883970443&oldid=883708807 reverted] my replacement of code point names with glyphs, holding that "in explaining the architectures, names are more important than glyphs". I disagree. The official names play no role in the structure of Unicode. Some code points like U+0009, the tab character, do not even have official names and, of those that do, some are incorrect (see [[Talk:Unicode#Number_of_issues.|above]]) and others, like LATIN LETTER SMALL CAPITAL Q (which displays a capital letter that seemingly claims to be small) are confusing. The Unicode Standard nowhere says that anything depends on the name of a code point.
 
A code point with a graphic "basic type", which most of the assigned code points have, determines the general shape of its associated glyphs. The additional designation of a font makes the shape precise, and adding the point size completes the glyph specification. Code points are of interest mainly because of this association with glyphs.
 
In lower case, the Greek letter ''sigma'' has two code points, U+03C2 and U+03C3. The first applies when the letter occurs at the end of a word, the second when it occurs elsewhere. Why two, when it's the same letter, pronounced the same way? Only because the shape is not even roughly the same, ς for U+03C2 and σ for U+03C3. Glyphs that differ so radically can never represent the same code point. Unlike anything having to do with official names, this is a basic feature of Unicode architecture.
 
In contrast, the exclamation mark ' ! ' is used for the [[factorial]] function in mathematics as well as a punctuation mark ending a sentence emphatically. These are two very different uses with nothing in common but the glyph in each applicable font, yet they have the same code point, U+0021. They are not distinguished in Unicode because the distinction has no consequence for glyphs.
 
One cannot always use a glyph to designate a code point uniquely. The glyph ' P ' can represent U+0050 (the first letter in Prosfilaes' username and mine), U+03A1 (the Greek letter ''rho''), or U+0420 (the Cyrillic letter ''er''). Unique designation is usually possible, though, and—when it is—presenting glyphs as I did in the reverted text is more helpful to the average reader than is presenting the name.
 
Prosfilaes also complains that 𑀈, my example of a non-BMP character, looks too much like a plus sign, which is in the BMP. That hadn't occurred to me, but another non-BMP code point could certainly be used.
 
[[User:Peter M. Brown|Peter Brown]] ([[User talk:Peter M. Brown|talk]]) 16:55, 20 February 2019 (UTC)
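
The code points mentioned above (the tab character, the two lower-case sigmas, and the three lookalikes of ' P ') can be inspected with a short script. This is only a sketch, assuming Python 3 and its standard <code>unicodedata</code> module; the helper <code>describe</code> is a hypothetical name introduced for illustration and is not taken from the discussion or from the Unicode Standard.

```python
import unicodedata


def describe(ch):
    """Return (U+ notation, character name or None) for one character.

    Hypothetical helper for illustration; unicodedata.name() returns the
    supplied default (None here) when a code point has no character name.
    """
    return f"U+{ord(ch):04X}", unicodedata.name(ch, None)


# Three code points that commonly share one glyph shape.
lookalikes = [describe(c) for c in "\u0050\u03A1\u0420"]

# The two lower-case sigmas: final form first, then the ordinary form.
sigmas = [describe(c) for c in "\u03C2\u03C3"]

# A control character: no character name is available for it.
tab = describe("\t")

for entry in lookalikes + sigmas + [tab]:
    print(entry)
```

The script reports three different code points and names for the lookalikes, two for the sigmas, and no name at all for the tab character.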
 
: Unicode encodes characters, not glyphs. Identical glyphs may be used to represent different characters (as, typically, {{unichar|41|LATIN CAPITAL LETTER A}}, {{unichar|391|GREEK CAPITAL LETTER ALPHA}}, and {{unichar|410|CYRILLIC CAPITAL LETTER A}}), and completely different glyphs may represent the same character ({{unichar|41|LATIN CAPITAL LETTER A}} may look like 𝖠, 𝒜, 𝔄, etc.).
: Specifically, typical glyphs representing the character {{unichar|F7|DIVISION SIGN}} can easily be confused with {{unichar|2797|HEAVY DIVISION SIGN}}, {{unichar|1365|ETHIOPIC COLON}} or {{unichar|223B|HOMOTHETIC}}, while the "two-dot shape" of {{unichar|11008|BRAHMI LETTER II}} looks like {{unichar|A58C|VAI SYLLABLE JOO}}, and its "four-dot shape" resembles {{unichar|2E2C|SQUARED FOUR DOT PUNCTUATION}}, {{unichar|2237|PROPORTION}}, {{unichar|26DA|DRIVE SLOW SIGN}}, {{unichar|2D46|TIFINAGH LETTER TUAREG YAKH}}, {{unichar|1362|ETHIOPIC FULL STOP}}, and several of the Braille patterns.
: There is no way to confidently identify an isolated character when you only see a glyph that visualises it. It is necessary to give its semantics which in most cases is reflected by its character name. <small>[[Wikipedia:WikiLove|Love]]</small>&nbsp;—[[:commons:User:LiliCharlie|LiliCharlie]]&nbsp;<small>([[User talk:LiliCharlie|talk]])</small> 18:54, 20 February 2019 (UTC)
: I think this is far overbroad for the edit in question. The dispute is between "For code points in the [[Basic Multilingual Plane]] (BMP), four digits are used (e.g., U+0058 for the character LATIN CAPITAL LETTER X); for code points outside the BMP, five or six digits are used, as required (e.g., U+E0001 for the character LANGUAGE TAG and U+10FFFD for the character PRIVATE USE CHARACTER-10FFFD)." and " For code points in the [[Basic Multilingual Plane]] (BMP), four digits are used (e.g. U+00F7 for the character ÷); for code points outside the BMP, five or six digits are used, as required (e.g. U+11008 for the character 𑀈)." As I said, this is about architecture. Yes, there are confusing names for certain Unicode characters, but this is about how many digits are used to represent that Unicode character. It doesn't matter at this layer if a name is confusing or how it might map to glyphs or user-perceived characters; just that there exists a code point labeled LATIN CAPITAL LETTER X and that it is also referenced as U+0058.--[[User:Prosfilaes|Prosfilaes]] ([[User talk:Prosfilaes|talk]]) 21:20, 20 February 2019 (UTC)
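
The digit rule in the disputed sentence (four hexadecimal digits inside the BMP, five or six outside it, as required) can be sketched in a few lines. This assumes Python 3; <code>u_plus</code> and <code>in_bmp</code> are hypothetical helper names chosen for illustration, not part of any library.

```python
def u_plus(ch):
    """Format one character in U+ notation (hypothetical helper).

    The value is zero-padded to at least four hexadecimal digits, so BMP
    code points get exactly four and higher ones get five or six.
    """
    return f"U+{ord(ch):04X}"


def in_bmp(ch):
    """True if the code point lies in the Basic Multilingual Plane."""
    return ord(ch) <= 0xFFFF


# The five code points cited in the two competing wordings.
examples = ["\u0058", "\u00F7", "\U00011008", "\U000E0001", "\U0010FFFD"]
labels = [u_plus(ch) for ch in examples]
planes = [in_bmp(ch) for ch in examples]

for label, bmp in zip(labels, planes):
    print(label, "BMP" if bmp else "outside the BMP")
```

Nothing in the formatting depends on a character name or a glyph; only the numeric value of the code point is used.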
 
::Overbroad, perhaps, but I do want to respond to LiliCharlie, who claimed that "completely different glyphs may represent the same character", challenging my claim that "Glyphs that differ...radically can never represent the same code point." As support, LiliCharlie writes, "{{unichar|41|LATIN CAPITAL LETTER A}} may look like 𝖠, 𝒜, 𝔄, etc." This is supportive, however, only if LiliCharlie can name fonts in which U+0041 has 𝒜 and 𝔄, respectively, as glyphs. As far as I can determine, 𝒜 has code point U+1D49C and 𝔄 has code point U+1D504. A font in which U+0041 has 𝒜 as a glyph would hardly be a sufficient challenge anyhow, as this is quite similar to A. 𝔄 admittedly differs radically, so a font in which it represents U+0041 would definitely count against my claim.
 
::I challenge LiliCharlie to explain why, in lower case, medial sigma (σ) and final sigma (ς) are assigned different code points while medial and final lower-case theta (θ) both have the one code point U+03B8. The obvious answer, though there may be another, is that the glyphs for lower-case sigma, in most or all applicable fonts, are very different.
 
::Returning to the original dispute with Prosfilaes, the choice is between
 
:::For code points in the [[Basic Multilingual Plane]] (BMP), four digits are used (e.g., U+0058 for the character LATIN CAPITAL LETTER X); for code points outside the BMP, five or six digits are used, as required (e.g., U+E0001 for the character LANGUAGE TAG and U+10FFFD for the character PRIVATE USE CHARACTER-10FFFD).
 
::and
 
:::For code points in the [[Basic Multilingual Plane]] (BMP), four digits are used (e.g. U+00F7 for the character ÷); for code points outside the BMP, five or six digits are used, as required (e.g. U+11008 for the character 𑀈).
 
::At least for the sake of argument, I concede Prosfilaes' point that 𑀈 looks too much like a plus sign and propose, in the final parenthesis,
 
:::e.g. U+2395C for the character 𣥜
 
::The real question is which phrasing is more accommodating for the typical reader. I disagree that "It doesn't matter at this layer if a name is confusing"; confusing names will confuse, which contravenes Wikipedia's objectives. The use of capitals is off-putting, especially as the reader has not been advised earlier (or, indeed, anywhere in the article) that letters in official Unicode names have to be capitalized. There is no explanation of what a language tag is; the phrase is simply sprung on the unsuspecting reader. Likewise with "private use", a phrase appearing in a quote from Joe Becker but never explained.
 
::For readers already acquainted with Unicode conventions, these considerations are not relevant. Such folks, however, are not the intended audience for Wikipedia articles.
 
::[[User:Peter M. Brown|Peter Brown]] ([[User talk:Peter M. Brown|talk]]) 23:53, 21 February 2019 (UTC)
 
::: [https://www.fonts.com/font/linotype/luthersche-fraktur Luthersche Fraktur] was the first one I found, and [https://www.fonts.com/search/all-fonts?ShowAllFonts=All&searchtext=Fraktur a search for Fraktur fonts] shows that many of them use the glyph form 𝔄.
::: We're talking about code points, not characters. You're adding confusion by saying "For code points" and then saying "the character ÷". I think you're underestimating the type of reader who is reading this article, or underestimating the difficulty of the rest of the article. The fact that names are capitalized is something that you learn about Unicode by exposure, and again, for the audience, is something they'll just absorb. Anyone with any familiarity with character encoding in computers will expect that there's control characters in Unicode, like LANGUAGE TAG.
::: I object to the use of 𣥜, since that implies that Chinese is outside the BMP. Hieroglyphs or other clearly ancient script, that's completely outside the BMP, should be used, or possibly an emoji. You're giving up the ability to show a six-digit name if you insist on using characters.--[[User:Prosfilaes|Prosfilaes]] ([[User talk:Prosfilaes|talk]]) 01:13, 22 February 2019 (UTC)
 
@[[User:Peter M. Brown|Peter Brown]]: 1. There are two major reasons why {{unichar|3C2|GREEK SMALL LETTER FINAL SIGMA}} and {{unichar|3C3|GREEK SMALL LETTER SIGMA}} were encoded separately. The first, and already sufficient, one was to ensure round-trip compatibility with encodings that had existed before Unicode, and in which the two characters were also encoded separately. And reason number two is that there are exceptions to the rule that {{angle|ς}} is used word-finally and {{angle|σ}} elsewhere, see Nick Nicholas's [http://www.opoudjis.net/dist/sigma.html ''Sigma: final vs. non-final''] which is part of the [[Thesaurus Linguae Graecae]] project. — 2. The Fraktur smart font I most often use is [http://unifraktur.sourceforge.net/maguntia.html UnifrakturMaguntia]. Its glyph for {{unichar|41|LATIN CAPITAL LETTER A}} is, of course, similar to 𝔄. <small>[[Wikipedia:WikiLove|Love]]</small>&nbsp;—[[:commons:User:LiliCharlie|LiliCharlie]]&nbsp;<small>([[User talk:LiliCharlie|talk]])</small> 10:47, 22 February 2019 (UTC)
 
:@Prosfilaes:
 
:I don't see how I'm adding confusion by saying "For code points" and then saying "the character ÷". Saying "For code points" and then saying "the character LATIN CAPITAL LETTER X" is no less guilty of confusing code points with characters. The English letter string 'LATIN CAPITAL LETTER X' is neither a code point nor a character, nor is the glyph '÷'. Both only ''designate'' characters. '÷' has the advantage that it does not presuppose any familiarity with Latin or any other well-known script. Further, any reader who <u>is</u> familiar with Latin would take exception to "LATIN CAPITAL LETTER W", an official Unicode name, since Latin did not have a W. Better just to refer to "the capital letter W".
 
:You write:
 
::The fact that names are capitalized is something that you learn about Unicode by exposure, and again, for the audience, is something they'll just absorb.
 
:This is hardly necessary. An encyclopedia is supposed to <u>tell</u> the reader things, not just expose them to usages. Even if this information is added to the article, though, "the character LATIN CAPITAL LETTER X" will strike the reader—strikes me, anyhow—as odd, since a letter string is not a character. Referring to "the English character X", (thereby distinguishing it from the Greek character &Chi;) would be much better.
 
:Yes, one expects control characters, but why not something with a name familiar to the typical reader like the carriage return U+000D?
 
:As you say, a hieroglyph would be preferable to 𣥜.
 
:@LiliCharlie: Point taken.
 
:[[User:Peter M. Brown|Peter Brown]] ([[User talk:Peter M. Brown|talk]]) 19:19, 22 February 2019 (UTC)
 
::Thousands of Wikipedia articles refer to Unicode characters by their official names in capitalized form. The reason for this is that the names are unique and normatively identify the character referred to. If we were to abandon the official Unicode character names and devise our own names (which would be original research) then there would be endless disputes about the names. You prefer to refer to "X" as "English character X" yet you must know that X is used for hundreds of other languages, so referring to "X" as an "English character" would be totally unacceptable — which is why LATIN CAPTIAL LETTER X is so much better way of referring to the character. [[User:BabelStone|BabelStone]] ([[User talk:BabelStone|talk]]) 21:42, 22 February 2019 (UTC)
 
Why is Latin "so much better" than English? Granted, the English and Latin X is also the German and Swedish X, but we need to apply some adjective—Latin, English, German, whatever—to distinguish it from the Greek &Chi;, which really is a different character. In en.wikipedia.org, the character can be clearly designated as the "English character X". In sv.wikipedia.org, it would be clearer to call it the "Svenska bokstaven X". Neither is "totally unacceptable".
 
Choosing a locution maximally clear to the expected reader is not original research. It is not research at all. Even misspelling "capital", as you did above, engenders no problem—we all know what you meant.
 
[[User:Peter M. Brown|Peter Brown]] ([[User talk:Peter M. Brown|talk]]) 23:36, 22 February 2019 (UTC)
 
: Latin, especially LATIN, is much better than English, because the English character X seems to label something English-specific, whereas Latin is more likely to be taken as referring to [[Latin script]]; even if you're not familiar with that phrase, most people should recognize Latin is the ancestor of our script and take it as generic.
: I think the question comes down to learning styles, and while I'm not sure mine is better, I do think it's more encyclopedic to separate levels and talk here about the code-point level and how you write code points, like U+0050, without trying to drag in what the code points mean here. --[[User:Prosfilaes|Prosfilaes]] ([[User talk:Prosfilaes|talk]]) 06:01, 23 February 2019 (UTC)
 
: This must be a joke. While there are letters of the English alphabet (≈Latin letters regularly used in English) and punctuation marks regularly used in English, there is nothing like an "English X", a "Commonwealth English Æ" (as in ''encyclopædia'') or an "English full stop/​English period." The {{angle|X}} in ''“[[Xi'an]] is beautiful.”'' is neither a "Chinese [[Pinyin]] X" nor an "English X"; it's just the Latin [[Script (Unicode)|script]] capital letter X that is a common element of the English, the Chinese Pinyin, the Latin, and many other [[writing system]]s. <small>[[Wikipedia:WikiLove|Love]]</small>&nbsp;—[[:commons:User:LiliCharlie|LiliCharlie]]&nbsp;<small>([[User talk:LiliCharlie|talk]])</small> 13:34, 23 February 2019 (UTC)
 
Once again, the wording in question has read:
 
:For code points in the [[Basic Multilingual Plane]] (BMP), four digits are used (e.g., U+0058 for the character LATIN CAPITAL LETTER X); for code points outside the BMP, five or six digits are used, as required (e.g., U+E0001 for the character LANGUAGE TAG and U+10FFFD for the character PRIVATE USE CHARACTER-10FFFD).
 
This violates [[MOS:ALLCAPS]], according to which one should use capital letters for Unicode names only "when presenting tables of Unicode data, and when discussing code point names as such. Otherwise prefer unstyled, plain-English character names". The passage in question is a discussion of the designation of code points in the 'U+' format, <u>not</u> of code point names as such.
 
Adopting [[User:Prosfilaes|Prosfilaes]]' suggestion that a hieroglyph be used and acknowledging [[:commons:User:LiliCharlie|LiliCharlie]]'s objection to "the English X", I am bringing the passage into accord with the MOS by replacing it with
 
:For code points in the [[Basic Multilingual Plane]] (BMP), four digits are used (e.g. U+0058 for the character 'X' in English and related languages); for code points outside the BMP, five or six digits are used, as required (e.g. U+13254 for the Egyptian [[hieroglyph]] '[[File:Hieroglyph — reed shelter.png|text-bottom|15px]]').
 
[[User:Peter M. Brown|Peter Brown]] ([[User talk:Peter M. Brown|talk]]) 19:06, 24 February 2019 (UTC)