Revision as of 11:03, 11 September 2019 by BabelStone (→Existing charts: control characters)
Revision as of 16:15, 11 September 2019 by Cobaltcigs (update)
Line 66:
# Do the charts get created every time they're displayed? If so, do we care about the extra processing incurred?
# How to handle fonts? I saw the post at [[Template talk:Script#Module:Unicode chart]] and the notes above so I know this is a known issue.
# {{done}} How to handle a varying number of reserved characters? The current charts leave off the "Gray areas" notice if there are no non-assigned code points because having the "gray areas" notice for those blocks would be confusing. And the wording changes if there is only one non-assigned code point.
# {{done}} How to handle charts with additional footnotes? For example, [[Template:Unicode chart Arabic]]. And for the existing charts, the notes are indeed valuable.
# {{done}} How to handle non-characters? For example, U+FDD0-FDEF in [[Template:Unicode chart Arabic Presentation Forms-A]].
# How to handle combining marks (which are referenced above)? Some charts have special additions for some combining characters. For example, U+A980 in [[Template:Unicode chart Javanese]] uses a dotted circle. Other combining marks, like U+1D242 in [[Template:Unicode chart Ancient Greek Musical Notation]] use a non-breaking space. Some combining marks use no additional character at all.
# How to handle characters with dashed boxes? For example, U+0600-0605, 061C, and 06DD in the [[Template:Unicode chart Arabic]] chart.
Line 84:
2 (and 3). [https://i.imgur.com/WuSuoAT.png Here are] four profiler outputs for the [[Template:Unicode chart/testcases|testcases]] page. Note that this is the total churning of five {{tl|unicode chart}}s transcluded on the same page (indirectly through the {{tl|test case}} template/module in fact). Even with those factors the processing stats are at a small fraction of allowable limits in every case except for <code>ifexist</code> (which should probably be the first feature taken out). Actual overhead in the wild would be lower.
Based on the percentages at the bottom, it looks like the single worst bottleneck is the grand <code>#switch</code> statement at [[Template:Script]]. We could probably save at least 40% on parser juice by skipping that and moving its fairly trivial functionality (that of choosing a css class and a [[Template:Script/styles hebrew.css|definition for same]], having already obtained an ISO 15924 code from [[Module:Unicode data/scripts|here]]) into some module. Note: I'd like to get away from using {{tl|script}} anyway if possible, for reasons outlined at [[Template talk:Script#Module:Unicode chart]]. ―[[special:contributions/cobaltcigs|cobaltcigs]] 20:21, 10 September 2019 (UTC)
4. {{done}} Keeping a count of reserved codepoints and rendering the "note" as plural/singular/blank will be a trivial step. I just didn't think of it. I do question whether the footnote system is the appropriate way to present this.
5. {{done}} My first version of the module actually did have a parameter accepting whole refs. I just took it out when I got the impression every existing template had the same two notes. I can put it back.
6. {{done}} <s>Preview of <code><nowiki>{{tl|unicode chart|name=Arabic Presentation Forms-A|version=12.0}}</nowiki></code> has them showing up as normally reserved codepoints (the default assumption based on [[Module:Unicode_data/names/00F|lines missing from here]]), rather than choking. If we want to give the "permanently reserved" codepoints a different background and auto-generate a footnote explaining what this means, we'd have to maintain a list of them somewhere. Does anything like this occur in other blocks?</s>
*These are in fact easily detectable. Disregard above comments. ―[[special:contributions/cobaltcigs|cobaltcigs]] 16:15, 11 September 2019 (UTC)
Also 6.
I'd be more immediately concerned about this cell-stretching monstrosity at [https://i.imgur.com/QzF7oVa.png U+FDFD], which seems to be a consequence of using {{tl|script}} in places where the original chart template does not.
7. Not sure yet. I did see some interesting suggestions [https://stackoverflow.com/q/26407896 here].
Line 94 ⟶ 95:
12. I think if they are going to be linked, they shouldn't be piped to something else unless the character itself is an [[mw:Manual:$wgLegalTitleChars|illegal title char]] and even then it shouldn't be linked to anything other than a title that paraphrases said character (e.g. <code><nowiki>[[Number sign|#]]</nowiki></code>). Making [[≅]] a disambiguation page (then piping the link to a more specific topic because linking to disambiguation pages is bad) was a mistake in my opinion. And nothing on [[Template:Unicode chart Letterlike Symbols|Letterlike Symbols]] should link to wikt. Probably only the CJK Ideographs and such (which represent whole words and where wikt has, or should have, a page of that exact title which Wikipedia will never have) should link to wikt. This could be added as a separate <code>link=wikt</code> mode.
12, continued. If the character title is a redirect to some other page (such as a list of emojis, or an article about the subject represented by some symbol), that's fine. Someday the character itself might become a separate article, which is also fine. The template need not know or care about that. I'm thinking a list of link aliases for bad-title chars (mapping <code>'#'</code> to <code>Number sign</code> and so on) would be a good solution. But only if we're going to be linking the characters at all, which is unclear.
13. {{done}} <s>I did keep the optional start/end parameters, because</s> I figured subdivision would be wanted in some blocks for reasons including hugeness. Note that these need not be multiples of 16.
The module will pad leftover cells accordingly with <code><nowiki><td class="excluded"></nowiki></code> which is currently styled the same as <code>class="reserved"</code> but this can be changed.
*<code>start</code>/<code>end</code> parameters have been scrapped in favor of a single <code>range</code> parameter which can contain multiple ranges (connected by hyphen or en dash, and separated from each other by comma, whitespace, the word "and", or in fact anything that's not a hex digit).
14 and 15. If the unicode block display names can't be made to exactly match the [[Module:Unicode data/blocks|"official" names]] in all cases, we'll need a (hopefully short) list of aliases. Adding a blocknamelink parameter which continues to default to <code>Blockname (Unicode chart)</code> if empty would be easy and sufficient. Let's try to avoid having three sets of names wherever possible. {{done}}
16. I don't see why not. See 13. ―[[special:contributions/cobaltcigs|cobaltcigs]] 18:20, 10 September 2019 (UTC)
Line 118 ⟶ 120:
3 Could the font just be a passed parm? Most charts don't use a specific font.
5 The following blocks have specific footnotes: [[Template:Emoji (Unicode block)]], [[Template:Unicode chart Hangul Jamo]], [[Template:Unicode chart Superscripts and Subscripts]], and [[Template:Unicode chart Sutton SignWriting]]. Additionally, blocks with non-characters have the "Black areas indicate noncharacters (code points that are guaranteed never to be assigned as encoded characters in the Unicode Standard)" footnote: [[Template:Unicode chart Arabic Presentation Forms-A]] and [[Template:Unicode chart Specials]]. And these blocks have deprecated notes: [[Template:Unicode chart General Punctuation]], [[Template:Unicode chart Khmer]], [[Template:Unicode chart Miscellaneous Technical]], [[Tags (Unicode block)]], and [[Template:Unicode chart Tibetan]].
6 {{done}} There are only 66 non-characters (https://www.unicode.org/faq/private_use.html#nonchar3) and Unicode has promised not to add any more. I think the black background is effective and would want to keep it. I think it's safer not to put non-characters themselves into the charts as they are "not normally interchanged with other users" (https://www.unicode.org/faq/private_use.html#nonchar2). The code points are U+FDD0-FDEF, FFFE-FFFF, 1FFFE-1FFFF, 2FFFE-2FFFF, 3FFFE-3FFFF, 4FFFE-4FFFF, 5FFFE-5FFFF, 6FFFE-6FFFF, 7FFFE-7FFFF, 8FFFE-8FFFF, 9FFFE-9FFFF, AFFFE-AFFFF, BFFFE-BFFFF, CFFFE-CFFFF, DFFFE-DFFFF, EFFFE-EFFFF, FFFFE-FFFFF, and 10FFFE-10FFFF.
8 The "Dashed Box Convention" is explained at https://www.unicode.org/versions/Unicode12.0.0/ch24.pdf#G8175. It's an oversight not having a note explaining this convention. It was added to match Unicode's charts. I think it's useful. Depending on the font, without the dashed box U+0602 is easily confusable with U+060E, U+1F1E6 looks the same as capital A, etc. As far as I know there's no way to determine which characters get a dashed box programmatically. As of version 12.1 it's used on U+0000-0020, 007F-00A0, 00AD, 034F, 0600-0605, 061C, 06DD, 070F, 08E2, 0CF1-0CF2, 0D4E, 0F0C, 1039, 115F-1160, 17B4-17B5, 17D2, 180B-180E, 1A60, 1BAB, 1CF5-1CF6, 2000-200F, 2011, 2028-202F, 205F-2064, 2066-206F, 2D7F, 2E3A-2E3B, 3000, 303E, 3164, AAF6, FE00-FE0F, FEFF, FFA0, FFF9-FFFB, 10A3F, 11003-11004, 1107F, 110BD, 110CD, 111C2-111C3, 11A3A, 11A47, 11A84-11A89, 11A99, 11D45-11D46, 11D97, 13430-13438, 16F8F-16F92, 1BC9D, 1BCA0-1BCA3, 1D159, 1D173-1D17A, 1DA9B-1DA9F, 1DAA1-1DAAF, 1F1E6-1F1FF, E0001, E0020-E007F, and E0100-E01EF.
10 Unicode charts use XXX (in a dotted box) for U+0080, 0081, and 0099 and I don't think Wikipedia's charts should contradict the cited source.
(For some arcane history of these three characters, I recommend http://unicode.org/pipermail/unicode/2015-October/002876.html) I think the only way of determining the abbreviations to use in the charts is a hardcoded table. They don't always match an alias. For example U+E007F is displayed as "END". A lot of the code points that use the dashed box convention display abbreviations. I haven't compiled a definitive list.
Line 125 ⟶ 127:
9 I think the current solution to control characters and invisible format characters is best, i.e. use the acronym or abbreviation in a dotted square, following the example of the official Unicode code charts. The new [[User:BabelStone/sandbox#Basic_Latin|Basic Latin]] and [[User:BabelStone/sandbox#Latin-1_Supplement|Latin-1 Supplement]] charts show the control codes as reserved which is incorrect (they are assigned, with the general category Cc, but do not have formal character names, although they do have formal character name aliases). I also notice that U+003D (=) and U+007C (|) do not display properly. [[User:BabelStone|BabelStone]] ([[User talk:BabelStone|talk]]) 11:02, 11 September 2019 (UTC)
Update: I've restored the <code>refs</code> parameter. Any refs inputted here will be numbered before the auto-generated refs. Perhaps I should also have it sanitize anything that's not actually a <code><nowiki><ref></nowiki></code> by wrapping it in a <code><nowiki><ref></nowiki></code> tag so it doesn't appear in the title bar.
* I've added a <code>range</code> parameter that allows multiple ranges to be specified. Potentially in the wrong order, even. Perhaps they should be force-sorted ascendingly. And sanitized to avoid duplication due to overlap.
* Black blocks were actually easy to detect. Previous code assumed anything containing "<" was <code><reserved-NNNN></code> when it can actually be <code><noncharacter-NNNN></code> or <code><control-NNNN></code>. Whoops. It's all right there in [[Module:Unicode data]].
Will work on control chars next. ―[[special:contributions/cobaltcigs|cobaltcigs]] 16:15, 11 September 2019 (UTC)
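For illustration only: a minimal Python sketch of how a multi-range <code>range</code> value could be parsed, force-sorted, and merged where ranges overlap, as suggested in the thread. This is not the Lua code in [[Module:Unicode chart]]; the names <code>parse_ranges</code> and <code>row_bounds</code> are hypothetical. It assumes the word "and" is replaced before scanning, since its letters a and d are themselves hex digits, and that touching ranges may be merged along with overlapping ones.

```python
import re

# One hex number, optionally followed by a hyphen or en dash and a second hex number.
HEX_RANGE = re.compile(r"([0-9A-Fa-f]+)(?:\s*[-\u2013]\s*([0-9A-Fa-f]+))?")


def parse_ranges(text):
    """Return sorted, non-overlapping (start, end) code point pairs found in text."""
    # Assumption: "and" is treated as a separator. It must be replaced first,
    # otherwise "a" and "d" would be read as the code points U+000A and U+000D.
    cleaned = re.sub(r"\band\b", ",", text, flags=re.IGNORECASE)

    pairs = []
    for low, high in HEX_RANGE.findall(cleaned):
        start = int(low, 16)
        end = int(high, 16) if high else start
        if start > end:  # tolerate a reversed range such as "0605-0600"
            start, end = end, start
        pairs.append((start, end))

    pairs.sort()  # force ascending order
    merged = []
    for start, end in pairs:
        if merged and start <= merged[-1][1] + 1:
            # Overlaps or touches the previous range: extend it instead of duplicating cells.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def row_bounds(start, end):
    """Widen a range to whole rows of 16; the extra cells would be the padded ones."""
    return start & ~0xF, end | 0xF


if __name__ == "__main__":
    for start, end in parse_ranges("U+0600-0605, 061C and 06DD"):
        print(f"{start:04X}-{end:04X}")
```

Merging after sorting keeps the logic to a single pass. Whether touching ranges should be merged or kept separate is a design choice the thread leaves open.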
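For illustration only: the 66 noncharacters listed in the thread follow a fixed pattern (U+FDD0-FDEF, plus the last two code points of each of the 17 planes), so they can be detected arithmetically without a maintained list. The Python sketch below shows that check, together with a hypothetical <code>classify</code> helper that tells apart the three placeholder labels quoted in the thread. The helper's name, the "assigned" fallback, and the exact label syntax beyond those three examples are assumptions, not the behaviour of [[Module:Unicode data]].

```python
import re


def is_noncharacter(cp):
    """True for U+FDD0-FDEF and for any code point ending in FFFE or FFFF."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


# Placeholder labels of the form quoted in the thread, e.g. <reserved-NNNN>.
# Assumption: the hex part is 4 to 6 uppercase digits.
PLACEHOLDER = re.compile(r"^<(reserved|noncharacter|control)-[0-9A-F]{4,6}>$")


def classify(label):
    """Return 'reserved', 'noncharacter' or 'control' for a placeholder label,
    and 'assigned' (a hypothetical fallback) for anything else."""
    match = PLACEHOLDER.match(label)
    return match.group(1) if match else "assigned"


if __name__ == "__main__":
    total = sum(1 for cp in range(0x110000) if is_noncharacter(cp))
    print(total)
    print(classify("<noncharacter-FDD0>"))
```

Testing the label's kind explicitly, rather than checking for a leading "<", avoids the mix-up described in the thread where every placeholder was treated as reserved.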

Module talk:Unicode chart: Difference between revisions