Module talk:Unicode chart

This is the talk page for discussing improvements to the Unicode chart module.

Notes about notes

Hi, one thing which I have been thinking about for a long time (several years) is to make the Unicode code chart templates expandable to show a list of all character names (and formal character name aliases). I think this would be very helpful to users, as at present the only way to find a character's name is to hover the mouse over the character cell whilst carefully avoiding the link that people so love to add to the characters; and the mouseover text is not copyable, so it is of limited use. I have made a rough mock-up of what I mean in my sandbox. What do you think? Please feel free to tweak or improve it. (I suggest that this approach not be applied to large blocks with algorithmic names.) BabelStone (talk) 20:42, 22 August 2019 (UTC)

@BabelStone: I think it's definitely doable if you think it's useful. And if you've been thinking about it for that long it's probably useful.

I changed the "Character names" title to "List of character names" to be painfully clear.

It is better.

Should the list be sortable? This involves adding a header. I've mocked it up in your sandbox. I'd skip this on the algorithmic ones though because the code point and name always sort the same.

Personally, I don't think sortable is particularly useful, but I don't mind.

I'm assuming aliases will use the same format as the current charts: FOO (alias BAR)

Seems reasonable.

Can we agree that there should be NO LINKED CHARACTERS in that list? If someone wants to link each character they can do so in the existing part of the chart as far as I'm concerned. Latin Extended-B is an example of this.

I fully agree that there should be no links in the names list.

There's likely to be some duplication between the template and the article text. Latin Extended-B again is a good example. I'm thinking that article text with a list of characters can be removed once this is in place so long as they don't add additional information. (I would count the decimal values provided in Latin Extended-B as not adding information except to anyone who doesn't know you can use &#xHHHH; notation.)

Yes.

Lastly, do we need to worry about added character counts for articles that include multiple charts? Could this cause them to exceed size limits?

Probably not because articles with multiple code chart templates are generally not for very large blocks, and the huge blocks with algorithmic names will only have a slight increase in size.

DRMcCreedy (talk) 22:24, 22 August 2019 (UTC)

I definitely think it is useful. If users want an overview of the character names, at present they have to click on the link to Unicode code charts or go to another website. (Other replies inline above) BabelStone (talk) 10:53, 23 August 2019 (UTC)

I made the Ogham table sortable, but when you sort by code point it does not sort in the expected order (hex values containing A..F are sorted separately from hex values comprising 0..9 only). We could overcome this by putting the code point in a {{sort}} template with a fixed-width decimal value as the hidden sort parameter, but this seems like too much trouble for a marginally useful feature. BabelStone (talk) 11:06, 23 August 2019 (UTC)
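
For reference, the fixed-width decimal sort key mentioned above could be generated along these lines. This is only a sketch: the function names are made up for illustration, the 7-digit width is chosen because U+10FFFF is 1114111 in decimal, and the {{sort|key|displayed text}} parameter order is an assumption that should be checked against the template documentation.

```python
def sort_key(label: str, width: int = 7) -> str:
    """Return a zero-padded decimal sort key for a 'U+XXXX' label."""
    if not label.upper().startswith("U+"):
        raise ValueError(f"not a code point label: {label!r}")
    # Parse the hex digits after "U+" and pad the decimal value so that
    # plain string comparison orders keys the same way as the numbers.
    return str(int(label[2:], 16)).zfill(width)


def sort_markup(label: str) -> str:
    """Wrap a label in {{sort}} markup (parameter order is assumed)."""
    return "{{sort|" + sort_key(label) + "|" + label + "}}"


labels = ["U+1690", "U+169A", "U+1680", "U+168F"]
ordered = sorted(labels, key=sort_key)
```

With keys like these, labels containing A..F and labels containing only 0..9 fall into a single numeric order.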

In light of that, let's ditch sorting. DRMcCreedy (talk) 16:07, 23 August 2019 (UTC)

Agreed. Here are a few more comments and questions I have before we start implementing the change to three hundred templates. BabelStone (talk) 16:55, 23 August 2019 (UTC)

For blocks with algorithmic character names I think it is best to list only the first and last assigned characters in the block. I currently put "..." between the two rows -- is that OK, or is there a better way of indicating omission of the intervening rows?

I noticed that and thought it was intuitive.

Many or most blocks have hard-coded fonts applied (in the template or using css) to the code chart glyphs (which I personally don't like). For the names list it is useful to put the character after the code point, but I don't want to hard-code the fonts to use, so I was thinking of not specifying fonts for the names list part of the table. What do you think?

I'm OK with this but anticipate others will want to add font info. I'd say let's leave font info off for now and see if there's push back.

Do we want to add any other core data for the characters? For example, we could provide a column for general category or script. Is that perhaps overkill?

I thought of that too. Probably overkill. My concern is there's almost no end of info we could add.

Should the List of character names go above or below the Notes? I'm happy with current placement below the notes, but maybe it makes more sense to put the notes at the very bottom.

I like the notes at the very bottom logically, but the list is probably easier to spot if we don't wedge it between the chart and the notes. So let's leave the list as the last item.

@BabelStone: DRMcCreedy (talk) 17:07, 23 August 2019 (UTC)
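
The names list as agreed above (code point, unlinked character with no font markup, then the name, with "..." standing in for the intervening rows of algorithmically named blocks) can be sketched roughly as follows. The function name and row layout are illustrative assumptions, not the template's actual design, and the names come from whatever Unicode version ships with the Python interpreter rather than from the chart templates.

```python
import unicodedata


def names_list(first: int, last: int, algorithmic: bool = False):
    """Return (code point label, character, name) rows for a block range."""
    rows = []
    for cp in range(first, last + 1):
        # name() returns the default (None) for code points without a name,
        # so unassigned code points are simply left out of the list.
        name = unicodedata.name(chr(cp), None)
        if name is not None:
            rows.append((f"U+{cp:04X}", chr(cp), name))
    if algorithmic and len(rows) > 2:
        # Keep only the first and last rows, with "..." marking the omission.
        return [rows[0], ("...", "", ""), rows[-1]]
    return rows


ogham = names_list(0x1680, 0x169F)
cjk = names_list(0x4E00, 0x4E0F, algorithmic=True)
```

Formal name aliases are not covered here; they would need a separate data source.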

Thanks for all the feedback. I think we're about there now, but I don't want to rush into making quite a large change to a large number of templates, so I'll sit on it for a week or so in case you or me or anyone else has any suggestions for improving how we do it. BabelStone (talk) 20:05, 23 August 2019 (UTC)

Sounds good. The only other question that's popped into my head is combining characters. Often in the chart we'll use a dotted circle (◌) or a space with them. I'm thinking if the purpose of the table is copy-and-paste, maybe we should skip that. Not sure I feel strongly either way but that should be nailed down before the charts are created. DRMcCreedy (talk) 20:43, 23 August 2019 (UTC)

I've added an example for a block with combining characters (Combining Diacritical Marks for Symbols), with plain characters for the first row and prefixed with nbsp for the second row. The unprefixed characters do not look good as they straddle the code point column, so I think prefixing with nbsp is best (I don't like the dotted circle as that often interferes with the combining mark, and makes it difficult to see clearly). BabelStone (talk) 11:24, 24 August 2019 (UTC)
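
A minimal sketch of the NBSP prefixing described above, assuming that "combining character" is taken to mean any character whose general category is a mark (Mn, Mc or Me). The function name is hypothetical, and the existing charts may decide this case by case rather than by category.

```python
import unicodedata

NBSP = "\u00a0"  # NO-BREAK SPACE, used as the base for combining marks


def display_form(ch: str) -> str:
    """Prefix combining marks with NBSP; return other characters as-is."""
    if unicodedata.category(ch).startswith("M"):
        return NBSP + ch
    return ch


# U+20D0 COMBINING LEFT HARPOON ABOVE, from Combining Diacritical Marks for Symbols
harpoon_cell = display_form("\u20d0")
letter_cell = display_form("A")
```

Anyone copying from such a cell would get the NBSP along with the mark, which bears on the copy-and-paste concern raised earlier in the thread.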

I've also added an example with a character name alias. BabelStone (talk) 12:58, 24 August 2019 (UTC)

Looks good. I like the linked "alias". DRMcCreedy (talk) 15:56, 24 August 2019 (UTC)

Comments

I'm not convinced the "Notes" section at the bottom is worth the space it takes up; I only added it as a proof-of-concept gesture to mimic the existing layout convention. A collapsible section at the bottom (show/hide, just like the section above) with an additional list/table of character info (one per line) would certainly be feasible and would only require a few more lines of code. Its sheer size on screen would be the primary concern, because expanding it would displace other page content, possibly including wrapped text or floating images (unlike navboxes, which occupy 100% width at the very bottom).
We should just give first and last rows for blocks with character names derived from code points (CJK, Tangut, Nushu, ...), so the largest block is Hangul Syllables with 11,184 code points, which I agree is too long for this approach. But the next biggest blocks are Yi Syllables (1,168), Egyptian Hieroglyphs (1,072), Mathematical Alphanumeric Symbols (1,024), and Cuneiform (1,024), which I think should be acceptable if the names list is initially hidden. I don't see that displacement of other text and images would be an issue, especially as the code charts are mostly only used in the corresponding Unicode Block name articles. BabelStone (talk) 11:33, 10 September 2019 (UTC)
One intuitive solution would be to mimic typical charmap program behavior by using a Javascript click handler on each character cell that populates the footer area (of about the same size as the "Notes" section, maybe slightly smaller) with the cursor-selectable name of the last clicked-upon codepoint, plus its &escapecode; and any additional info we care to pull from Module:Unicode data (replacing any previous content). I could whip up a demo for that in the next few days. I just worry that it might be too interactive to be widely accepted.
Nice idea but I am also concerned that turning Wikipedia into an app is a step too far. I'd like to see a prototype of it though. BabelStone (talk) 11:33, 10 September 2019 (UTC)
A third approach might be to render the entire list (of names and whatnot) in a vertically scrollable footer panel containing "section" links, such that clicking on the character cell would cause the footer to scroll to and highlight (similar behavior to reflist anchors) the appropriate line. This might be even less popular.
I think this is the best solution, regardless of WP:SCROLL. Only 50 blocks with non-algorithmic character names have more than 128 code points, so if we make the scroll window 128 rows only the 50 largest blocks will be affected. BabelStone (talk) 11:33, 10 September 2019 (UTC)
On the other hand, some philosophies may have changed over the years. I mean, we do have interactive scrolling maps that pop up in a fullscreen div now (see example).
I haven't formed any opinion yet on how to handle combining character positioning, other than "oh god, I hope it's something other than &nbsp;" lol.
Personally I prefer NBSP as the base for combining characters as dotted circle (which we currently use) often interferes with the character. BabelStone (talk) 11:33, 10 September 2019 (UTC)

―cobaltcigs 17:55, 9 September 2019 (UTC)

Update/to-do

See Template:Unicode chart/testcases.

I've reduced the number of required parameters to only the name of the block and the version string. In reality, the former can probably be deduced (from the name of the calling template), and the latter should be exposed by Module:Unicode data in some fashion (to avoid hard-coding 12.0 on any other page) and should be updated as frequently as the data subpages are updated.
I've got it looking up the ISO 15924 and using that to select a <span> from Template:Script containing a css class for an appropriate font-family. Better would be a way to apply the class and dir attributes directly to the <td> element.
Start/end codepoints still exist as an option. The looked-up values can be overridden to subdivide a large block without confusing the module.
~~I need to work out why it gives an error at line 38: bad argument #2 to 'format' (string expected, got nil) but only for some block names.~~
- It was because the Module:Unicode data/scripts.ranges table skips certain chars, including the Ⴧ and Ⴭ in Georgian. Added a workaround. ―cobaltcigs 22:10, 9 September 2019 (UTC)

―cobaltcigs 20:49, 9 September 2019 (UTC)
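
The kind of failure described above (a range lookup returning nothing for a code point that falls in a gap, which then breaks string formatting) and one possible guard against it can be sketched as below. The workaround actually added to the module is not described in this thread. Everything here is a hypothetical stand-in: the two tables, their contents, the function names and the "script-" class prefix are illustrative only and do not reproduce Module:Unicode data/scripts.

```python
from bisect import bisect_right

# Hypothetical data: (start, end, ISO 15924 code), sorted by start.
# The gap between U+10C5 and U+10D0 mimics the skipped characters.
SCRIPT_RANGES = [
    (0x10A0, 0x10C5, "Geor"),
    (0x10D0, 0x10FA, "Geor"),
]
# Hypothetical table for individually listed code points.
SCRIPT_SINGLES = {0x10C7: "Geor", 0x10CD: "Geor"}

_STARTS = [start for start, _, _ in SCRIPT_RANGES]


def lookup_script(cp: int, default: str = "Zzzz") -> str:
    """Return a script code for cp, never None."""
    if cp in SCRIPT_SINGLES:
        return SCRIPT_SINGLES[cp]
    # Find the last range whose start is <= cp, then check its end.
    i = bisect_right(_STARTS, cp) - 1
    if i >= 0 and cp <= SCRIPT_RANGES[i][1]:
        return SCRIPT_RANGES[i][2]
    # Gap in the data: fall back instead of passing None to a formatter.
    return default


def font_class(cp: int) -> str:
    """Build a CSS class name; safe because lookup_script always returns a string."""
    return "script-{}".format(lookup_script(cp))
```

The fallback value "Zzzz" is the ISO 15924 code for an uncoded script; any default that the downstream formatting can handle would do.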

Existing charts

Interesting approach to create the Unicode code charts dynamically but I have many questions. Most only apply if this module is intended to replace the existing chart templates...

What problem is this new approach solving? Is it just duplicating/replacing the existing templates? If not, what will this module be used for?
Do the charts get created every time they're displayed? If so, do we care about the extra processing incurred?
How to handle fonts? I saw the post at Template talk:Script#Module:Unicode chart and the notes above so I know this is a known issue.
How to handle a varying number of reserved characters? The current charts leave off the "Gray areas" notice if there are no non-assigned code points because having the "gray areas" notice for those blocks would be confusing. And the wording changes if there is only one non-assigned code point.
How to handle charts with additional footnotes? For example, Template:Unicode chart Arabic. And for the existing charts, the notes are indeed valuable.
How to handle non-characters? For example, U+FDD0-FDEF in Template:Unicode chart Arabic Presentation Forms-A.
How to handle combining marks (which are referenced above)? Some charts have special additions for some combining characters. For example, U+A980 in Template:Unicode chart Javanese uses a dotted circle. Other combining marks, like U+1D242 in Template:Unicode chart Ancient Greek Musical Notation use a non-breaking space. Some combining marks use no additional character at all.
How to handle characters with dashed boxes? For example, U+0600-0605, 061C, and 06DD in the Template:Unicode chart Arabic chart.
How to handle control(ish) characters where we don't want the actual character in the chart? For example, U+061C in the Template:Unicode chart Arabic chart, and more obviously, control characters in Template:Unicode chart C0 Controls and Basic Latin and Template:Unicode chart C1 Controls and Latin-1 Supplement.
How to create character name aliases? See U+061C in Template:Unicode chart Arabic and the control characters in Template:Unicode chart C0 Controls and Basic Latin and Template:Unicode chart C1 Controls and Latin-1 Supplement.
How to handle block-specific formatting? For example Template:Unicode chart Javanese has a specific height and some of the characters in Template:Unicode chart Control Pictures use a different font size.
How to handle character links? Like @BabelStone:, I'm not a fan of linking specific characters (but others are). It looks like your code, optionally, will link every character if an article exists, but this could increase the number of linked characters. And many characters aren't linked to the character itself, like U+2245 in Template:Unicode chart Mathematical Operators. Some link to wikt, like U+2105 in Template:Unicode chart Letterlike Symbols and all the characters in Template:Unicode chart CJK Unified Ideographs Extension A.
Some blocks have special parameters that need to be taken into account: Template:Unicode chart Alphabetic Presentation Forms, Template:Unicode chart Enclosed Alphanumeric Supplement, Template:Unicode chart Enclosed CJK Letters and Months, Template:Unicode chart Halfwidth and Fullwidth Forms, Template:Unicode chart Miscellaneous Symbols, and Template:Unicode chart Supplemental Symbols and Pictographs. As with most of these questions, this only applies if you're replacing existing chart templates.
How to determine the chart name? Most charts use the block name for the title but some don't. For example, "C0 Controls and Basic Latin" is the chart name for the "Basic Latin" block.
How to determine what to link the chart name to? For example, the Template:Unicode chart Kangxi Radicals chart links to "Kangxi radical#Unicode". Most either link to the block name itself or the block name with "(Unicode block)" appended.
Will the new approach be used for the list charts that make up List of CJK Unified Ideographs, part 1 of 4 and List of CJK Unified Ideographs Extension B (Part 1 of 7)?

DRMcCreedy (talk) 04:51, 10 September 2019 (UTC)

1. Consistency of format, avoidance of stupidity like this.
2 (and 3). Here are four profiler outputs for the testcases page. Note that this is the total churning of five {{unicode chart}}s transcluded on the same page (indirectly through the {{test case}} template/module in fact). Even with those factors the processing stats are at a small fraction of allowable limits in every case except for ifexist (which should probably be the first feature taken out). Actual overhead in the wild would be lower. Based on the percentages at the bottom, it looks like the single worst bottleneck is the grand #switch statement at Template:Script. We could probably save at least 40% on parser juice by skipping that and moving its fairly trivial functionality (that of choosing a css class and a definition for same, having already obtained an ISO 15924 code from here) into some module.
4. Keeping a count of reserved codepoints and rendering the "note" as plural/singular/blank will be a trivial step. I just didn't think of it. I do question whether the footnote system is the appropriate way to present this.
5. My first version of the module actually did have a parameter accepting whole refs. I just took it out when I got the impression every existing template had the same two notes. I can put it back.
6. Preview of {{tl|unicode chart|name=Arabic Presentation Forms-A|version=12.0}} has them showing up as normally reserved codepoints (the default assumption based on lines missing from here), rather than choking. If we want to give the "permanently reserved" codepoints a different background and auto-generate a footnote explaining what this means, we'd have to maintain a list of them somewhere. Does anything like this occur in other blocks?
Also 6. I'd be more immediately concerned about this cell-stretching monstrosity at U+FDFD, which seems to be a consequence of using {{script}} in places where the original chart template does not.
7. Not sure yet. I did see some interesting suggestions here.
8. Depends on what the rationale is for drawing these boxes, and whether it can be detected in any way from Unicode data. Or whether it needs to be listed elsewhere as a special case. Or whether the boxes are needed at all. I don't see a footnote explaining what the boxes even indicate. No hints on my own system either.
9 and 10. Each of the display-aliased characters in the templates you mentioned returns false for the Module:Unicode data function .is_printable(n), except for U+0020 SPACE and U+00A0 NO-BREAK SPACE, which return true for .is_whitespace(n). So both of these traits can easily be tested. Choosing the replacement alias we want would require maintaining a list of same. I'm not sure a printable space character should be aliased in this manner. Maybe the cell background should be a different color with a footnote explaining yes, a whitespace character is there, and yes, you can copy it and paste it elsewhere. Also not sure "XXX" is appropriate for U+0080–0081. Maybe we want to display "PAD" and "HOP" instead?
11. The existing chart for Javanese shows up with a cell height of 80px which seems excessive for the apparent line height of 33px on my screen. Preview of module output for Javanese looks fine. Better in my opinion. Maybe I just don't have the right fonts installed. But yes, cell height/width params can be added if there's a demonstrated need for this. Otherwise the browser should be trusted to stretch cells for large characters as needed. See "Also 6" above.
12. I think if they are going to be linked, they shouldn't be piped to something else unless the character itself is an illegal title char, and even then it shouldn't be linked to anything other than a title that paraphrases said character (e.g. [[Number sign|#]]). Making ≅ a disambiguation page (then piping the link to a more specific topic because linking to disambiguation pages is bad) was a mistake in my opinion. And nothing on Letterlike Symbols should link to wikt. Probably only the CJK Ideographs and such (which represent whole words and where wikt has, or should have, a page of that exact title which Wikipedia will never have) should link to wikt. This could be added as a separate link=wikt mode.
12, continued. If the character title is a redirect to some other page (such as a list of emojis, or an article about the subject represented by some symbol), that's fine. Someday the character itself might become a separate article, which is also fine. The template need not know or care about that. I'm thinking a list of link aliases for bad-title chars (mapping '#' to Number sign and so on) would be a good solution. But only if we're going to be linking the characters at all, which is unclear.
13. I did keep the optional start/end parameters, because I figured subdivision would be wanted in some blocks for reasons including hugeness. Note that these need not be multiples of 16. The module will pad leftover cells accordingly with <td class="excluded"> which is currently styled the same as class="reserved" but this can be changed.
14 and 15. If the Unicode block display names can't be made to exactly match the "official" names in all cases, we'll need a (hopefully short) list of aliases. Adding a blocknamelink parameter which continues to default to Blockname (Unicode block) if empty would be easy and sufficient. Let's try to avoid having three sets of names wherever possible.
16. I don't see why not. See 13.

―cobaltcigs 18:20, 10 September 2019 (UTC)
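
Points 4 and 13 above (counting reserved code points to choose a plural, singular or blank note, and padding a start/end range that is not a multiple of 16 with "excluded" cells) can be sketched together as follows. This is not the module's code: the function names, the "assigned" class and the exact note wording are assumptions for illustration, and only the "reserved" and "excluded" class names are taken from the discussion.

```python
def build_rows(start: int, end: int, assigned: set):
    """Return rows of 16 (code point, css class) cells covering start..end."""
    first = start - start % 16  # round down to the start of the row
    last = end | 0xF            # round up to the end of the row
    rows = []
    for row_start in range(first, last + 1, 16):
        row = []
        for cp in range(row_start, row_start + 16):
            if cp < start or cp > end:
                cls = "excluded"   # padding outside the requested range
            elif cp in assigned:
                cls = "assigned"
            else:
                cls = "reserved"   # inside the range but not assigned
            row.append((cp, cls))
        rows.append(row)
    return rows


def reserved_note(rows) -> str:
    """Pick blank, singular or plural wording from the reserved count."""
    count = sum(1 for row in rows for _, cls in row if cls == "reserved")
    if count == 0:
        return ""
    if count == 1:
        return "Gray area indicates non-assigned code point"
    return "Gray areas indicate non-assigned code points"


# A made-up sub-range with exactly one unassigned code point (U+1692 here).
rows = build_rows(0x1683, 0x1692, set(range(0x1683, 0x1692)))
note = reserved_note(rows)
```

Counting from the rendered cells means the note stays in step with whatever the chart actually shows, including sub-ranges produced by the start/end parameters.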