Unicode character property: Difference between revisions

expanded section Casing: and other sections
m switch to a proper ref using https instead of ftp
 
(15 intermediate revisions by 7 users not shown)
Line 1:
{{Short description|Unicode code point property names and their uses}}
{{Use British English|date=January 2025}}
The [[Unicode Standard]] assigns various properties to each Unicode character and [[code point]].<ref name="Chapter4">{{cite web|url=https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-4/|date=September 2024|title=The Unicode Standard Version 16 |publisher=The Unicode Consortium |access-date=2024-09-13}}</ref><ref name="UAX44" />
 
Line 23:
 
=={{anchor|Name}}Name and alias==
A Unicode character is assigned a unique ''Name'' (na).<ref name="Chapter4"/> The name is composed of uppercase letters A–Z, digits 0–9, [[hyphen-minus]] and [[Space (punctuation)|space]]. Some sequences are not allowed: a name may not begin or end with a space or hyphen, contain repeated spaces or hyphens, or contain a space after a hyphen. The name is guaranteed to be unique within Unicode, and can be used to identify a code point and its character. Ideographic characters, of which there are tens of thousands, are named in the pattern "{{Smallcaps|{{lc:CJK UNIFIED IDEOGRAPH}}}}-''hhhh''", where ''hhhh'' is the code point in hexadecimal. For example, {{unichar|4E00}}. Formatting characters also have names: {{unichar|00A0}}.
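The naming rules above can be sketched in Python. This is a minimal illustration, not a normative check: <code>is_well_formed_name</code> is a hypothetical helper that encodes only the rules as stated in this section, while <code>unicodedata.name</code> is the standard-library lookup for the Name property.

```python
import re
import unicodedata

# Allowed repertoire: uppercase A-Z, digits 0-9, space and hyphen-minus.
_ALLOWED = re.compile(r"[A-Z0-9 -]+")


def is_well_formed_name(name: str) -> bool:
    """Check a candidate name against the rules stated in this section.

    Hypothetical helper for illustration only; it does not consult the
    Unicode Character Database.
    """
    if not _ALLOWED.fullmatch(name):
        return False
    # A name may not begin or end with a space or hyphen.
    if name[0] in " -" or name[-1] in " -":
        return False
    # No repeated spaces or hyphens, and no space after a hyphen.
    return not any(bad in name for bad in ("  ", "--", "- "))


# Names reported by the standard library for the two examples above.
ideograph_name = unicodedata.name("\u4e00")
nbsp_name = unicodedata.name("\u00a0")
```

Here <code>ideograph_name</code> follows the "CJK UNIFIED IDEOGRAPH-''hhhh''" pattern, and both looked-up names pass the hypothetical check.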
 
Code points in the following categories do not have a Name value assigned: Controls (General Category: Cc), Private use (Co), Surrogate (Cs), Non-characters (Cn) and Reserved (Cn). They may be referenced, informally, by a generic or specific meta-name, called a "Code Point Label": {{not a typo|<control>, <control-0088>, <reserved>, <noncharacter-''hhhh''>, <private-use-''hhhh''>, or <surrogate>}}. Because these labels contain "<" and ">", which cannot occur in a Name, a label can never be confused with a Name.
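As a rough sketch of how the specific labels could be derived, the following Python function builds a label from the General Category reported by the standard <code>unicodedata</code> module. The function name <code>code_point_label</code> is an assumption made for this example, and the result for reserved code points depends on the Unicode version bundled with the Python build.

```python
import unicodedata


def code_point_label(cp: int) -> str:
    """Return a specific code point label such as <control-0088>.

    Hypothetical helper: raises ValueError for code points that are
    expected to have a Name instead of a label.
    """
    category = unicodedata.category(chr(cp))
    if category == "Cc":
        kind = "control"
    elif category == "Co":
        kind = "private-use"
    elif category == "Cs":
        kind = "surrogate"
    elif category == "Cn":
        # Noncharacters are U+FDD0..U+FDEF and the last two code points
        # of every plane; every other Cn code point is reserved.
        is_nonchar = 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE
        kind = "noncharacter" if is_nonchar else "reserved"
    else:
        raise ValueError(f"U+{cp:04X} has category {category}, not a label")
    return f"<{kind}-{cp:04X}>"
```

For instance, <code>code_point_label(0x88)</code> yields the <control-0088> label mentioned above.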
Line 64:
 
===Casing===
The Case value is normative in Unicode. It pertains to those scripts that have uppercase and lowercase letters. Case distinctions occur in the Adlam, Armenian, Cherokee, Coptic, Cyrillic, Deseret, Garay, Glagolitic, Greek, Khutsuri and Mkhedruli Georgian, Latin, Medefaidrin, Old Hungarian, Osage, Vithkuqi and Warang Citi scripts.
 
<!--(upper, lower, title, folding—both simple and full)-->
Line 76:
In Greek, the letter sigma has different lowercase forms depending on where it is in a word. {{Unichar|03a3}} converts to {{Unichar|03c3}} if it is at the start or middle of a word, and converts to {{Unichar|03c2}} if it is at the end of a word.
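The context-dependent sigma rule can be observed with Python's built-in <code>str.lower()</code>, which applies it. The sample word is an arbitrary choice for this illustration.

```python
# "ΣΟΦΟΣ": capital sigma at the start and at the end of the word.
word = "\u03a3\u039f\u03a6\u039f\u03a3"

# str.lower() uses U+03C3 for the initial sigma and U+03C2 (final
# sigma) for the one that ends the word.
lowered = word.lower()

first_sigma = lowered[0]   # sigma at the start of the word
last_sigma = lowered[-1]   # sigma at the end of the word
```

The two lowercase results differ even though both come from the same uppercase letter, so the mapping cannot be expressed as a simple one-to-one table.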
 
In Lithuanian, the dot in lowercase i and j is preserved when followed by accents. For example: Í in lowercase is i̇́.<ref>{{Cite web|url=https://www.unicode.org/Public/UCD/latest/ucd/SpecialCasing.txt|title=Unicode Character Database: Special Casing Data|date=2024-05-10}}</ref>
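Language-specific rules such as this one are not applied by locale-independent case conversion. The Python sketch below contrasts the built-in <code>str.lower()</code> with a deliberately tiny, hypothetical <code>lower_lt</code> helper that covers only the single example given above; a real implementation would need the full set of Lithuanian conditions from the special casing data.

```python
# Hypothetical one-entry table: capital I with acute lowercases to
# i + combining dot above (U+0307) + combining acute accent (U+0301).
LT_LOWER = {"\u00cd": "i\u0307\u0301"}


def lower_lt(text: str) -> str:
    """Lowercase text, applying the one Lithuanian mapping shown above.

    Illustration only: every other character falls back to the
    locale-independent str.lower().
    """
    return "".join(LT_LOWER.get(ch, ch.lower()) for ch in text)


default_result = "\u00cd".lower()       # locale-independent lowercase
lithuanian_result = lower_lt("\u00cd")  # dot above is kept
```

The default result is the single precomposed letter U+00ED, whereas the Lithuanian result is a three-code-point sequence that retains the dot.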
 
Despite the existence of {{Unichar|1E9E}}, the full uppercase mapping of {{Unichar|00DF}} is "SS".
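This asymmetry is visible in Python's built-in string methods, which use the full case mappings. The variable names are chosen for this illustration only.

```python
sharp_s = "\u00df"          # LATIN SMALL LETTER SHARP S
capital_sharp_s = "\u1e9e"  # LATIN CAPITAL LETTER SHARP S

# Uppercasing the small letter expands it to two characters.
upper_of_small = sharp_s.upper()

# Lowercasing the capital letter gives the small sharp s.
lower_of_capital = capital_sharp_s.lower()

# A round trip through uppercase therefore does not restore the input.
round_trip = sharp_s.upper().lower()
```

Because the uppercase form is two letters, converting to uppercase and back produces "ss" rather than the original character.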