updated version →Character name alias |
Drmccreedy (talk | contribs) m switch to a proper ref using https instead of ftp |
(17 intermediate revisions by 8 users not shown) | |||
Line 1:
{{Short description|Unicode code
{{Use British English|date=January 2025}}
The [[Unicode Standard]] assigns various properties to each Unicode character and [[code point]].<ref name="Chapter4">{{cite web|url=https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-4/|date=September 2024|title=The Unicode Standard Version 16 |publisher=The Unicode Consortium |access-date=2024-09-13}}</ref><ref name="UAX44" />
These properties govern how characters (code points) are handled in processes such as line breaking, right-to-left script direction, or the application of controls. Some "character properties" are also defined for code points that have no character assigned and code points that are
Properties have levels of forcefulness: normative, informative, contributory, or provisional. For simplicity of specification, a character property can be assigned by specifying a contiguous range of code points that have the same property.<ref>{{cite web|url=https://www.unicode.org/reports/tr44/#Code_Point_Ranges|title=Unicode Standard Annex #44: Unicode Character Database, 4.2.3 Code Point Ranges|website=Unicode |date=2024-08-27}}</ref>
Line 11 ⟶ 12:
[code];[name];[gc];[cc];[bc];[decomposition];[nv-dec];[nv-dig];[nv-num];[bm];[alias];;[upper case];[lower case];[title case]
*
*
*
*
*<code>decomposition</code> type or <mapping> = letter + diacritic, ligature X Y, superscript X, font X, initial X, medial X, final X, isolated X, vertical X, etc.
*
*
The property between
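A minimal sketch of how one record in the semicolon-separated layout shown above could be split into named fields. The field labels mirror that layout; the label <code>iso_comment</code> for the unnamed twelfth field and the helper name <code>parse_record</code> are assumptions made for illustration, not names used by the Unicode Standard.

```python
# Field labels follow the record layout quoted above. The twelfth field is
# unnamed there; "iso_comment" is a label chosen here for illustration.
FIELDS = [
    "code", "name", "gc", "cc", "bc", "decomposition",
    "nv_dec", "nv_dig", "nv_num", "bm", "alias", "iso_comment",
    "upper", "lower", "title",
]


def parse_record(line: str) -> dict[str, str]:
    """Split one semicolon-separated record into a field-name -> value dict."""
    values = line.rstrip("\n").split(";")
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    return dict(zip(FIELDS, values))


# The record for U+0041 as it appears in UnicodeData.txt.
record = parse_record("0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;")
code_point = int(record["code"], 16)  # the first field is a hexadecimal number
print(hex(code_point), record["name"], record["gc"], record["lower"])
```

Empty fields stay as empty strings, so a caller can tell "no lowercase mapping" apart from a malformed record.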
=={{anchor|Name}}Name and alias==
A Unicode character is assigned a unique ''Name'' (na).<ref name="Chapter4"/> The name is composed of uppercase letters A–Z, digits 0–9, [[hyphen-minus]] and [[
The following
==={{anchor|Version 1.0 names}}Unicode 1.0 names===▼
▲The following classes of code point do not have a Name (na=""): Controls (General Category: Cc), Private use (Co), Surrogate (Cs), Non-characters (Cn) and Reserved (Cn). They may be referenced, informally, by a generic or specific meta-name, called "Code Point Labels": {{not a typo|<control>, <control-0088>, <reserved>, <noncharacter-''hhhh''>, <private-use-''hhhh''>, or <surrogate>}}. Since these labels contain <>-brackets, they can never appear as a Name, which prevents confusion.
In version 2.0 of Unicode, many names were changed. From then on the rule "a name will never change" came into effect, including the strict (normative) use of alias names. Disused
For example, {{Unichar|264}} has the Unicode 1.0 name "LATIN SMALL LETTER BABY GAMMA".
▲===Version 1.0 names===
▲In version 2.0 of Unicode, many names were changed. From then on the rule "a name will never change" came into effect, including the strict (normative) use of alias names. Disused version 1.0-names were moved to the property Alias, to provide some backward compatibility.
===Character name alias===
{{main|Unicode alias names and abbreviations}}
Starting from Unicode
In addition to character name aliases which are corrections to defective character names, some characters are assigned aliases which are alternative names or abbreviations. Five types of character name aliases are defined in the Unicode Standard:
Line 43:
* Abbreviation: Abbreviations or acronyms for control codes, format characters, spaces, and variation selectors.
All formal character name aliases follow the rules for permissible character names, and are guaranteed to be unique within both the character name alias and the character name namespaces (for this reason, the ISO 6429 name "BELL" is not defined as an alias for {{unichar|0007}} because U+1F514 is named "BELL"; U+0007 instead has the alias
As of Unicode
Apart from these normative names, ''informal names'' may be shown in the Unicode code charts. These are other commonly used names for a character, and do not have the same character restriction. These informal names are not guaranteed to be unique, and may be changed or removed in later versions of the standard.
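As an illustration of the difference between a Name and a formal alias, the following sketch uses Python's standard <code>unicodedata</code> module, which resolves formal aliases in <code>lookup()</code> (since Python 3.3) but reports only the Name in <code>name()</code>. The helper name <code>character_name</code> is hypothetical, and the results depend on the Unicode data bundled with the Python build.

```python
import unicodedata


def character_name(ch: str) -> str:
    """Return the Name (na) of ch, or "" for code points that have none."""
    return unicodedata.name(ch, "")


# A graphic character has a Name; a control character (Cc) does not.
print(character_name("A"))
print(repr(character_name("\x08")))

# lookup() accepts formal aliases as well as Names.
backspace = unicodedata.lookup("BACKSPACE")      # control alias of U+0008
bom = unicodedata.lookup("BYTE ORDER MARK")      # alternate alias of U+FEFF

# name() still reports the Name, not the alias.
print(character_name(bom))
```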
Line 60:
===Whitespace===
{{main|Whitespace character}}
'''Whitespace''' is a commonly used concept for a typographic effect. It covers invisible characters that have a spacing effect in rendered text, including [[Space (punctuation)|spaces]], tabs, and newline formatting controls. In Unicode, such a character has the property set
{{Whitespace (Unicode)|state=collapsed}}
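A membership test for the whitespace property can be sketched as a lookup in a small range table. The ranges below are transcribed by hand from the <code>White_Space</code> entries in <code>PropList.txt</code> and should be checked against the current data file; the names <code>WHITE_SPACE_RANGES</code> and <code>is_white_space</code> are hypothetical.

```python
# Inclusive code point ranges carrying the White_Space property
# (an assumption: transcribed by hand, verify against PropList.txt).
WHITE_SPACE_RANGES = [
    (0x0009, 0x000D), (0x0020, 0x0020), (0x0085, 0x0085), (0x00A0, 0x00A0),
    (0x1680, 0x1680), (0x2000, 0x200A), (0x2028, 0x2029), (0x202F, 0x202F),
    (0x205F, 0x205F), (0x3000, 0x3000),
]


def is_white_space(ch: str) -> bool:
    """Return True if the single character ch falls in one of the ranges."""
    cp = ord(ch)
    return any(first <= cp <= last for first, last in WHITE_SPACE_RANGES)


total = sum(last - first + 1 for first, last in WHITE_SPACE_RANGES)
print(total, is_white_space("\u00a0"), is_white_space("\u200b"))
```

Note that U+200B ZERO WIDTH SPACE is not in the table, despite its name. Python's built-in <code>str.isspace()</code> is defined differently and does not match this property exactly.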
===Casing===
The Case value is
<!--(upper, lower, title, folding—both simple and full)-->▼
Different languages have different case mapping rules.
In Turkish, {{Unichar|0069}} corresponds to {{Unichar|0130}} instead of {{Unichar|0049}}. Similarly, {{Unichar|0049}} corresponds to {{Unichar|0131}} instead of {{Unichar|0069}}.
In [[Nawdm]], the lowercase form of the letter Ĥ is ɦ, whereas the usual case pairs are Ĥ/ĥ and Ɦ/ɦ.
In Greek, the letter sigma has different lowercase forms depending on where it is in a word. {{Unichar|03a3}} converts to {{Unichar|03c3}} if it is at the start or middle of a word, and converts to {{Unichar|03c2}} if it is at the end of a word.
In Lithuanian, the dot in lowercase i and j is preserved when followed by accents. For example: Í in lowercase is i̇́.<ref>{{Cite web|url=https://www.unicode.org/Public/UCD/latest/ucd/SpecialCasing.txt|title=Unicode Character Database: Special Casing Data|date=2024-05-10}}</ref>
Despite the existence of {{Unichar|1E9E}}, the default uppercase mapping of {{Unichar|00DF}} is "SS".
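Some of these mappings can be observed with Python's built-in string methods, which apply the language-independent default mappings only. This is a sketch under that assumption: the Turkish, Nawdm and Lithuanian tailorings described above are not applied by these methods and would need a locale-aware library.

```python
# Default (untailored) case mappings as implemented by Python's str methods.

# U+00DF uppercases to two letters, not to U+1E9E.
sharp_s_upper = "\u00df".upper()

# Capital sigma lowercases to final sigma (U+03C2) at the end of a word.
# The input is the Greek word spelled U+039F U+0394 U+039F U+03A3.
final_sigma = "\u039f\u0394\u039f\u03a3".lower()

# Without Turkish tailoring, U+0069 uppercases to U+0049 ...
plain_i_upper = "i".upper()

# ... and U+0130 lowercases to "i" followed by U+0307 COMBINING DOT ABOVE.
dotted_capital_i_lower = "\u0130".lower()

print(sharp_s_upper, final_sigma, plain_i_upper, len(dotted_capital_i_lower))
```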
Unicode encodes 31 titlecase characters.
* {{Unichar|01C5}}
* {{Unichar|01C8}}
* {{Unichar|01CB}}
* {{Unichar|01F2}}
* {{Unichar|1F88}}
* {{Unichar|1F89}}
* {{Unichar|1F8A}}
* {{Unichar|1F8B}}
* {{Unichar|1F8C}}
* {{Unichar|1F8D}}
* {{Unichar|1F8E}}
* {{Unichar|1F8F}}
* {{Unichar|1F98}}
* {{Unichar|1F99}}
* {{Unichar|1F9A}}
* {{Unichar|1F9B}}
* {{Unichar|1F9C}}
* {{Unichar|1F9D}}
* {{Unichar|1F9E}}
* {{Unichar|1F9F}}
* {{Unichar|1FA8}}
* {{Unichar|1FA9}}
* {{Unichar|1FAA}}
* {{Unichar|1FAB}}
* {{Unichar|1FAC}}
* {{Unichar|1FAD}}
* {{Unichar|1FAE}}
* {{Unichar|1FAF}}
* {{Unichar|1FBC}}
* {{Unichar|1FCC}}
* {{Unichar|1FFC}}
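The list above can be reproduced by scanning for the General Category <code>Lt</code> (titlecase letter). This sketch relies on Python's <code>unicodedata</code> module, so the result reflects the Unicode version bundled with the interpreter (<code>unicodedata.unidata_version</code>) rather than any particular release of the standard.

```python
import sys
import unicodedata

# Collect every code point whose General Category is Lt (titlecase letter).
titlecase_letters = [
    chr(cp)
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) == "Lt"
]

print(unicodedata.unidata_version, len(titlecase_letters))

# str.title() uses the titlecase mapping: U+01C6 maps to U+01C5, not U+01C4.
dz_title = "\u01c6".title()
print(f"U+{ord(dz_title):04X}")
```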
▲(upper, lower, title, folding—both simple and full)
{{expand section|date=March 2022}}
Line 76 ⟶ 122:
{{more|Combining character}}
Some common codes:
:0 = spacing letter, symbol or modifier (e.g. {{Char|a}}, {{Char|(}}, {{Char|ʰ}})
:1 = overlay
:6 = Han reading (CJK diacritic reading marks)
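The codes can be inspected with <code>unicodedata.combining()</code> in Python, shown here as a small sketch. The value 230 for U+0301 and 220 for U+0323 are not among the codes quoted above; they are included only to show how the class drives canonical reordering during normalisation.

```python
import unicodedata

# Class 0: a spacing letter. Class 1: an overlay mark.
letter_class = unicodedata.combining("a")
overlay_class = unicodedata.combining("\u0338")  # COMBINING LONG SOLIDUS OVERLAY

# Marks with different non-zero classes are sorted by class during
# normalisation: U+0323 (class 220) moves in front of U+0301 (class 230).
unordered = "a\u0301\u0323"
reordered = unicodedata.normalize("NFD", unordered)

print(letter_class, overlay_class)
print([f"U+{ord(ch):04X}" for ch in reordered])
```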
Line 111 ⟶ 157:
Six character properties pertain to bi-directional writing: ''Bidi_Class'', ''Bidi_Control'', ''Bidi_Mirrored'', ''Bidi_Mirroring_Glyph'', ''Bidi_Paired_Bracket'' and ''Bidi_Paired_Bracket_Type''.
One of Unicode's major features is support for bi-directional (''Bidi'') text display, both right-to-left (R-to-L) and left-to-right (L-to-R). The Unicode Bidirectional Algorithm UAX9<ref name="UAX9">{{cite web|url=https://www.unicode.org/reports/tr9/ |title=Unicode Standard Annex #9: Unicode Bidirectional Algorithm|work=The Unicode Standard|date=2024-09-02}}</ref> describes the process of presenting text with alternating script directions. For example, it enables a Hebrew quote in an English text. The ''Bidi_Character_Type'' marks a character's behaviour in directional writing. To override a direction, Unicode has defined special ''formatting control characters'' (''Bidi-Control''
Each code point has a property called ''Bidi_Class''. It defines its behaviour in a bidirectional text as interpreted by the algorithm:
Line 117 ⟶ 163:
{{Bidi Class (Unicode)}}
In normal situations, the algorithm can determine the direction of a text by this character property. To control more complex Bidi situations, e.g. when an English text has a Hebrew quote, extra options are added to Unicode.
In outline, the algorithm determines sequences of characters with the same strong direction type (R-to-L ''or'' L-to-R), taking into account any overruling by the special Bidi-controls. Number strings (Weak types) are assigned a direction according to their strong environment, as are Neutral characters. Finally, the characters are displayed according to each string's direction.
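The strong, weak and neutral types mentioned above are the values of ''Bidi_Class''. As a sketch, they can be read with <code>unicodedata.bidirectional()</code> in Python. This shows only the per-character property, not the algorithm itself, and the sample characters are chosen purely for illustration.

```python
import unicodedata

samples = {
    "latin_letter": "A",        # strong left-to-right
    "hebrew_letter": "\u05d0",  # strong right-to-left
    "arabic_letter": "\u0627",  # strong right-to-left (Arabic)
    "digit": "1",               # weak: European number
    "space": " ",               # neutral: whitespace
    "override": "\u202e",       # Bidi control: right-to-left override
}

bidi_classes = {
    label: unicodedata.bidirectional(ch) for label, ch in samples.items()
}
print(bidi_classes)

# Bidi_Mirrored is exposed separately: 1 for characters such as brackets.
paren_mirrored = unicodedata.mirrored("(")
```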
Line 135 ⟶ 181:
===Hexadecimal digits===
[[Hexadecimal]] characters are those in the series with hexadecimal values
{{Hexadecimal digit (Unicode)}}
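A sketch of the two hexadecimal-digit properties as range tables. The ranges are an assumption transcribed by hand from the <code>ASCII_Hex_Digit</code> and <code>Hex_Digit</code> entries in <code>PropList.txt</code> (the latter adds the fullwidth forms), and the identifiers are hypothetical.

```python
# Inclusive ranges; an assumption to be verified against PropList.txt.
ASCII_HEX_DIGIT = [(0x0030, 0x0039), (0x0041, 0x0046), (0x0061, 0x0066)]
FULLWIDTH_HEX_DIGIT = [(0xFF10, 0xFF19), (0xFF21, 0xFF26), (0xFF41, 0xFF46)]
HEX_DIGIT = ASCII_HEX_DIGIT + FULLWIDTH_HEX_DIGIT


def in_ranges(ch: str, ranges: list[tuple[int, int]]) -> bool:
    """Return True if the single character ch lies in any inclusive range."""
    cp = ord(ch)
    return any(first <= cp <= last for first, last in ranges)


def is_ascii_hex_digit(ch: str) -> bool:
    return in_ranges(ch, ASCII_HEX_DIGIT)


def is_hex_digit(ch: str) -> bool:
    return in_ranges(ch, HEX_DIGIT)


hex_digit_count = sum(last - first + 1 for first, last in HEX_DIGIT)
print(hex_digit_count, is_hex_digit("\uff21"), is_ascii_hex_digit("\uff21"))
```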
Line 145 ⟶ 191:
==Block==
{{main|Unicode block}}
A ''block'' is a uniquely named, contiguous range of code points. It is identified by its first and last code point. Blocks do not [[intersection (set theory)|overlap]], nor do they extend across planes. The number of code points in each block must be a multiple of 16. A block may contain code points that are reserved, unassigned, etc. Each character that ''is'' assigned has a single "block name" value from the 338 names assigned as of Unicode version {{Unicode version|version=16.0}}. Unassigned code points outside of an existing block have the default value "No_Block".
{{Unicode blocks|state=mw-collapsed}}
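Because blocks are contiguous and do not overlap, a block lookup reduces to a binary search over sorted start points. The sketch below uses a four-entry excerpt rather than the full block list, so any code point outside those four blocks falls through to the default value even if it belongs to a real block; <code>BLOCKS</code> and <code>block_of</code> are hypothetical names.

```python
from bisect import bisect_right

# (first, last, name), sorted by first code point. Excerpt only: the real
# list has several hundred entries.
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x0370, 0x03FF, "Greek and Coptic"),
    (0x0400, 0x04FF, "Cyrillic"),
]
STARTS = [first for first, _, _ in BLOCKS]


def block_of(cp: int) -> str:
    """Return the block name for code point cp, or the default value."""
    i = bisect_right(STARTS, cp) - 1  # last block starting at or before cp
    if i >= 0 and cp <= BLOCKS[i][1]:
        return BLOCKS[i][2]
    return "No_Block"


# Every block size is a multiple of 16, as the definition requires.
sizes_ok = all((last - first + 1) % 16 == 0 for first, last, _ in BLOCKS)
print(block_of(0x41), block_of(0x03A3), sizes_ok)
```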
Line 165 ⟶ 211:
==Age==
''Age'' is the version of the
==Deprecated==
Line 188 ⟶ 234:
|U+0627 U+065F
|اٟ
|
|-
|U+0F77
Line 270 ⟶ 316:
* Line
* Sentence
{{expand section|date=January 2025}}
==Alias name==
Line 279 ⟶ 326:
;2. Control
:[[ISO 6429]] names for C0 and C1 control functions, and similar commonly occurring names, are added as aliases to the character.
:For example, {{unichar|0008}} has the alias {{smallcaps2|BACKSPACE}}.
;3. Correction
:This is a correction for a "serious problem" in the primary character name, usually an error.
:For example, {{unichar|2118
;4. Alternate
:A widely used alternate name for a character.
:Example: {{unichar|FEFF|ZERO WIDTH NO-BREAK SPACE}} has the alternate alias {{smallcaps2|1=BYTE ORDER MARK}}.
;5. Figment
:Several documented labels for C1 control code points which were never actually approved in any standard (
:For example, {{unichar|0099}} has the figment alias {{smallcaps2|1=SINGLE GRAPHIC CHARACTER INTRODUCER}}. This name is an architectural concept from early drafts of ISO/IEC 10646-1, but it was never approved
==External links==
*[https://www.unicode.org/reports/tr44/ Unicode Character Database], annex #44, explaining the different properties
*[https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt UnicodeData.txt] – a list of all Unicode characters, with their properties