{{Short description|Unicode code
{{Use British English|date=January 2025}}
The [[Unicode Standard]] assigns various properties to each Unicode character and [[code point]].<ref name="Chapter4">{{cite
The properties can be used to handle characters (code points) in processes such as line-breaking, determining script direction (for example right-to-left), or applying controls. Some "character properties" are also defined for code points that have no character assigned and code points that are
Properties have levels of forcefulness: normative, informative, contributory, or provisional. For simplicity of specification, a character property can be assigned by specifying a continuous range of code points that have the same property.<ref>{{cite web|url=https://www.unicode.org/reports/tr44/#Code_Point_Ranges|title=Unicode Standard Annex #44: Unicode Character Database, 4.2.3 Code Point Ranges|website=Unicode |date=
==Semantic elements==
[code];[name];[gc];[cc];[bc];[decomposition];[nv-dec];[nv-dig];[nv-num];[bm];[alias];;[upper case];[lower case];[title case]
*
*
*
*
*<code>decomposition</code> type or <mapping> = letter + diacritic, ligature X Y, superscript X, font X, initial X, medial X, final X, isolated X, vertical X, etc.
*
*
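As an illustration of the semicolon-separated layout shown above, the following minimal sketch splits one record into named fields. The field names simply mirror the layout line; <code>iso_comment</code> is an assumed name for the unnamed empty field, and <code>parse_unicode_data_line</code> is a hypothetical helper, not part of any Unicode tooling.

```python
# Field names mirror the layout line above. "iso_comment" is an assumed
# name for the unnamed (empty) field between alias and upper case.
FIELDS = [
    "code", "name", "gc", "cc", "bc", "decomposition",
    "nv_dec", "nv_dig", "nv_num", "bm", "alias", "iso_comment",
    "upper_case", "lower_case", "title_case",
]


def parse_unicode_data_line(line):
    """Split one semicolon-separated record into a dict keyed by field name."""
    values = line.rstrip("\n").split(";")
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    record = dict(zip(FIELDS, values))
    record["code"] = int(record["code"], 16)  # code points are written in hex
    return record


# The record for U+0041 as it appears in UnicodeData.txt.
sample = "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;"
record = parse_unicode_data_line(sample)
print(record["name"], record["gc"], record["lower_case"])
```

Empty fields are kept as empty strings, so a missing mapping (here the upper case and title case of "A") can be told apart from a present one.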
The property between
=={{anchor|Name}}Name and alias==
A Unicode character is assigned a unique
The following classes of code point do not have a Name (na=""): Controls (General Category: Cc), Private use (Co), Surrogate (Cs), Non-characters (Cn) and Reserved (Cn). They may be referenced, informally, by a generic or specific meta-name, called "Code Point Labels": {{not a typo|<control>, <control-0088>, <reserved>, <noncharacter-''hhhh''>, <private-use-''hhhh''>, or <surrogate>}}. Since these labels contain <>-brackets, they can never appear as a Name, which prevents confusion.
==={{anchor|Version 1.0 names}}Unicode 1.0 names===
In version 2.0 of Unicode, many names were changed. From then on the rule "a name will never change" came into effect, including the strict (normative) use of alias names. Disused version 1.0 names were moved to the property Alias, to provide some backward compatibility.
For example, {{Unichar|264}} has the Unicode 1.0 name "LATIN SMALL LETTER BABY GAMMA".
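Returning to the code point labels used for unnamed code points (controls, private use, surrogates, non-characters and reserved), the following is a minimal sketch of how a program might fall back to such a label. It assumes Python's standard <code>unicodedata</code> module; <code>describe</code> and <code>is_noncharacter</code> are hypothetical helper names, and the result for reserved code points depends on the Unicode version shipped with the interpreter.

```python
import unicodedata


def is_noncharacter(cp):
    """True for U+FDD0..U+FDEF and for the last two code points of each plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def describe(cp):
    """Return the Name of a code point, or a code point label if it has none."""
    ch = chr(cp)
    name = unicodedata.name(ch, None)
    if name is not None:
        return name
    gc = unicodedata.category(ch)
    if gc == "Cc":
        kind = "control"
    elif gc == "Co":
        kind = "private-use"
    elif gc == "Cs":
        kind = "surrogate"
    elif gc == "Cn":
        kind = "noncharacter" if is_noncharacter(cp) else "reserved"
    else:
        # An assigned character whose name this library does not know.
        raise ValueError(f"no name available for U+{cp:04X}")
    return f"<{kind}-{cp:04X}>"


for cp in (0x0264, 0x0088, 0xE000, 0xD800, 0xFFFF):
    print(f"U+{cp:04X}", describe(cp))
```

Because the labels contain angle brackets, a caller can always distinguish a label from a real Name in the returned string.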
===Character name alias===
{{main|Unicode alias names and abbreviations}}
Starting from Unicode
In addition to character name aliases which are corrections to defective character names, some characters are assigned aliases which are alternative names or abbreviations. Five types of character name aliases are defined in the Unicode Standard:
* Correction: corrections for misspelled or seriously incorrect character names;
* Control: [[ISO 6429]] names for C0 and C1 control functions (which are not assigned character names in the Unicode Standard);
* Alternate: alternative names for some format characters (only
* Figment: Documented labels for some C1 control code functions which are not actual names in any standard;
* Abbreviation: Abbreviations or acronyms for control codes, format characters, spaces, and variation selectors.
All formal character name aliases follow the rules for permissible character names, and are guaranteed to be unique within both the character name alias and the character name namespaces (for this reason, the ISO 6429 name "BELL" is not defined as an alias for
As of Unicode
Apart from these normative names,
==General Category==
===Punctuation===
Separate properties denote whether a character is a [[punctuation]] character. These properties all have [[boolean value|Yes/No values]]:
{{main|Dash|Quotation mark glyphs#Quotation marks in Unicode|Terminal punctuation}}
{{expand section|date=February 2012}}
===Whitespace===
{{main|Whitespace character}}
'''Whitespace''' is a commonly used concept for a typographic effect. Basically it covers invisible characters that have a spacing effect in rendered text. It includes [[Space (punctuation)|spaces]], tabs, and new line formatting controls. In Unicode, such a character has the property set
{{Whitespace (Unicode)|state=collapsed}}
===Casing===
The Case value is
<!--(upper, lower, title, folding—both simple and full)-->
Different languages have different case mapping rules.
In Turkish, the uppercase of {{Unichar|0069}} is {{Unichar|0130}} instead of {{Unichar|0049}}. Similarly, the lowercase of {{Unichar|0049}} is {{Unichar|0131}} instead of {{Unichar|0069}}.
In [[Nawdm]], the lowercase of the letter Ĥ is ɦ, whereas the usual case pairs are Ĥ/ĥ and Ɦ/ɦ.
In Greek, the letter sigma has different lowercase forms depending on where it is in a word. {{Unichar|03a3}} converts to {{Unichar|03c3}} if it is at the start or middle of a word, and converts to {{Unichar|03c2}} if it is at the end of a word.
In Lithuanian, the dot in lowercase i and j is preserved when followed by accents. For example: Í in lowercase is i̇́.<ref>{{Cite web|url=https://www.unicode.org/Public/UCD/latest/ucd/SpecialCasing.txt|title=Unicode Character Database: Special Casing Data|date=2024-05-10}}</ref>
Despite the existence of {{Unichar|1E9E}}, the uppercase mapping of {{Unichar|00DF}} is "SS".
Unicode encodes 31 titlecase characters:
* {{Unichar|01C5}}
* {{Unichar|01C8}}
* {{Unichar|01CB}}
* {{Unichar|01F2}}
* {{Unichar|1F88}}
* {{Unichar|1F89}}
* {{Unichar|1F8A}}
* {{Unichar|1F8B}}
* {{Unichar|1F8C}}
* {{Unichar|1F8D}}
* {{Unichar|1F8E}}
* {{Unichar|1F8F}}
* {{Unichar|1F98}}
* {{Unichar|1F99}}
* {{Unichar|1F9A}}
* {{Unichar|1F9B}}
* {{Unichar|1F9C}}
* {{Unichar|1F9D}}
* {{Unichar|1F9E}}
* {{Unichar|1F9F}}
* {{Unichar|1FA8}}
* {{Unichar|1FA9}}
* {{Unichar|1FAA}}
* {{Unichar|1FAB}}
* {{Unichar|1FAC}}
* {{Unichar|1FAD}}
* {{Unichar|1FAE}}
* {{Unichar|1FAF}}
* {{Unichar|1FBC}}
* {{Unichar|1FCC}}
* {{Unichar|1FFC}}
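The case mappings described in this section can be observed with Python's built-in string methods, which apply the default (language-independent) mappings. This is only a sketch: <code>turkish_upper</code> and <code>turkish_lower</code> are hypothetical helpers showing one simple way to tailor the dotted and dotless i, not a complete locale implementation, and the titlecase count depends on the Unicode version shipped with the interpreter.

```python
import unicodedata

# Default mappings, as applied by Python's str methods.
print("\u00df".upper())                   # sharp s becomes two letters
print("\u039f\u0394\u039f\u03a3".lower()) # word-final sigma is lowercased to U+03C2
print("\u01c6".title())                   # the digraph has a separate titlecase form


def turkish_upper(text):
    """Uppercase with dotted i mapped to U+0130 (hypothetical tailoring)."""
    return text.replace("i", "\u0130").upper()


def turkish_lower(text):
    """Lowercase with I mapped to dotless U+0131 (hypothetical tailoring)."""
    return text.replace("\u0130", "i").replace("I", "\u0131").lower()


# Titlecase characters are those with General Category Lt.
titlecase = [
    chr(cp) for cp in range(0x110000)
    if unicodedata.category(chr(cp)) == "Lt"
]
print(len(titlecase))
```

Without tailoring, <code>"i".upper()</code> yields "I", which is why language-specific rules such as the Turkish ones have to be applied by a separate layer.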
{{expand section|date=March 2022}}
==Combining class==
{{more|Combining character}}
Some common codes:
:0 = spacing letter, symbol or modifier (e.g. {{Char|a}}, {{Char|(}}, {{Char|ʰ}})
:1 = overlay
:6 = Han reading (CJK diacritic reading marks)
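The combining class values listed above can be queried with Python's standard <code>unicodedata.combining</code> function. The sketch below also shows canonical ordering during normalisation, which sorts adjacent combining marks by class; the sample characters are chosen only for illustration.

```python
import unicodedata

# Class 0: spacing letter, symbol and modifier letter.
for ch in ("a", "(", "\u02b0"):
    print(f"U+{ord(ch):04X}", unicodedata.combining(ch))

# Class 1: overlay, e.g. COMBINING LONG SOLIDUS OVERLAY.
print(unicodedata.combining("\u0338"))

# Canonical ordering: the dot below (class 220) is placed before the
# acute accent (class 230) when the string is normalised.
decomposed = unicodedata.normalize("NFD", "a\u0301\u0323")
print([f"U+{ord(c):04X}" for c in decomposed])
```

Characters with class 0 act as barriers: marks are never reordered across them.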
==Bidirectional writing==
Six character properties pertain to bi-directional writing: ''Bidi_Class'', ''Bidi_Control'', ''Bidi_Mirrored'', ''Bidi_Mirroring_Glyph'', ''Bidi_Paired_Bracket'' and ''Bidi_Paired_Bracket_Type''.
One of Unicode's major features is support of bi-directional (''Bidi'') text display right-to-left (R-to-L) and left-to-right (L-to-R). The Unicode Bidirectional Algorithm UAX9<ref name="UAX9">{{cite web|url=https://www.unicode.org/reports/tr9/ |title=Unicode Standard Annex #9: Unicode Bidirectional Algorithm|work=The Unicode Standard|date=
Each code point has a property called
{{Bidi Class (Unicode)}}
In normal situations, the algorithm can determine the direction of a text by this character property. To control more complex Bidi situations, e.g. when an English text has a Hebrew quote, extra options are added to Unicode.
Basically, the algorithm determines a sequence of characters with the same strong direction type (R-to-L ''or'' L-to-R), taking into account any override by the special Bidi controls. Number strings (Weak types) are assigned a direction according to their strong environment, as are Neutral characters. Finally, the characters are displayed according to the direction of the string they belong to.
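As a much simplified illustration of how strong types drive direction, the sketch below looks up the Bidi class of each character with Python's standard <code>unicodedata.bidirectional</code> and returns the direction of the first strong character. It is not the Unicode Bidirectional Algorithm: it ignores Bidi controls, embedding levels and the resolution of weak and neutral types, and <code>first_strong_direction</code> is a hypothetical helper name.

```python
import unicodedata

# Strong Bidi classes and the direction each one implies.
STRONG = {"L": "ltr", "R": "rtl", "AL": "rtl"}


def first_strong_direction(text, default="ltr"):
    """Return the direction of the first strong character, skipping weak
    types (such as digits) and neutral types (such as spaces)."""
    for ch in text:
        direction = STRONG.get(unicodedata.bidirectional(ch))
        if direction is not None:
            return direction
    return default


hebrew = "\u05e9\u05dc\u05d5\u05dd"
print(first_strong_direction("hello " + hebrew))
print(first_strong_direction("123 " + hebrew))
print(unicodedata.bidirectional("1"), unicodedata.bidirectional(" "))
```

The digits and the space in the second string are skipped because their classes (EN and WS) are not strong, so the Hebrew letters decide the result.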
Two character properties are relevant to determining a mirror image of a glyph in bidirectional text:
<!-- Bidi_Paired_Bracket and Bidi_Paired_Bracket_Type go here -->
===Decimal===
Characters are classified with a
The characters that do have a numeric value are divided into three groups: Decimal (De), Digit (Di) and Numeric (Nu, i.e. all other). "Decimal" means the character is a straight decimal digit. Only characters that are part of a contiguous encoded range 0..9 have numeric type Decimal. Other digits, like superscripts, have numeric type Digit. All other numeric characters, like fractions and Roman numerals, end up with the type "Numeric". The intended effect is that a simple parser can use these decimal numeric values without being distracted by, say, a numeric superscript or a fraction. Eighty-three CJK Ideographs that represent a number, including those used for accounting, are typed Numeric.
On the other hand, characters that could have a numeric value as a second meaning are still marked Numeric type
{{Numeric Type (Unicode)}}
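The three numeric types can be told apart with Python's standard <code>unicodedata</code> module, whose <code>decimal</code>, <code>digit</code> and <code>numeric</code> functions succeed for progressively wider sets of characters. The helper name <code>numeric_type</code> is hypothetical; this is a sketch of one way to derive the type, not an official API.

```python
import unicodedata


def numeric_type(ch):
    """Return 'De', 'Di' or 'Nu' for a numeric character, else None."""
    if unicodedata.decimal(ch, None) is not None:
        return "De"  # part of a contiguous 0..9 range
    if unicodedata.digit(ch, None) is not None:
        return "Di"  # a digit in a special form, e.g. a superscript
    if unicodedata.numeric(ch, None) is not None:
        return "Nu"  # any other numeric value, e.g. a fraction
    return None


for ch in ("7", "\u0663", "\u00b2", "\u00bd", "\u2167", "x"):
    print(f"U+{ord(ch):04X}", numeric_type(ch), unicodedata.numeric(ch, None))
```

A simple decimal parser would accept only characters for which the function returns "De".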
===Hexadecimal digits===
[[Hexadecimal]] characters are those in the series with hexadecimal values
{{Hexadecimal digit (Unicode)}}
Forty-four characters are marked as ''Hex_Digit''. The ones in the Basic Latin block are also marked as
Unicode has no separate characters for hexadecimal values. Consequently, when regular characters are used it is not possible to determine whether a hexadecimal value is intended, or even whether a value is intended at all. That should be determined at a higher level, e.g. by prepending
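The set of forty-four hexadecimal digit characters can be reconstructed as the twenty-two ASCII characters plus their fullwidth counterparts, which lie at a fixed offset of 0xFEE0. The sketch below builds both sets by hand, since Python's standard library does not expose these properties directly; the names <code>ASCII_HEX_DIGITS</code>, <code>HEX_DIGITS</code> and <code>looks_hexadecimal</code> are illustrative only.

```python
# The 22 hexadecimal digits in the Basic Latin block.
ASCII_HEX_DIGITS = set("0123456789ABCDEFabcdef")

# Their fullwidth forms (U+FF10..U+FF19, U+FF21..U+FF26, U+FF41..U+FF46)
# sit at a fixed offset of 0xFEE0 from the ASCII characters.
HEX_DIGITS = ASCII_HEX_DIGITS | {chr(ord(c) + 0xFEE0) for c in ASCII_HEX_DIGITS}


def looks_hexadecimal(text):
    """True if every character could be a hexadecimal digit."""
    return bool(text) and all(c in HEX_DIGITS for c in text)


print(len(HEX_DIGITS))
print(looks_hexadecimal("Face"))   # an English word and a possible hex value
print(int("0xFACE", 16))           # a prefix settles the interpretation
```

The word "Face" passes the character test, which illustrates why the interpretation has to come from a higher level than the characters themselves.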
==Block==
{{main|Unicode block}}
A
{{Unicode blocks|state=mw-collapsed}}
==Script==
{{main|Scripts in Unicode}}
Each assigned character can have a single value for its "Script" property, signifying to which script it belongs.<ref>{{cite web|url=https://www.unicode.org/reports/tr24/|title=Unicode Standard Annex #24: Unicode Script Property|work=The Unicode Standard|date=
The special code Zyyy for "Common" allows a single value for a character that is used in multiple scripts, such as punctuation and symbols. The code Zinh "Inherited script", used for combining characters and certain other special-purpose code points, indicates that a character "inherits" its script identity from the character with which it is combined. (Unicode formerly used the private code Qaai for this purpose.) The code Zzzz "Unknown" is the default value, used for code points that do not belong to a script: unassigned, private-use, non-character and surrogate code points. Characters of a single script can be scattered over multiple blocks, like [[Latin character]]s. Conversely, multiple scripts can be present in a single block, e.g. block [[Letterlike Symbols]] contains characters from the Latin, Greek and Common scripts.
==Age==
==Deprecated==
Once a character has been defined, it will not be removed or reassigned.<ref>{{cite web |url=https://www.unicode.org/policies/stability_policy.html |title=Unicode Character Encoding Stability Policies |website=Unicode |date=
{|class="wikitable sortable mw-collapsible {{{state|mw-uncollapsed}}}" style="margin:0"
|U+0627 U+065F
|اٟ
|
|-
|U+0F77
|U+206A
|{{unichar/name|na=INHIBIT SYMMETRIC SWAPPING}}
|colspan=2|None{{efn|name=Depr02|Rather than using this [[control character]] to indicate the appropriate appearance for text, appropriate character codes with the correct state should be used.<ref>{{cite
|
|-
|U+E0001
|{{unichar/name|na=LANGUAGE TAG}}
|colspan=2|None{{efn|name=Depr04|Alternative means of language tagging should be used instead.<ref>{{cite
|
|- class="sortbottom"
* Line
* Sentence
==Alias name==
{{main|Unicode alias names and abbreviations}}Unicode can assign
;1. Abbreviation
;2. Control
:[[ISO 6429]] names for C0 and C1 control functions, and similar commonly occurring names, are added as aliases to the character.
:For example, {{unichar|0008}} has the alias {{smallcaps2|BACKSPACE}}.
;3. Correction
:This is a correction for a "serious problem" in the primary character name, usually an error.
:For example, {{unichar|2118
;4. Alternate
:A widely used alternate name for a character.
:Example: {{unichar|FEFF|ZERO WIDTH NO-BREAK SPACE}} has the alternate alias {{smallcaps2|1=BYTE ORDER MARK}}.
;5. Figment
:Several documented labels for C1 control code points which were never actually approved in any standard (
:For example, {{unichar|0099}} has the figment alias {{smallcaps2|1=SINGLE GRAPHIC CHARACTER INTRODUCER}}. This name is an architectural concept from early drafts of ISO/IEC 10646-1, but it was never approved
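The alias types above can be exercised with Python's standard <code>unicodedata.lookup</code>, which, assuming Python 3.3 or later, resolves formal name aliases as well as primary names. The reverse function <code>unicodedata.name</code> returns only the primary Name, so control characters still have none. This is a sketch of library behaviour, not a statement about how the Unicode data files are organised.

```python
import unicodedata

# Aliases resolve to the code point they are attached to.
aliases = {
    "BACKSPACE": unicodedata.lookup("BACKSPACE"),                  # control alias
    "BYTE ORDER MARK": unicodedata.lookup("BYTE ORDER MARK"),      # alternate alias
    "SINGLE GRAPHIC CHARACTER INTRODUCER":
        unicodedata.lookup("SINGLE GRAPHIC CHARACTER INTRODUCER"), # figment alias
}
for alias, ch in aliases.items():
    print(f"{alias} -> U+{ord(ch):04X}")

# The primary Name is unaffected by aliases.
print(unicodedata.name("\ufeff"))
print(unicodedata.name("\x08", None))  # controls have no Name
```

Because aliases share a namespace with primary names, each string passed to <code>lookup</code> identifies at most one code point.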
==External links==
*[https://www.unicode.org/reports/tr44/ Unicode Character Database], annex #44, explaining the different properties
*[https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt UnicodeData.txt] – a list of all Unicode characters, with their properties