{{Short description|Unicode code
{{Use British English|date=January 2025}}
The [[Unicode Standard]] assigns various properties to each Unicode character and [[code point]].<ref name="Chapter4">{{cite web|url=https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-4/|date=September 2024|title=The Unicode Standard Version 16 |publisher=The Unicode Consortium |access-date=2024-09-13}}</ref><ref name="UAX44" />
The properties can be used to handle characters (code points) in processes such as line breaking, determining script direction (for example right-to-left) or applying controls. Some "character properties" are also defined for code points that have no character assigned and code points that are
Properties have levels of forcefulness: normative, informative, contributory, or provisional. For simplicity of specification, a character property can be assigned by specifying a continuous range of code points that have the same property.<ref>{{cite web|url=https://www.unicode.org/reports/tr44/#Code_Point_Ranges|title=Unicode Standard Annex #44: Unicode Character Database, 4.2.3 Code Point Ranges|website=Unicode |date=2024-08-27}}</ref>
[code];[name];[gc];[cc];[bc];[decomposition];[nv-dec];[nv-dig];[nv-num];[bm];[alias];;[upper case];[lower case];[title case]
*
*
*
*
*<code>decomposition</code> type or <mapping> = letter + diacritic, ligature X Y, superscript X, font X, initial X, medial X, final X, isolated X, vertical X, etc.
*
*
The property between
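As a minimal sketch of how one record in the layout above can be read, the following Python snippet splits a line on semicolons and pairs each value with its field label. The label <code>iso_comment</code> for the empty twelfth field and the function name <code>parse_record</code> are illustrative assumptions; the sample line is the record for U+0041.

```python
# Field labels follow the layout quoted above. "iso_comment" is this sketch's
# own label for the empty twelfth field.
FIELDS = (
    "code", "name", "gc", "cc", "bc", "decomposition",
    "nv_dec", "nv_dig", "nv_num", "bm", "alias", "iso_comment",
    "upper", "lower", "title",
)


def parse_record(line: str) -> dict:
    """Split one semicolon-separated record into a label -> value mapping."""
    values = line.rstrip("\n").split(";")
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    return dict(zip(FIELDS, values))


# Record for U+0041 LATIN CAPITAL LETTER A.
SAMPLE = "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;"
record = parse_record(SAMPLE)
print(record["name"], record["gc"], record["lower"])
```

Empty fields are kept as empty strings, so a missing upper case or title case mapping can be told apart from a present one.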
=={{anchor|Name}}Name and alias==
A Unicode character is assigned a unique
The following classes of code point do not have a Name (na=""): Controls (General Category: Cc), Private use (Co), Surrogate (Cs), Non-characters (Cn) and Reserved (Cn). They may be referenced, informally, by a generic or specific meta-name, called "Code Point Labels": {{not a typo|<control>, <control-0088>, <reserved>, <noncharacter-''hhhh''>, <private-use-''hhhh''>, or <surrogate>}}. Since these labels contain <>-brackets, they can never appear as a Name, which prevents confusion.
==={{anchor|Version 1.0 names}}Unicode 1.0 names===
In version 2.0 of Unicode, many names were changed. From then on the rule "a name will never change" came into effect, including the strict (normative) use of alias names. Disused version 1.0 names were moved to the property Alias, to provide some backward compatibility.
For example, {{Unichar|264}} has the Unicode 1.0 name "LATIN SMALL LETTER BABY GAMMA".
===Character name alias===
{{main|Unicode alias names and abbreviations}}
Starting from Unicode
In addition to character name aliases which are corrections to defective character names, some characters are assigned aliases which are alternative names or abbreviations. Five types of character name aliases are defined in the Unicode Standard:
* Correction: corrections for misspelled or seriously incorrect character names;
* Control: [[ISO 6429]] names for C0 and C1 control functions (which are not assigned character names in the Unicode Standard);
* Alternate: alternative names for some format characters (only
* Figment: documented labels for some C1 control code functions which are not actual names in any standard;
* Abbreviation: abbreviations or acronyms for control codes, format characters, spaces, and variation selectors.
All formal character name aliases follow the rules for permissible character names, and are guaranteed to be unique within both the character name alias and the character name namespaces (for this reason, the ISO 6429 name "BELL" is not defined as an alias for
As of Unicode
Apart from these normative names,
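The Name section above notes that controls, private-use code points, surrogates, noncharacters and reserved code points have no Name and are referred to by code point labels instead. The following Python sketch illustrates that fallback. The function names are hypothetical, and the result depends on the Unicode version bundled with the Python interpreter, so a code point assigned in a newer Unicode version may still be reported as reserved.

```python
import unicodedata


def is_noncharacter(cp: int) -> bool:
    """U+FDD0..U+FDEF and the last two code points of every plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def name_or_label(cp: int) -> str:
    """Return the character name, or a code point label if there is none."""
    ch = chr(cp)
    name = unicodedata.name(ch, "")
    if name:
        return name
    gc = unicodedata.category(ch)
    if gc == "Cc":
        kind = "control"
    elif gc == "Co":
        kind = "private-use"
    elif gc == "Cs":
        kind = "surrogate"
    elif gc == "Cn":
        kind = "noncharacter" if is_noncharacter(cp) else "reserved"
    else:
        # Named in the standard but not known to this Python build.
        raise ValueError(f"no name available for U+{cp:04X}")
    return f"<{kind}-{cp:04X}>"


for cp in (0x0041, 0x0088, 0xE000, 0xFFFE):
    print(f"U+{cp:04X}", name_or_label(cp))
```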
==General Category==
===Punctuation===
Separate properties denote whether a character is a [[punctuation]] character. Each of these properties takes a [[boolean value|Yes/No value]]:
{{main|Dash|Quotation mark glyphs#Quotation marks in Unicode|Terminal punctuation}}
===Whitespace===
{{main|Whitespace character}}
'''Whitespace''' is a commonly used concept for a typographic effect. Basically it covers invisible characters that have a spacing effect in rendered text. It includes [[Space (punctuation)|spaces]], tabs, and new line formatting controls. In Unicode, such a character has the property set
{{Whitespace (Unicode)|state=collapsed}}
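Python's standard library does not expose the White_Space property directly, so a membership test has to carry its own table. The sketch below hard-codes the 25 code points that have White_Space=Yes in recent Unicode versions. The names <code>WHITE_SPACE</code> and <code>is_white_space</code> are illustrative, and the table would need updating if a future version changed the set.

```python
# Code points with White_Space=Yes, written out by hand for this sketch.
WHITE_SPACE = frozenset(
    [chr(cp) for cp in range(0x0009, 0x000E)]      # TAB, LF, VT, FF, CR
    + [chr(cp) for cp in range(0x2000, 0x200B)]    # EN QUAD .. HAIR SPACE
    + list("\u0020\u0085\u00A0\u1680\u2028\u2029\u202F\u205F\u3000")
)


def is_white_space(ch: str) -> bool:
    """True if the single character ch is in the hand-written table."""
    return ch in WHITE_SPACE


print(len(WHITE_SPACE))
print(is_white_space("\u00A0"), is_white_space("\u200B"))
```

This is not the same test as <code>str.isspace()</code>, which also accepts the information separators U+001C to U+001F.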
===Casing===
The Case value is
<!--(upper, lower, title, folding: both simple and full)-->
Different languages have different case mapping rules.
In Turkish, the uppercase form of {{Unichar|0069}} is {{Unichar|0130}} instead of {{Unichar|0049}}. Similarly, the lowercase form of {{Unichar|0049}} is {{Unichar|0131}} instead of {{Unichar|0069}}.
In [[Nawdm]], the lowercase form of Ĥ is ɦ, whereas the usual case pairs are Ĥ/ĥ and Ɦ/ɦ.
In Greek, the letter sigma has different lowercase forms depending on where it is in a word. {{Unichar|03a3}} converts to {{Unichar|03c3}} if it is at the start or middle of a word, and converts to {{Unichar|03c2}} if it is at the end of a word.
In Lithuanian, the dot in lowercase i and j is preserved when followed by accents. For example: Í in lowercase is i̇́.<ref>{{Cite web|url=https://www.unicode.org/Public/UCD/latest/ucd/SpecialCasing.txt|title=Unicode Character Database: Special Casing Data|date=2024-05-10}}</ref>
Despite the existence of {{Unichar|1E9E}}, the uppercase form of {{Unichar|00DF}} is "SS".
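Some of the mappings above can be observed with Python's built-in string methods, which apply the language-independent default case mappings. The Turkish helpers <code>upper_tr</code> and <code>lower_tr</code> are hypothetical names for a hand-written workaround, not a standard API; a production system would more likely rely on a locale-aware library.

```python
def upper_tr(text: str) -> str:
    """Uppercase with the Turkish dotted/dotless i pairs (i/İ and ı/I)."""
    return text.replace("i", "\u0130").replace("\u0131", "I").upper()


def lower_tr(text: str) -> str:
    """Lowercase with the Turkish dotted/dotless i pairs (i/İ and ı/I)."""
    return text.replace("I", "\u0131").replace("\u0130", "i").lower()


default_upper = "istanbul".upper()      # default mapping: i -> I
turkish_upper = upper_tr("istanbul")    # Turkish mapping: i -> U+0130
sharp_s_upper = "\u00DF".upper()        # sharp s -> "SS"
final_sigma = "\u039F\u0394\u039F\u03A3".lower()   # word-final sigma
lone_sigma = "\u03A3".lower()           # isolated capital sigma

print(default_upper, turkish_upper, sharp_s_upper, final_sigma, lone_sigma)
```

The helpers replace the four affected letters before calling the default methods, because the default lowercase form of U+0130 is "i" followed by a combining dot above.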
Unicode encodes 31 titlecase characters.
* {{Unichar|01C5}}
* {{Unichar|01C8}}
* {{Unichar|01CB}}
* {{Unichar|01F2}}
* {{Unichar|1F88}}
* {{Unichar|1F89}}
* {{Unichar|1F8A}}
* {{Unichar|1F8B}}
* {{Unichar|1F8C}}
* {{Unichar|1F8D}}
* {{Unichar|1F8E}}
* {{Unichar|1F8F}}
* {{Unichar|1F98}}
* {{Unichar|1F99}}
* {{Unichar|1F9A}}
* {{Unichar|1F9B}}
* {{Unichar|1F9C}}
* {{Unichar|1F9D}}
* {{Unichar|1F9E}}
* {{Unichar|1F9F}}
* {{Unichar|1FA8}}
* {{Unichar|1FA9}}
* {{Unichar|1FAA}}
* {{Unichar|1FAB}}
* {{Unichar|1FAC}}
* {{Unichar|1FAD}}
* {{Unichar|1FAE}}
* {{Unichar|1FAF}}
* {{Unichar|1FBC}}
* {{Unichar|1FCC}}
* {{Unichar|1FFC}}
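The list above can be cross-checked against the General Category value Lt (titlecase letter). The following sketch scans every code point with Python's <code>unicodedata</code> module. The count it reports reflects the Unicode version bundled with the interpreter, so it is a sanity check under that assumption rather than an authoritative source.

```python
import sys
import unicodedata

# Every character whose General Category is Lt (Letter, titlecase).
titlecase_letters = [
    chr(cp)
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) == "Lt"
]

print(len(titlecase_letters))
print(" ".join(f"U+{ord(ch):04X}" for ch in titlecase_letters))

# A digraph such as U+01C6 maps to its titlecase form U+01C5.
dz_title = "\u01C6".title()
```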
{{expand section|date=March 2022}}
==Combining class==
{{more|Combining character}}
Some common codes:
:0 = spacing letter, symbol or modifier (e.g. {{Char|a}}, {{Char|(}}, {{Char|ʰ}})
:1 = overlay
:6 = Han reading (CJK diacritic reading marks)
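The combining class of a character can be looked up in Python with <code>unicodedata.combining()</code>, which returns the value as an integer. The examples below use the characters named in the list above plus two combining marks chosen for illustration: U+0338 (an overlay) and U+0301 (an accent placed above the base).

```python
import unicodedata

samples = {
    "a": "LATIN SMALL LETTER A",
    "(": "LEFT PARENTHESIS",
    "\u02B0": "MODIFIER LETTER SMALL H",
    "\u0338": "COMBINING LONG SOLIDUS OVERLAY",
    "\u0301": "COMBINING ACUTE ACCENT",
}

# Map each sample character to its canonical combining class.
combining_class = {ch: unicodedata.combining(ch) for ch in samples}

for ch, label in samples.items():
    print(f"U+{ord(ch):04X} {label}: {combining_class[ch]}")
```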
==Bidirectional writing==
Six character properties pertain to bi-directional writing: ''Bidi_Class'', ''Bidi_Control'', ''Bidi_Mirrored'', ''Bidi_Mirroring_Glyph'', ''Bidi_Paired_Bracket'' and ''Bidi_Paired_Bracket_Type''.
One of Unicode's major features is support for bi-directional (''Bidi'') text display, right-to-left (R-to-L) and left-to-right (L-to-R). The Unicode Bidirectional Algorithm (UAX #9)<ref name="UAX9">{{cite web|url=https://www.unicode.org/reports/tr9/ |title=Unicode Standard Annex #9: Unicode Bidirectional Algorithm|work=The Unicode Standard|date=2024-09-02}}</ref> describes the process of presenting text with alternating script directions. For example, it enables a Hebrew quote in an English text. The ''Bidi_Character_Type'' marks a character's behaviour in directional writing. To override a direction, Unicode has defined special ''formatting control characters'' (
Each code point has a property called
{{Bidi Class (Unicode)}}
In normal situations, the algorithm can determine the direction of a text by this character property. To control more complex Bidi situations, e.g. when an English text has a Hebrew quote, extra options are added to Unicode.
Basically, the algorithm determines sequences of characters with the same strong direction type (R-to-L ''or'' L-to-R), taking into account any override by the special Bidi controls. Number strings (Weak types) are assigned a direction according to their strong environment, as are Neutral characters. Finally, the characters are displayed according to each string's direction.
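The role of strong, weak and neutral types can be illustrated with Python's <code>unicodedata.bidirectional()</code>, which returns a character's Bidi_Class value. The sketch below is deliberately incomplete: it only finds the first strong character of a string and ignores explicit controls, isolates and every later step of the algorithm. The function names are hypothetical.

```python
import unicodedata

STRONG_LTR = {"L"}
STRONG_RTL = {"R", "AL"}


def bidi_classes(text: str) -> list:
    """Bidi_Class value of every character in text."""
    return [unicodedata.bidirectional(ch) for ch in text]


def first_strong_direction(text: str, default: str = "ltr") -> str:
    """Direction of the first strong character, or default if none exists."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in STRONG_LTR:
            return "ltr"
        if bidi_class in STRONG_RTL:
            return "rtl"
    return default


mixed = "123 \u05E9\u05DC\u05D5\u05DD abc"
print(bidi_classes(mixed))
print(first_strong_direction(mixed))
```

In the sample string the leading digits are weak (EN) and the spaces are neutral (WS), so the first Hebrew letter decides the result.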
Two character properties are relevant to determining a mirror image of a glyph in bidirectional text:
<!-- Bidi_Paired_Bracket and Bidi_Paired_Bracket_Type go here -->
===Decimal===
Characters are classified with a
The characters that do have a numeric value are divided into three groups: Decimal (De), Digit (Di) and Numeric (Nu, i.e. all others). "Decimal" means the character is a straight decimal digit. Only characters that are part of a contiguous encoded range 0..9 have numeric type Decimal. Other digits, like superscripts, have numeric type Digit. All other numeric characters, such as fractions and Roman numerals, have the type "Numeric". The intended effect is that a simple parser can use the decimal numeric values without being distracted by, say, a numeric superscript or a fraction. Eighty-three CJK Ideographs that represent a number, including those used for accounting, are typed Numeric.
On the other hand, characters that could have a numeric value as a second meaning are still marked Numeric type
{{Numeric Type (Unicode)}}
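The three numeric types can be approximated in Python by probing <code>unicodedata.decimal()</code>, <code>unicodedata.digit()</code> and <code>unicodedata.numeric()</code> in that order. The helper names <code>numeric_type</code> and <code>parse_decimal</code> are illustrative, and the results depend on the Unicode data bundled with the interpreter.

```python
import unicodedata
from typing import Optional


def numeric_type(ch: str) -> Optional[str]:
    """Return 'De', 'Di' or 'Nu' for a numeric character, else None."""
    if unicodedata.decimal(ch, None) is not None:
        return "De"
    if unicodedata.digit(ch, None) is not None:
        return "Di"
    if unicodedata.numeric(ch, None) is not None:
        return "Nu"
    return None


def parse_decimal(text: str) -> int:
    """Parse a string made only of Decimal-type digits, in any script."""
    value = 0
    for ch in text:
        # unicodedata.decimal raises ValueError for non-Decimal characters.
        value = value * 10 + unicodedata.decimal(ch)
    return value


for ch in ("7", "\u00B2", "\u00BD", "\u216B", "a"):
    print(f"U+{ord(ch):04X}", numeric_type(ch))
print(parse_decimal("\u0664\u0662"))
```

Because <code>parse_decimal</code> accepts only Decimal-type characters, a superscript or a fraction in the input raises an error instead of being silently read as a digit.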
===Hexadecimal digits===
[[Hexadecimal]] characters are those in the series with hexadecimal values
{{Hexadecimal digit (Unicode)}}
Forty-four characters are marked as ''Hex_Digit''. The ones in the Basic Latin block are also marked as
Unicode has no separate characters for hexadecimal values. As a consequence, when regular characters are used it is not possible to determine whether a hexadecimal value is intended, or even whether a value is intended at all. That should be determined at a higher level, e.g. by prepending
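The count of forty-four can be reproduced by writing out the set by hand: the 22 Basic Latin characters (0–9, A–F, a–f) and their 22 fullwidth counterparts. The sketch below builds both sets and converts a single hex digit to its value. The names are illustrative, and the fullwidth forms are mapped to their Basic Latin equivalents through NFKC normalisation before conversion.

```python
import unicodedata


def _run(first: str, count: int) -> list:
    """count consecutive characters starting at first."""
    return [chr(ord(first) + i) for i in range(count)]


# Basic Latin hex digits (these also carry ASCII_Hex_Digit).
ASCII_HEX_DIGIT = frozenset(_run("0", 10) + _run("A", 6) + _run("a", 6))
# Fullwidth digits and letters A-F / a-f.
FULLWIDTH_HEX_DIGIT = frozenset(
    _run("\uFF10", 10) + _run("\uFF21", 6) + _run("\uFF41", 6)
)
HEX_DIGIT = ASCII_HEX_DIGIT | FULLWIDTH_HEX_DIGIT


def hex_value(ch: str) -> int:
    """Value 0..15 of one Hex_Digit character."""
    if ch not in HEX_DIGIT:
        raise ValueError(f"not a hexadecimal digit: {ch!r}")
    return int(unicodedata.normalize("NFKC", ch), 16)


print(len(HEX_DIGIT), hex_value("F"), hex_value("\uFF46"))
```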
==Block==
{{main|Unicode block}}
A
{{Unicode blocks|state=mw-collapsed}}
==Age==
==Deprecated==
Once a character has been defined, it will not be removed or reassigned.<ref>{{cite web |url=https://www.unicode.org/policies/stability_policy.html |title=Unicode Character Encoding Stability Policies |website=Unicode |date=2024-01-09 |access-date=2024-01-13 |publisher=[[Unicode Consortium]] |quote=Once a character is encoded, it will not be moved or removed.}}</ref> However, a character may be [[deprecation|deprecated]], meaning its "use is strongly discouraged".<ref>{{cite web|title=The Unicode Standard, D13 Deprecated character |url=https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-3/#G48383 |date=2024 |access-date=2024-09-13 }}</ref> As of Unicode version {{Unicode version|version=
{|class="wikitable sortable mw-collapsible {{{state|mw-uncollapsed}}}" style="margin:0"
|U+0627 U+065F
|اٟ
|
|-
|U+0F77
* Line
* Sentence
{{expand section|date=January 2025}}
==Alias name==
{{main|Unicode alias names and abbreviations}}
Unicode can assign
;1. Abbreviation
;2. Control
:[[ISO 6429]] names for C0 and C1 control functions, and similar commonly occurring names, are added as aliases to the character.
:For example, {{unichar|0008}} has the alias {{smallcaps2|BACKSPACE}}.
;3. Correction
:This is a correction for a "serious problem" in the primary character name, usually an error.
:For example, {{unichar|2118
;4. Alternate
:A widely used alternate name for a character.
:Example: {{unichar|FEFF|ZERO WIDTH NO-BREAK SPACE}} has the alternate alias {{smallcaps2|1=BYTE ORDER MARK}}.
;5. Figment
:Several documented labels for C1 control code points which were never actually approved in any standard (
:For example, {{unichar|0099}} has the figment alias {{smallcaps2|1=SINGLE GRAPHIC CHARACTER INTRODUCER}}. This name is an architectural concept from early drafts of ISO/IEC 10646-1, but it was never approved
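Aliases such as those in the examples above can be resolved in Python, assuming an interpreter whose <code>unicodedata.lookup()</code> accepts formal name aliases (documented since Python 3.3). The reverse direction is not available: <code>unicodedata.name()</code> returns only the primary Name, and nothing at all for a control character.

```python
import unicodedata

# Alias -> character, using the aliases quoted in the examples above.
backspace = unicodedata.lookup("BACKSPACE")
byte_order_mark = unicodedata.lookup("BYTE ORDER MARK")

# Character -> primary Name; controls have none, so pass a default.
bom_name = unicodedata.name(byte_order_mark)
backspace_name = unicodedata.name(backspace, None)

print(f"U+{ord(backspace):04X}", backspace_name)
print(f"U+{ord(byte_order_mark):04X}", bom_name)
```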
==External links==
*[https://www.unicode.org/reports/tr44/ Unicode Character Database], annex #44, explaining the different properties
*[https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt UnicodeData.txt] – a list of all Unicode characters, with their properties