m Typo fixing, typo(s) fixed: For example → For example, (2) |
Line 14:
Unicode previously included a block of 128 code points, now deprecated, for language tags. The tag characters mirrored the ASCII characters one for one but were used to identify the subsequent text as belonging to a particular language according to [[BCP 47]]. For example, to mark the subsequent text as English as written in the United States, the initiating ‘Language Tag’ character (U+E0001) would have been followed by the sequence ‘Tag Latin Small Letter E’ (U+E0065), ‘Tag Latin Small Letter N’ (U+E006E), ‘Tag Hyphen-Minus’ (U+E002D), ‘Tag Latin Small Letter U’ (U+E0075) and ‘Tag Latin Small Letter S’ (U+E0073).
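The mapping described above can be sketched in a few lines of Python. This is only an illustration of the offset between ASCII and the tag characters, not a recommendation to emit these deprecated characters; the function names <code>to_language_tag</code> and <code>from_language_tag</code> are hypothetical and chosen for this example.

```python
# Illustrative sketch only: language tag characters are deprecated.
# Each tag character sits at a fixed offset of 0xE0000 from its ASCII counterpart.
TAG_OFFSET = 0xE0000
LANGUAGE_TAG = "\U000E0001"  # U+E0001, the initiating Language Tag character


def to_language_tag(tag: str) -> str:
    """Encode an ASCII language tag such as 'en-us' as a tag character sequence."""
    tag = tag.lower()
    for ch in tag:
        if not 0x20 <= ord(ch) <= 0x7E:
            raise ValueError(f"not a printable ASCII character: {ch!r}")
    return LANGUAGE_TAG + "".join(chr(TAG_OFFSET + ord(ch)) for ch in tag)


def from_language_tag(sequence: str) -> str:
    """Decode a tag character sequence back to its ASCII language tag."""
    if not sequence.startswith(LANGUAGE_TAG):
        raise ValueError("sequence does not start with U+E0001")
    decoded = []
    for ch in sequence[1:]:
        if not 0xE0020 <= ord(ch) <= 0xE007E:
            raise ValueError(f"not a tag character: U+{ord(ch):04X}")
        decoded.append(chr(ord(ch) - TAG_OFFSET))
    return "".join(decoded)


if __name__ == "__main__":
    encoded = to_language_tag("en-us")
    # Show the code points that make up the encoded sequence.
    print(" ".join(f"U+{ord(ch):05X}" for ch in encoded))
    print(from_language_tag(encoded))
```

Encoding <code>en-us</code> this way yields the six code points listed in the paragraph above, beginning with U+E0001.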
These language tag characters would not be displayed themselves. However, they would provide information for text processing or even for the display of other characters. For example, the display of Unihan ideographs might have substituted different glyphs if the language tags indicated Korean than if the tags indicated Japanese. As another example, language tags might have influenced the display of the decimal digits 0 through 9, depending on the language in which they appeared.
The tag characters U+E0001, U+E0020–U+E007E, and U+E007F were deprecated in Unicode 5.1 (2008) and should not be used for language information.<ref>{{cite web|url=http://tools.ietf.org/html/rfc6082|title=RFC6082: Deprecating Unicode Language Tag Characters: RFC 2482 is Historic | publisher=Internet Engineering Task Force (IETF)|date=November 2010}}</ref>
Line 24:
== Interlinear annotation ==
Three formatting characters (U+FFF9, U+FFFA, U+FFFB) provide support for [[Ruby text|interlinear annotation]]. They may be used to provide notes that would typically be displayed between the lines of other text. Unicode considers such annotation to be rich text and recommends using other protocols for it. The W3C [[
== Bidirectional text control ==
Line 34:
{{main|Variation selector (Unicode)}}
Many characters map to alternate glyphs depending on the context. For example, Arabic and cursive Latin characters take different glyphs, so that adjacent glyphs connect, depending on whether the character is the initial character in a word, the final character, a medial character or an isolated character. These types of glyph substitution are determined entirely by the context of the character, with no other authoring input involved. Authors may also use special-purpose characters such as joiners and non-joiners to force an alternate form of glyph where it would not otherwise appear. Ligatures are a similar case, where glyphs may be substituted simply by turning ligatures on or off as a rich text attribute.
However, for other glyph substitution, the author's intent cannot be determined contextually and may need to be encoded with the text. This is the case with characters or glyphs referred to as [[Kanji#Gaiji|gaiji]], where different glyphs are used for the same character, either historically or for ideographs in family names. This is one of the gray areas in distinguishing between a glyph and a character: if a family name differs slightly from the ideograph it derives from, is that a simple glyph variant or a character variant? Unicode 3.2 introduced 16 variation selectors (U+FE00–U+FE0F) and Unicode 4.0 added 240 more (U+E0100–U+E01EF), so the character set now includes 256 variation selectors. These combining mark characters can select from up to 256 possible character/glyph variations for the preceding character.
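As a minimal sketch of how a variation sequence is formed, the following Python maps a selector number from 1 to 256 onto the two code point ranges U+FE00–U+FE0F and U+E0100–U+E01EF and appends it to a base character. The helper names <code>variation_selector</code> and <code>with_variation</code> are hypothetical. Whether a given sequence actually selects a distinct glyph depends on that sequence being defined and on font support, which this sketch does not check.

```python
def variation_selector(n: int) -> str:
    """Return the variation selector numbered n (1 through 256)."""
    if 1 <= n <= 16:
        # VS1 through VS16 occupy U+FE00 through U+FE0F.
        return chr(0xFE00 + (n - 1))
    if 17 <= n <= 256:
        # VS17 through VS256 occupy U+E0100 through U+E01EF.
        return chr(0xE0100 + (n - 17))
    raise ValueError("variation selector number must be between 1 and 256")


def with_variation(base: str, n: int) -> str:
    """Append variation selector n to a single base character."""
    if len(base) != 1:
        raise ValueError("base must be exactly one character")
    return base + variation_selector(n)


if __name__ == "__main__":
    # The selector follows the character whose glyph it is meant to vary.
    sequence = with_variation("\u845B", 17)
    print(" ".join(f"U+{ord(ch):04X}" for ch in sequence))
```

The selector always follows the base character, matching the description of it selecting a variation "for the preceding character".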