→top: unichar template is not working for control characters
m →Interlinear annotation: fix misguided link
(15 intermediate revisions by 9 users not shown)
Line 10:
The [[ISO/IEC 8859]] series of encodings conforms to [[ISO/IEC 4873]] (ECMA-43) level 1, a subset of ISO/IEC 2022 designed for 8-bit character encodings, and therefore reserves the range 0x80–0x9F for use as non-printing codes by C1 control code sets such as ISO/IEC 6429.<ref>{{citation|mode=cs1 |quotation=This set of coded graphic characters may be regarded as a version of an 8-bit code according to ISO/IEC 2022 or ISO/IEC 4873 at level 1. […] The shaded positions in the code table correspond to bit combinations that do not represent graphic characters. Their use is outside the scope of ISO/IEC 8859; it is specified in other International Standards, for example ISO/IEC 6429. |url=http://www.open-std.org/JTC1/sc2/wg3/docs/n411.pdf |title=Final Text of DIS 8859-1, 8-bit single-byte coded graphic character sets—Part 1: Latin alphabet No.1 |author=ISO/IEC JTC 1/SC 2/WG 3 |author-link=ISO/IEC JTC 1/SC 2 |id=[[ISO]]/[[International Electrotechnical Commission|IEC]] [[International Organization for Standardization#Standardization process|FDIS]] 8859-1:1998; JTC1/SC2/N2988; WG3/N411 |date=1998-02-12}}</ref> Unicode inherits its [[Basic Latin (Unicode block)|first]] and [[Latin-1 Supplement (Unicode block)|second]] blocks (comprising U+0000 through U+00FF) from ASCII and [[ISO/IEC 8859-1]], thus incorporating the C0 and C1 control code ranges (U+0000–U+001F, U+007F–U+009F) as general category "Cc". It does not assign normative names to these control codes, though it does assign them normative aliases.<ref name="aliases" />
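The category assignment described above can be checked with a minimal sketch, assuming Python's standard <code>unicodedata</code> module, whose data reflects whichever Unicode version ships with the interpreter. The variable names are illustrative only.

```python
import unicodedata

# Collect every code point whose general category is "Cc".
cc_points = [cp for cp in range(0x110000)
             if unicodedata.category(chr(cp)) == "Cc"]

# The ranges given in the text: U+0000-U+001F and U+007F-U+009F.
expected = list(range(0x00, 0x20)) + list(range(0x7F, 0xA0))
covers_exactly_c0_c1 = (cc_points == expected)

# Control codes have no character name, so name() returns the
# supplied default (None) instead of a string.
unnamed = all(unicodedata.name(chr(cp), None) is None for cp in cc_points)

print(len(cc_points), covers_exactly_c0_c1, unnamed)
```

The check on names covers only the formal character name; the normative aliases mentioned above are a separate property that this sketch does not inspect.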
Category "Cc" control codes can serve a variety of purposes, not limited to format effectors: for example, the default ASCII C0 set includes six format effectors ({{ctrl|BS}}, {{ctrl|HT}}, {{ctrl|LF}}, {{ctrl|VT}}, {{ctrl|FF}} and {{ctrl|CR}}), ten transmission controls, four device controls, four information separators and eight other control codes.<ref name="ir001">{{
* {{
* {{
* {{
* {{
* {{
* {{
Unicode specifies semantics only for {{tt|U+0009–U+000D}}, {{tt|U+001C–U+001F}}, and {{tt|U+0085}} (the ASCII format effectors except for {{ctrl|BS}}, plus the ASCII information separators and the C1 {{ctrl|NEL}}). The rest of the "Cc" control codes are transparent to Unicode and their meanings are left to higher-level protocols, although interpretation as defined in ISO/IEC 6429 is suggested as a default.<ref name="unicode-23-1">{{cite
== Unicode introduced separators ==
In an attempt to simplify the several [[newline]] characters used in legacy text,{{citation needed|date=November 2014}} Unicode introduces its own newline characters to separate either lines or paragraphs: {{unichar|2028|line separator
Like CR and LF, LS and PS are effectors for text formatting; unlike CR and LF, they are not treated as "control codes" (category {{code|Cc}}) for [[ECMA-35]]/[[ECMA-48]] purposes, and their semantics are instead defined entirely by Unicode itself. They are assigned to ''[[sui generis]]'' [[Unicode character property#General Category|Unicode categories]] {{code|Zl}} and {{code|Zp}} respectively, under the major category {{code|Z}} (separator) used for certain [[whitespace character]]s.
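The difference in category can be illustrated with a minimal sketch, assuming Python's standard <code>unicodedata</code> module and built-in string methods. The sample text and variable names are invented for the example.

```python
import unicodedata

LS = "\u2028"  # LINE SEPARATOR
PS = "\u2029"  # PARAGRAPH SEPARATOR

# CR and LF are control codes (Cc); LS and PS have their own categories.
categories = {label: unicodedata.category(ch)
              for label, ch in [("CR", "\r"), ("LF", "\n"),
                                ("LS", LS), ("PS", PS)]}

# Each of Zl and Zp contains exactly one character.
zl_members = [cp for cp in range(0x110000)
              if unicodedata.category(chr(cp)) == "Zl"]
zp_members = [cp for cp in range(0x110000)
              if unicodedata.category(chr(cp)) == "Zp"]

# str.splitlines() treats both separators as line boundaries;
# splitting on PS alone recovers paragraphs.
text = "first line" + LS + "second line" + PS + "new paragraph"
lines = text.splitlines()
paragraphs = text.split(PS)

print(categories, lines, len(paragraphs))
```

Splitting on PS alone is a simplification for the example: real text may also mark paragraphs with legacy newline sequences.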
Line 26:
== Language tags ==
{{main|Tags (Unicode block)}}
Unicode
These language tag characters would not be displayed themselves. However, they would provide information for text processing or even for the display of other characters. For example, the display of Unihan ideographs might have substituted different glyphs if the language tags indicated Korean than if the tags indicated Japanese. As another example, the tags might have influenced how the decimal digits 0 through 9 were displayed, depending on the language in which they appeared.
The tag characters
Unicode states that "the use of tag characters to represent language tags in a plain text stream is still a deprecated mechanism for conveying language information about text".<ref name="migration" />
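To illustrate the deprecated mechanism only, the following minimal sketch builds and then strips a language tag in Python. It relies on details not stated above, taken from the Tags block itself: the block spans U+E0000–U+E007F, U+E0001 is LANGUAGE TAG, and each tag character sits at offset 0xE0000 from a printable ASCII character. The helper names <code>to_tag_chars</code> and <code>strip_tag_chars</code> are hypothetical.

```python
import unicodedata

TAG_BASE = 0xE0000            # offset between ASCII and the Tags block
LANGUAGE_TAG = "\U000E0001"   # U+E0001 LANGUAGE TAG

def to_tag_chars(ascii_text):
    """Map printable ASCII (U+0020-U+007E) to the matching tag characters."""
    for ch in ascii_text:
        if not 0x20 <= ord(ch) <= 0x7E:
            raise ValueError("only printable ASCII has a tag counterpart")
    return "".join(chr(TAG_BASE + ord(ch)) for ch in ascii_text)

def strip_tag_chars(text):
    """Remove every character in the Tags block (U+E0000-U+E007F)."""
    return "".join(ch for ch in text if not 0xE0000 <= ord(ch) <= 0xE007F)

# A hypothetical tagged string: LANGUAGE TAG, then "ja" as tag characters.
tagged = LANGUAGE_TAG + to_tag_chars("ja") + "日本語"

# Tag characters are invisible format characters (category Cf).
tag_categories = {unicodedata.category(ch) for ch in tagged[:3]}

print(len(tagged), strip_tag_chars(tagged), tag_categories)
```

Because the mechanism is deprecated, a filter such as <code>strip_tag_chars</code> is the more likely practical use. It removes the whole block indiscriminately, which is a simplification for the example.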
== Interlinear annotation ==
Three formatting characters provide support for [[
== Bidirectional text control ==
{{main|
Unicode supports standard bidirectional text without any special characters. In other words, Unicode-conforming software should display right-to-left characters such as Hebrew letters as right-to-left simply from the properties of those characters. Similarly, Unicode handles the mixture of left-to-right text alongside right-to-left text without any special characters. For example, one can quote Arabic (“بسم الله”) (transliterated into English as "Bismillah") right alongside English and the Arabic letters will flow from right-to-left and the Latin letters left-to-right.
However, directionality may not be detected correctly if left-to-right text is quoted at the beginning of a right-to-left paragraph (or ''vice versa''),<ref name="segan"/> and the support for bidirectional text becomes even more complicated when text flowing in opposite directions is embedded hierarchically, for example if an English text quotes an Arabic phrase that in turn quotes an English phrase. Other situations may also complicate this, such as when an author wants the left-to-right characters overridden so that they flow from right-to-left. While these situations are fairly rare, Unicode provides twelve characters
* {{unichar|061C|ARABIC LETTER MARK}}
* {{unichar|200E|LEFT-TO-RIGHT MARK}}
* {{unichar|200F|RIGHT-TO-LEFT MARK}}
* {{unichar|202A|LEFT-TO-RIGHT EMBEDDING}}
* {{unichar|202B|RIGHT-TO-LEFT EMBEDDING}}
* {{unichar|202C|POP DIRECTIONAL FORMATTING}}
* {{unichar|202D|LEFT-TO-RIGHT OVERRIDE}}
* {{unichar|202E|RIGHT-TO-LEFT OVERRIDE}}
* {{unichar|2066|LEFT-TO-RIGHT ISOLATE}}
* {{unichar|2067|RIGHT-TO-LEFT ISOLATE}}
* {{unichar|2068|FIRST STRONG ISOLATE}}
* {{unichar|2069|POP DIRECTIONAL ISOLATE}}
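The characters listed above can be handled programmatically. This is a minimal sketch, assuming Python's standard <code>unicodedata</code> module backed by Unicode 6.3 or later (the isolate characters do not exist in earlier versions). The helper names <code>isolate</code> and <code>strip_bidi_controls</code> are hypothetical.

```python
import unicodedata

# The twelve bidirectional control characters from the list above.
BIDI_CONTROLS = {
    0x061C: "ARABIC LETTER MARK",
    0x200E: "LEFT-TO-RIGHT MARK",
    0x200F: "RIGHT-TO-LEFT MARK",
    0x202A: "LEFT-TO-RIGHT EMBEDDING",
    0x202B: "RIGHT-TO-LEFT EMBEDDING",
    0x202C: "POP DIRECTIONAL FORMATTING",
    0x202D: "LEFT-TO-RIGHT OVERRIDE",
    0x202E: "RIGHT-TO-LEFT OVERRIDE",
    0x2066: "LEFT-TO-RIGHT ISOLATE",
    0x2067: "RIGHT-TO-LEFT ISOLATE",
    0x2068: "FIRST STRONG ISOLATE",
    0x2069: "POP DIRECTIONAL ISOLATE",
}

FSI = "\u2068"  # FIRST STRONG ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE

def isolate(text):
    """Wrap text of unknown direction so it cannot reorder its surroundings."""
    return FSI + text + PDI

def strip_bidi_controls(text):
    """Remove all twelve bidirectional control characters from text."""
    return "".join(ch for ch in text if ord(ch) not in BIDI_CONTROLS)

# Check that the table matches the interpreter's Unicode data.
names_match = all(unicodedata.name(chr(cp)) == name
                  for cp, name in BIDI_CONTROLS.items())

quoted = "He said " + isolate("بسم الله") + " aloud."
print(names_match, strip_bidi_controls(quoted))
```

The <code>isolate</code> helper shows one common use: wrapping inserted text whose direction is not known in advance, so that the first strong character inside the isolate determines its direction.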
== Variation selectors ==
Line 67 ⟶ 76:
{{unicode navigation}}
[[Category:Unicode special code points|Control characters]]