Unicode control characters: Difference between revisions

Update unichar template to match character name from definitive source https://unicode.org/Public/UNIDATA/UnicodeData.txt
m Interlinear annotation: fix misguided link
 
(7 intermediate revisions by 6 users not shown)
Line 17:
* {{tt|U+000D}} {{unichar/name|na=CARRIAGE RETURN (CR)}} (used in some line-breaking conventions)
* {{tt|U+0085}} {{unichar/name|na=NEXT LINE (NEL)}} (sometimes used as a line break in text transcoded from [[EBCDIC]])
Unicode only specifies semantics for {{tt|U+0009&ndash;U+000D}}, {{tt|U+001C&ndash;U+001F}}, and {{tt|U+0085}} (the ASCII format effectors except for {{ctrl|BS}}, plus the ASCII information separators and the C1 {{ctrl|NEL}}). The rest of the "Cc" control codes are transparent to Unicode and their meanings are left to higher-level protocols, although interpretation as defined in ISO/IEC 6429 is suggested as a default.<ref name="unicode-23-1">{{cite web |url=https://www.unicode.org/versions/Unicode12.0.0/ch23.pdf#page=3 |title=23.1: Control Codes |work=The Unicode Standard |edition=12.0.0 |date=2019 |author=Unicode Consortium |author-link=Unicode Consortium |isbn=978-1-936213-22-1 |pages=868–870}}</ref> Furthermore, certain specialised higher-level protocols, such as transcoded [[Teletext]], may include a [[Teletext character set#Control characters|different interpretation]] of the entire C0 control code range.<ref>{{cite web |url=https://corp.unicode.org/pipermail/unicode/2020-October/009120.html |title=Teletext separated mosaic graphics |work=Unicode Mailing List Archive |last=Ewell |first=Doug |date=2020-10-16 |publisher=[[Unicode Consortium]] |quotation=I reiterate that it was UTC {{bracket|[[Unicode Technical Committee]]}} and Script Ad Hoc who provided the guidance to the group writing the [[Symbols for Legacy Computing]] proposal (and there is a second on the way) that 0x00 through 0x1F in the original teletext set should map to U+0000 through U+001F when converting to Unicode.}}</ref>
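
The split described above can be illustrated with a short Python sketch. It relies on the standard <code>unicodedata</code> module. The names <code>SPECIFIED</code> and <code>classify_control</code> are hypothetical, and the set of code points is taken directly from the ranges listed in the preceding paragraph rather than from any Unicode data file.

```python
import unicodedata

# Code points for which, per the paragraph above, Unicode specifies semantics:
# U+0009..U+000D, U+001C..U+001F and U+0085 (assumption: copied from the text).
SPECIFIED = set(range(0x09, 0x0E)) | set(range(0x1C, 0x20)) | {0x85}


def classify_control(ch: str) -> str:
    """Say whether a single character is a "Cc" control code and, if so,
    whether Unicode itself assigns it semantics."""
    if unicodedata.category(ch) != "Cc":
        return "not a control code"
    if ord(ch) in SPECIFIED:
        return "semantics specified by Unicode"
    return "left to higher-level protocols"


if __name__ == "__main__":
    for ch in ("\t", "\x08", "\x85", "A"):
        print(f"U+{ord(ch):04X}: {classify_control(ch)}")
```

Under these assumptions, the backspace {{tt|U+0008}} falls in the "left to higher-level protocols" group, matching the exception noted for {{ctrl|BS}}.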
 
== Unicode introduced separators ==
Line 26:
== Language tags ==
{{main|Tags (Unicode block)}}
Unicode includes 128 tag characters that were originally intended for language tagging, a use that is now deprecated. These characters essentially mirror the 128 ASCII characters but were used to identify the subsequent text as belonging to a particular language according to [[BCP 47]]. For example, to indicate subsequent text as the variant of English as written in the United States, the sequence {{unichar|E0001|LANGUAGE TAG}}, {{unichar|E0065|TAG LATIN SMALL LETTER E}}, {{unichar|E006E|TAG LATIN SMALL LETTER N}}, {{unichar|E002D|TAG HYPHEN-MINUS}}, {{unichar|E0075|TAG LATIN SMALL LETTER U}} and {{unichar|E0073|TAG LATIN SMALL LETTER S}} would have been used.
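
Because each tag character sits at a fixed offset of {{tt|0xE0000}} from its ASCII counterpart, the sequence in the example above can be reproduced mechanically. The following Python sketch is illustrative only, since this mechanism is deprecated for language tagging. The function name <code>to_language_tag</code> is hypothetical, and lowercasing the input is an assumption made to match the example.

```python
TAG_BASE = 0xE0000           # offset between an ASCII character and its tag character
LANGUAGE_TAG = "\U000E0001"  # U+E0001 LANGUAGE TAG


def to_language_tag(bcp47: str) -> str:
    """Encode a BCP 47 tag such as "en-US" as the deprecated tag-character sequence."""
    tag = bcp47.lower()  # assumption: lowercase, as in the example above
    for c in tag:
        # Only printable ASCII (U+0020..U+007E) has a tag-character counterpart here.
        if not 0x20 <= ord(c) <= 0x7E:
            raise ValueError(f"not printable ASCII: {c!r}")
    return LANGUAGE_TAG + "".join(chr(TAG_BASE + ord(c)) for c in tag)


if __name__ == "__main__":
    print(" ".join(f"U+{ord(c):05X}" for c in to_language_tag("en-US")))
```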
 
These language tag characters would not be displayed themselves. However, they would provide information for text processing or even for the display of other characters. For example, the display of Unihan ideographs might have substituted different glyphs if the language tags indicated Korean than if the tags indicated Japanese. As another example, the tags might have influenced how the decimal digits 0 through 9 were displayed, depending on the language in which they appeared.
 
The tag characters {{unichar|E0001|LANGUAGE TAG}} and {{unichar|E007F|CANCEL TAG}} were deprecated in Unicode 5.1 (2008) and should not be used for language information.<ref>{{cite web |url=http://tools.ietf.org/html/rfc6082|title=RFC6082: Deprecating Unicode Language Tag Characters: RFC 2482 is Historic | publisher=Internet Engineering Task Force (IETF)|date=November 2010|last1=Klensin |first1=John C. |last2=Presuhn |first2=Randy |last3=Whistler |first3=Ken |last4=Dürst |first4=Martin J. |last5=Adams |first5=Glenn |editor-first1=R. |editor-last1=Presuhn |doi=10.17487/RFC6082 |doi-access=free }}</ref> The characters {{tt|U+E0020–U+E007E}} were also deprecated, but were restored with the release of Unicode 8.0 (2015). The change was made "to clear the way for the potential future use of tag characters for a purpose other than to represent language tags".<ref name="migration">{{cite web|url=http://unicode.org/versions/Unicode8.0.0/#Migration|title=Unicode 8.0.0, Implications for Migration | publisher=Unicode Consortium}}</ref>
Unicode states that "the use of tag characters to represent language tags in a plain text stream is still a deprecated mechanism for conveying language information about text".<ref name="migration" />
 
== Interlinear annotation ==
Three formatting characters provide support for [[Interlinear gloss|interlinear annotation]] ({{unichar|FFF9|INTERLINEAR ANNOTATION ANCHOR}}, {{unichar|FFFA|INTERLINEAR ANNOTATION SEPARATOR}}, {{unichar|FFFB|INTERLINEAR ANNOTATION TERMINATOR}}). These may be used for providing notes that would typically be displayed between the lines of other text. Unicode considers such annotation to be rich text and recommends using other protocols for it. The W3C [[Ruby character#Ruby markup|Ruby markup]] recommendation is an example of an alternative protocol supporting more advanced interlinear annotation.
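
As a rough illustration, the three characters bracket the annotated text in the order anchor, base text, separator, annotation, terminator. The Python sketch below builds such a sequence and strips it back to the base text, which is one plausible fallback for software that does not render annotations. The helper names <code>annotate</code> and <code>strip_annotations</code> are hypothetical, and the sketch assumes a single separator and no nesting.

```python
import re

ANCHOR = "\uFFF9"      # INTERLINEAR ANNOTATION ANCHOR
SEPARATOR = "\uFFFA"   # INTERLINEAR ANNOTATION SEPARATOR
TERMINATOR = "\uFFFB"  # INTERLINEAR ANNOTATION TERMINATOR


def annotate(base: str, annotation: str) -> str:
    """Wrap base text and its annotation in the three formatting characters."""
    return f"{ANCHOR}{base}{SEPARATOR}{annotation}{TERMINATOR}"


# Simplifying assumption: annotations are not nested and use one separator each.
_ANNOTATED = re.compile("\uFFF9([^\uFFF9-\uFFFB]*)\uFFFA[^\uFFF9-\uFFFB]*\uFFFB")


def strip_annotations(text: str) -> str:
    """Keep only the base text of each annotated run."""
    return _ANNOTATED.sub(r"\1", text)


if __name__ == "__main__":
    sample = "read " + annotate("漢字", "kanji") + " aloud"
    print(strip_annotations(sample))
```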
 
== Bidirectional text control ==
{{main|Bidirectional text}}
Unicode supports standard bidirectional text without any special characters. In other words, Unicode-conforming software should display right-to-left characters such as Hebrew letters as right-to-left simply from the properties of those characters. Similarly, Unicode handles the mixture of left-to-right text alongside right-to-left text without any special characters. For example, one can quote Arabic (“بسم الله”) (translated into English as "Bismillah") right alongside English and the Arabic letters will flow from right to left and the Latin letters from left to right.
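
The character properties referred to above can be inspected with Python's standard <code>unicodedata.bidirectional</code> function, which returns the bidirectional class of a character. The sketch below only looks up that class and does not implement the bidirectional algorithm itself. The helper name <code>strong_direction</code> and its three labels are hypothetical.

```python
import unicodedata


def strong_direction(ch: str) -> str:
    """Map a character's bidirectional class to a coarse direction label."""
    bidi_class = unicodedata.bidirectional(ch)
    if bidi_class in ("R", "AL"):  # Hebrew-style and Arabic-style strong right-to-left
        return "right-to-left"
    if bidi_class == "L":          # strong left-to-right, e.g. Latin letters
        return "left-to-right"
    return "weak or neutral"       # digits, spaces, punctuation and similar


if __name__ == "__main__":
    for ch in ("A", "\u05D0", "\u0628", " "):
        print(f"U+{ord(ch):04X}: {unicodedata.bidirectional(ch)} -> {strong_direction(ch)}")
```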
 
Line 76:
{{unicode navigation}}
 
[[Category:Unicode special code points|Control characters]]