Unicode control characters: Difference between revisions

Revision as of 07:46, 21 December 2020 edit HarJIT (talk \| contribs) Extended confirmed users 12,434 edits →Category "Cc" control codes (C0 and C1) Tags: Mobile edit Mobile web edit Advanced mobile edit ← Previous edit		Latest revision as of 10:06, 29 May 2025 edit undo Michael Scheffenacker (talk \| contribs) 10 edits m →Interlinear annotation: fix misguided link
(33 intermediate revisions by 14 users not shown)
Line 1: {{short description\|Non-printing format effectors and control codes included in Unicode}} Many ~~'''~~[[Unicode]] ~~control~~ characters~~'''~~ are used to control the interpretation or display of text, but these characters themselves have no visual or spatial representation. For example, the [[null character]] ({{~~unichar~~tt\|U+0000~~\|NULL\|nlink~~}} <span style="font-size:85%;">[[control characters}}\|NULL]]</span>) is used in C-programming application environments to indicate the end of a string of characters. In this way, these programs only require a single starting memory address for a [[String (computer science)\|string]] (as opposed to a starting address and a length), since the string ends once the program reads the null character. In the narrowest sense, a ''control ~~character~~code'' is ~~one~~a character with the [[Unicode character property#General Category\|general category]] {{code\|Cc}}, which comprises the [[C0 and C1 control codes]], a concept defined in [[ISO/IEC 2022]] and inherited by Unicode, with the most common set being defined in [[ISO/IEC 6429]]. 
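The single-address convention described above can be illustrated with a short Python sketch. This is not C code and is not drawn from the cited sources; `read_c_string` and the sample buffer are hypothetical names, and a `bytes` object stands in for raw memory.

```python
def read_c_string(memory: bytes, start: int) -> bytes:
    """Return the bytes from `start` up to, but not including, the first NUL."""
    # Only the starting offset is supplied; the end is discovered by
    # scanning for the null character (byte value 0).
    end = memory.index(0, start)  # raises ValueError if no terminator exists
    return memory[start:end]


# Two strings packed back to back, each ended by a null character.
memory = b"hello\x00world\x00"
first = read_c_string(memory, 0)
second = read_c_string(memory, 6)
print(first, second)
```

Because no length is stored, a string in this scheme cannot itself contain U+0000, which is one reason the character is treated as a sentinel rather than as text.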
Control codes are handled distinctly from ordinary Unicode characters, for example, by not being assigned character names (although they are assigned normative formal aliases).<ref name="aliases">{{cite web \|url=https://www.unicode.org/Public/UCD/latest/ucd/NameAliases.txt \|title=Name Aliases \|work=Unicode Character Database \|institution=[[Unicode Consortium]]}}</ref> In a broader sense, other non-printing format characters, such as those used in [[bidirectional text]], are also referred to as ''control characters'' by software;<ref name="segan">{{cite web \|url=http://kvota.net/guadec/localised-desktop-talk/ \|title=Towards a localised desktop \|quotation=For some cases where automatic decision making doesn't work, you can manually add specific direction markers by right-clicking the text field, choosing "Insert Unicode control character" from the menu, and selecting appropriate direction mark. This would allow you, for instance, to start your RTL text with an otherwise LTR word (such as "GNOME"). \|first=Danilo \|last=Segan}}</ref> these are mostly assigned to the general category {{code\|Cf}} (format), used for format effectors introduced and defined by Unicode itself. == Category "Cc" control codes (C0 and C1) == Line 8: The control code ranges 0x00–0x1F ("C0") and 0x7F originate from the 1967 edition of [[US-ASCII]]. The standard [[ISO/IEC 2022]] (ECMA-35) defines extension methods for ASCII, including a secondary "C1" range of 8-bit control codes from 0x80 to 0x9F, equivalent to 7-bit sequences of {{ctrl\|ESC}} with the bytes 0x40 through 0x5F. Collectively, codes in these ranges are known as the [[C0 and C1 control codes]]. Although ISO/IEC 2022 allows for the existence of multiple control code sets specifying differing interpretations of these control codes, their most common interpretation is specified in [[ISO/IEC 6429]] (ECMA-48). 
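The stated equivalence between 8-bit C1 codes (0x80 to 0x9F) and 7-bit sequences of ESC followed by a byte from 0x40 to 0x5F amounts to an offset of 0x40. The following Python sketch shows that arithmetic only; the function names are hypothetical and it does not implement the rest of ISO/IEC 2022.

```python
ESC = 0x1B  # the C0 escape control code


def c1_to_escape_sequence(byte: int) -> bytes:
    """Map an 8-bit C1 control code to its 7-bit ESC-prefixed form."""
    if not 0x80 <= byte <= 0x9F:
        raise ValueError("not a C1 control code")
    return bytes([ESC, byte - 0x40])


def escape_sequence_to_c1(seq: bytes) -> int:
    """Map a two-byte ESC sequence (final byte 0x40-0x5F) back to a C1 code."""
    if len(seq) != 2 or seq[0] != ESC or not 0x40 <= seq[1] <= 0x5F:
        raise ValueError("not a 7-bit C1 escape sequence")
    return seq[1] + 0x40


nel_7bit = c1_to_escape_sequence(0x85)  # NEL becomes ESC followed by "E"
csi_7bit = c1_to_escape_sequence(0x9B)  # CSI becomes ESC followed by "["
print(nel_7bit, csi_7bit)
```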
The [[ISO/IEC 8859]] series of encodings conforms to [[ISO/IEC 4873]] (ECMA-43) level 1, a subset of ISO/IEC 2022 designed for 8-bit character encodings, and therefore ~~designates~~reserves the range 0x80–0x9F for use ~~by a~~as non-printing codes by C1 control code ~~set~~sets such as ISO/IEC 6429.<ref>{{citation\|mode=cs1 \|quotation=This set of coded graphic characters may be regarded as a version of an 8-bit code according to ISO/IEC 2022 or ISO/IEC 4873 at level 1. […] The shaded positions in the code table correspond to bit combinations that do not represent graphic characters. Their use is outside the scope of ISO/IEC 8859; it is specified in other International Standards, for example ISO/IEC 6429. \|url=http://www.open-std.org/JTC1/sc2/wg3/docs/n411.pdf \|title=Final Text of DIS 8859-1, 8-bit single-byte coded graphic character sets—Part 1: Latin alphabet No.1 \|author=ISO/IEC JTC 1/SC 2/WG 3 \|author-link=ISO/IEC JTC 1/SC 2 \|id=[[ISO]]/[[International Electrotechnical Commission\|IEC]] [[International Organization for Standardization#Standardization process\|FDIS]] 8859-1:1998; JTC1/SC2/N2988; WG3/N411 \|date=1998-02-12}}</ref> Unicode inherits its [[Basic Latin (Unicode block)\|first]] and [[Latin-1 Supplement (Unicode block)\|second]] blocks (comprising U+0000 through U+00FF) from ASCII and [[ISO/IEC 8859-1]], thus incorporating the C0 and C1 control code ranges (U+0000–U+001F, U+007F–U+009F) as general category "Cc". 
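The extent of general category "Cc" can be checked against a Unicode Character Database implementation. The sketch below uses Python's standard `unicodedata` module, which is an assumption of this example rather than something the article relies on; the variable names are illustrative.

```python
import sys
import unicodedata

# Collect every code point whose general category is "Cc".
cc_code_points = [
    cp for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) == "Cc"
]

# The C0 range plus DEL, followed directly by the C1 range.
expected = list(range(0x00, 0x20)) + list(range(0x7F, 0xA0))
matches_ranges = cc_code_points == expected

# Control codes have no character name, so name() falls back to the default.
null_name = unicodedata.name("\x00", None)

print(len(cc_code_points), matches_ranges, null_name)
```

With the Unicode data shipped in current Python releases, this yields 65 code points, matching the two ranges given above.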
It does not assign normative names to these control codes, though it does assign them normative aliases.<ref name="aliases" /> Category "Cc" control codes can serve a variety of purposes, not limited to format effectors: for example, the default ASCII C0 set includes six format effectors ({{ctrl\|BS}}, {{ctrl\|HT}}, {{ctrl\|LF}}, {{ctrl\|VT}}, {{ctrl\|FF}} and {{ctrl\|CR}}), ten transmission controls, four device controls, four information separators and eight other control codes.<ref name="ir001">{{cite iso-ir \|sponsor=ISO/TC 97/SC 2 \|sponsor-link=ISO/IEC JTC 1/SC 2#History \|title=The set of control characters of the ISO 646 \|date=1975 \|number=1}}</ref> Most of these characters play no explicit role in Unicode text handling, and are used only by higher-level protocols such as those used by [[terminal emulator]]s. Certain characters are commonly used for formatting or [[sentinel value\|sentinel]] purposes: Most of these characters play no explicit role in Unicode text handling, and are used only by higher-level protocols such as those used by [[terminal emulator]]s. The characters {{unichar\|0000\|note=NUL}}, {{unichar\|0009\|Horizontal tabulation\|nlink=tab key\|note=HT}}, {{unichar\|000A\|Line feed\|nlink=newline\|note=LF}}, {{unichar\|000D\|carriage return\|note=CR}}, and {{unichar\|0085\|NEL\|note=NEL}} are commonly used in text processing as formatting characters. Unicode only specifies semantics for U+0009—U+000D, U+001C—U+001F, and U+0085. 
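As a rough check of which "Cc" code points carry Unicode-defined semantics, the following Python sketch filters the C0 and C1 ranges with `str.isspace()`. Using `isspace()` as a proxy is an assumption of this example: Python documents it as being based on the general category `Zs` and the bidirectional classes WS, B and S, not as a statement of the Unicode Standard's wording.

```python
# All category "Cc" code points: C0, DEL and C1.
cc_code_points = list(range(0x00, 0x20)) + list(range(0x7F, 0xA0))

# Keep only those that Python treats as whitespace.
with_semantics = [cp for cp in cc_code_points if chr(cp).isspace()]

print([f"U+{cp:04X}" for cp in with_semantics])
```

The result covers U+0009 to U+000D, U+001C to U+001F, and U+0085, which are the same three ranges named above.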
The rest of the control characters are transparent to Unicode and their meanings are left to higher-level protocols, although interpretation as defined in ISO/IEC 6429 is suggested as a default.<ref name="unicode-23-1">{{cite book \|url=https://www.unicode.org/versions/Unicode12.0.0/ch23.pdf#page=3 \|title=23.1: Control Codes \|work=The Unicode Standard \|edition=12.0.0 \|date=2019 \|author=Unicode Consortium \|author-link=Unicode Consortium \|isbn=978-1-936213-22-1 \|pages=868–870}}</ref> Furthermore, certain specialised higher-level protocols, such as transcoded [[Teletext]], may include a [[Teletext character set#Control characters\|different interpretation]] of the entire C0 control code range.<ref>{{cite web \|url=https://corp.unicode.org/pipermail/unicode/2020-October/009120.html \|title=Teletext separated mosaic graphics \|work=Unicode Mailing List Archive \|last=Ewell \|first=Doug \|date=2020-10-16 \|publisher=[[Unicode Consortium]] \|quotation=I reiterate that it was UTC {{bracket\|[[Unicode Technical Committee]]}} and Script Ad Hoc who provided the guidance to the group writing the [[Symbols for Legacy Computing]] proposal (and there is a second on the way) that 0x00 through 0x1F in the original teletext set should map to U+0000 through U+001F when converting to Unicode.}}</ref>

* {{tt\|U+0000}} {{unichar/name\|na=NULL}} (used in [[null-terminated string]]s)
* {{tt\|U+0009}} {{unichar/name\|na=HORIZONTAL TABULATION (HT)}} (inserted by the [[tab key]])
* {{tt\|U+000A}} {{unichar/name\|na=LINE FEED (LF)}} (used as a [[newline\|line break]])
* {{tt\|U+000C}} {{unichar/name\|na=FORM FEED (FF)}} (denotes a [[page break]] in a plain text file)
* {{tt\|U+000D}} {{unichar/name\|na=CARRIAGE RETURN (CR)}} (used in some line-breaking conventions)
* {{tt\|U+0085}} {{unichar/name\|na=NEXT LINE (NEL)}} (sometimes used as a line break in text transcoded from [[EBCDIC]])

Unicode only specifies semantics for {{tt\|U+0009–U+000D}}, {{tt\|U+001C–U+001F}}, and {{tt\|U+0085}} (the ASCII format effectors except for {{ctrl\|BS}}, plus the ASCII information separators and the C1 {{ctrl\|NEL}}). The rest of the "Cc" control ~~characters~~codes are transparent to Unicode and their meanings are left to higher-level protocols, although interpretation as defined in ISO/IEC 6429 is suggested as a default.<ref name="unicode-23-1">{{cite ~~book~~web \|url=https://www.unicode.org/versions/Unicode12.0.0/ch23.pdf#page=3 \|title=23.1: Control Codes \|work=The Unicode Standard \|edition=12.0.0 \|date=2019 \|author=Unicode Consortium \|author-link=Unicode Consortium \|isbn=978-1-936213-22-1 \|pages=868–870}}</ref> Furthermore, certain specialised higher-level protocols, such as transcoded [[Teletext]], may include a [[Teletext character set#Control characters\|different interpretation]] of the entire C0 control code range.<ref>{{cite web \|url=https://corp.unicode.org/pipermail/unicode/2020-October/009120.html \|title=Teletext separated mosaic graphics \|work=Unicode Mailing List Archive \|last=Ewell \|first=Doug \|date=2020-10-16 \|publisher=[[Unicode Consortium]] \|quotation=I reiterate that it was UTC {{bracket\|[[Unicode Technical Committee]]}} and Script Ad Hoc who provided the guidance to the group writing the [[Symbols for Legacy Computing]] proposal (and there is a second on the way) that 0x00 through 0x1F in the original teletext set should map to U+0000 through U+001F when 
converting to Unicode.}}</ref>

== Unicode introduced separators ==

In an attempt to simplify the several [[newline]] characters used in legacy text{{citation needed\|date=November 2014}}, Unicode introduces its own newline characters to separate either lines or paragraphs: {{unichar\|2028\|line separator~~\|html=~~}} (abbreviated ~~{{ctrl\|LS}}~~LS or ~~{{ctrl\|LSEP}}~~LSEP) and {{unichar\|2029\|paragraph separator~~\|html=~~}} (abbreviated ~~{{ctrl\|PS}}~~PS or ~~{{ctrl\|PSEP}}~~PSEP). Like CR and LF, LS and PS are effectors for text formatting; unlike CR and LF, they are not treated as "control codes" for [[ECMA-35]]/[[ECMA-48]] purposes (category {{code\|Cc}}), rather having semantics defined entirely by Unicode itself. They are assigned to ''[[sui generis]]'' [[Unicode character property#General Category\|Unicode categories]] {{code\|Zl}} and {{code\|Zp}} respectively, under the major category {{code\|Z}} (separator) used for certain [[whitespace character]]s.

Line 19 ⟶ 26:

== Language tags ==

{{main\|Tags (Unicode block)}}

Unicode ~~previously included~~includes 128 characters, now deprecated, ~~for~~previously intended as language tags. These characters essentially mirrored the 128 ASCII characters but were used to identify the subsequent text as belonging to a particular language according to [[BCP 47]]. For example, to indicate subsequent text as the variant of English as written in the United States, the ~~initiating ‘Language Tag character’ (U+E0001) followed by the sequence ‘Tag Latin Small Letter e’ (U+E0065), ‘Tag Latin Small Letter n’ (U+E006E), ‘Tag Hyphen-minus’ (U+E002D), ‘Tag Latin Small Letter u’ (U+E0075) and ‘Tag Latin Small Letter s’ (U+E0073)~~sequence {{unichar\|E0001\|LANGUAGE TAG}}, {{unichar\|E0065\|Tag Latin Small Letter e}}, {{unichar\|E006E\|Tag Latin Small Letter n}}, {{unichar\|E002D\|Tag Hyphen-minus}}, {{unichar\|E0075\|Tag Latin Small Letter u}} and {{unichar\|E0073\|Tag Latin Small Letter s}} would have been used. These language tag characters would not be displayed themselves. 
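The mirroring of ASCII in the Tags block means the sequence above can be derived by adding 0xE0000 to each ASCII code point and prefixing U+E0001. The Python sketch below is purely illustrative of this deprecated mechanism; `to_language_tag` and the constant names are hypothetical.

```python
LANGUAGE_TAG = 0xE0001  # U+E0001 LANGUAGE TAG
TAG_OFFSET = 0xE0000    # tag characters mirror ASCII at this offset


def to_language_tag(tag: str) -> str:
    """Encode an ASCII language tag as LANGUAGE TAG plus tag characters."""
    if not all(0x20 <= ord(c) <= 0x7E for c in tag):
        raise ValueError("only printable ASCII can be mirrored as tag characters")
    return chr(LANGUAGE_TAG) + "".join(chr(TAG_OFFSET + ord(c)) for c in tag)


sequence = to_language_tag("en-us")
code_points = [f"U+{ord(c):04X}" for c in sequence]
print(code_points)
```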
However, they would provide information for text processing or even for the display of other characters. For example, Unihan ideographs might have been displayed with different glyphs if the language tags indicated Korean rather than Japanese. As another example, language tags might have influenced how the decimal digits 0 through 9 were displayed, depending on the language in which they appeared.

The tag characters ~~U+E0001, U+E0020–U+E007E, and U+E007F~~{{unichar\|E0001\|LANGUAGE TAG}} and {{unichar\|E007F\|CANCEL TAG}} were deprecated in Unicode 5.1 (2008) and should not be used for language information.<ref>{{cite web \|url=http://tools.ietf.org/html/rfc6082\|title=RFC6082: Deprecating Unicode Language Tag Characters: RFC 2482 is Historic \| publisher=Internet Engineering Task Force (IETF)\|date=November 2010\|last1=Klensin \|first1=John C. \|last2=Presuhn \|first2=Randy \|last3=Whistler \|first3=Ken \|last4=Dürst \|first4=Martin J. \|last5=Adams \|first5=Glenn \|editor-first1=R. \|editor-last1=Presuhn \|doi=10.17487/RFC6082 \|doi-access=free }}</ref> The characters {{tt\|U+E0020–U+E007E}} were also deprecated, but were restored with the release of Unicode 8.0 (2015). 
~~With the release of Unicode 8.0 (2015), U+E0020–U+E007E are no longer deprecated characters.~~ ~~(U+E0001 LANGUAGE TAG and U+E007F CANCEL TAG remain deprecated.)~~ The change was made "to clear the way for the potential future use of tag characters for a purpose other than to represent language tags".<ref name="migration">{{cite web\|url=http://unicode.org/versions/Unicode8.0.0/#Migration\|title=Unicode 8.0.0, Implications for Migration \| publisher=Unicode Consortium}}</ref> Unicode states that "the use of tag characters to represent language tags in a plain text stream is still a deprecated mechanism for conveying language information about text".<ref name="migration" />

== Interlinear annotation ==

Three formatting characters provide support for [[~~Ruby text~~Interlinear gloss\|interlinear annotation]] ({{unichar\|FFF9\|INTERLINEAR ANNOTATION ANCHOR}}, {{unichar\|FFFA\|INTERLINEAR ANNOTATION SEPARATOR}}, {{unichar\|FFFB\|INTERLINEAR ANNOTATION TERMINATOR}}). These may be used to provide notes that would typically be displayed between the lines of other text. Unicode considers such annotation to be rich text and recommends using other protocols for it. The W3C [[Ruby character#Ruby markup\|Ruby markup]] recommendation is an example of an alternative protocol supporting more advanced interlinear annotation.

== Bidirectional text control ==

{{main\|~~Bi-directional~~Bidirectional text}}

Unicode supports standard bidirectional text without any special characters. In other words, Unicode-conforming software should display right-to-left characters such as Hebrew letters as right-to-left simply from the properties of those characters. 
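The per-character property referred to here is the bidirectional class. As a minimal sketch, assuming Python's standard `unicodedata` module as the source of character data, the classes of a few sample characters can be inspected directly; the sample labels are illustrative only.

```python
import unicodedata

samples = {
    "Latin small letter a": "a",
    "Hebrew letter alef": "\u05D0",
    "Arabic letter beh": "\u0628",
    "Digit one": "1",
}

# Map each sample to its bidirectional class from the character database.
bidi_classes = {label: unicodedata.bidirectional(ch) for label, ch in samples.items()}

for label, bidi_class in bidi_classes.items():
    print(f"{label}: {bidi_class}")
```

Here `L` denotes a strong left-to-right character, `R` and `AL` strong right-to-left characters (Hebrew and Arabic respectively), and `EN` a European number, whose direction is resolved from its context.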
Similarly, Unicode handles the mixture of left-to-right text alongside right-to-left text without any special characters. For example, one can quote Arabic (“بسم الله”) (translated into English as "Bismillah") right alongside English, and the Arabic letters will flow from right to left and the Latin letters from left to right. However, directionality may not be detected correctly if left-to-right text is quoted at the beginning of a right-to-left paragraph (or ''vice versa''),<ref name="segan"/> and the support for bidirectional text becomes even more complicated when text flowing in opposite directions is embedded hierarchically, for example if ~~one~~an English text quotes an Arabic phrase that in turn quotes an English phrase. Other situations may also complicate this, such as when an author wants the left-to-right characters overridden so that they flow from right to left. While these situations are fairly rare, Unicode provides twelve characters ~~(U+061C, U+200E, U+200F, U+202A, U+202B, U+202C, U+202D, U+202E, U+2066, U+2067, U+2068, U+2069)~~ to help control these embedded bidirectional text levels up to 125 levels deep:<ref>{{Cite web\|url=http://unicode.org/reports/tr9/\|title=UAX #9: Unicode Bidirectional Algorithm\|publisher=Unicode Consortium\|date=2018-05-09}}</ref>

* {{unichar\|061C\|ARABIC LETTER MARK}}
* {{unichar\|200E\|LEFT-TO-RIGHT MARK}}
* {{unichar\|200F\|RIGHT-TO-LEFT MARK}}
* {{unichar\|202A\|LEFT-TO-RIGHT EMBEDDING}}
* {{unichar\|202B\|RIGHT-TO-LEFT EMBEDDING}}
* {{unichar\|202C\|POP DIRECTIONAL FORMATTING}}
* {{unichar\|202D\|LEFT-TO-RIGHT OVERRIDE}}
* {{unichar\|202E\|RIGHT-TO-LEFT OVERRIDE}}
* {{unichar\|2066\|LEFT-TO-RIGHT ISOLATE}}
* {{unichar\|2067\|RIGHT-TO-LEFT ISOLATE}}
* {{unichar\|2068\|FIRST STRONG ISOLATE}}
* {{unichar\|2069\|POP DIRECTIONAL ISOLATE}}

== Variation selectors ==

Line 58 ⟶ 76:

{{unicode navigation}}

[[Category:Unicode special code points\|Control characters]]