Unicode control characters: Difference between revisions

In the narrowest sense, a control character is one with the [[Unicode character property#General Category|general category]] {{code|Cc}}, which comprises the [[C0 and C1 control codes]], a concept defined in [[ISO/IEC 2022]] and inherited by Unicode, with the most common set being defined in [[ISO/IEC 6429]]. In a broader sense, other non-printing format characters, such as those used in [[bidirectional text]], are also referred to as control characters.<ref>{{cite web |url=http://kvota.net/guadec/localised-desktop-talk/ |title=Towards a localised desktop |quotation=For some cases where automatic decision making doesn't work, you can manually add specific direction markers by right-clicking the text field, choosing "Insert Unicode control character" from the menu, and selecting appropriate direction mark. This would allow you, for instance, to start your RTL text with an otherwise LTR word (such as "GNOME"). |first=Danilo |last=Segan}}</ref>
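The distinction between the narrow and broad senses can be illustrated with a short Python sketch that looks up general categories through the standard `unicodedata` module. The helper name `is_control_narrow` and the sample characters are chosen here for illustration only; they do not come from the cited sources.

```python
import unicodedata


def is_control_narrow(ch: str) -> bool:
    """Return True if ch is a control character in the narrow sense (category Cc)."""
    return unicodedata.category(ch) == "Cc"


# Illustrative samples: a C0 control, a bidirectional format character, a letter.
samples = {
    "LINE FEED": "\u000A",
    "RIGHT-TO-LEFT MARK": "\u200F",
    "LATIN CAPITAL LETTER A": "A",
}

for label, ch in samples.items():
    # Print the code point, a label, and the two-letter general category.
    print(f"U+{ord(ch):04X} {label}: {unicodedata.category(ch)}")
```

In this sketch the right-to-left mark reports category `Cf` (format) rather than `Cc`, so it counts as a control character only in the broader sense.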
 
== Category "Cc" control codes (C0 and C1) ==
{{main|C0 and C1 control codes}}
The control code ranges 0x00–0x1F ("C0") and 0x7F originate from the 1967 edition of [[US-ASCII]]. The standard [[ISO/IEC 2022]] (ECMA-35) defines extension methods for ASCII, including a secondary "C1" range of 8-bit control codes from 0x80 to 0x9F, each equivalent to a 7-bit two-byte sequence consisting of {{ctrl|ESC}} followed by a byte from 0x40 through 0x5F. Collectively, codes in these ranges are known as the [[C0 and C1 control codes]]. Although ISO/IEC 2022 allows for multiple control code sets that assign differing interpretations to these codes, their most common interpretation is specified in [[ISO/IEC 6429]] (ECMA-48).
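The correspondence between an 8-bit C1 code and its 7-bit escape sequence amounts to a fixed offset of 0x40. The following Python sketch shows that arithmetic; the function names `c1_to_7bit` and `seven_bit_to_c1` are hypothetical helpers introduced here, not part of any standard or library.

```python
ESC = 0x1B  # the C0 ESCAPE control code


def c1_to_7bit(code: int) -> bytes:
    """Map an 8-bit C1 code (0x80-0x9F) to its two-byte ESC sequence."""
    if not 0x80 <= code <= 0x9F:
        raise ValueError(f"not a C1 control code: {code:#04x}")
    # The second byte falls in 0x40-0x5F.
    return bytes([ESC, code - 0x40])


def seven_bit_to_c1(seq: bytes) -> int:
    """Map a two-byte ESC sequence back to the 8-bit C1 code."""
    if len(seq) != 2 or seq[0] != ESC or not 0x40 <= seq[1] <= 0x5F:
        raise ValueError(f"not a 7-bit C1 escape sequence: {seq!r}")
    return seq[1] + 0x40


# Example: 0x9B (CSI in ISO/IEC 6429) corresponds to ESC followed by "[".
print(c1_to_7bit(0x9B))
```

Under this mapping, 0x80 corresponds to ESC 0x40 and 0x9F to ESC 0x5F, matching the byte range given above.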
 
The [[ISO/IEC 8859]] series of encodings conforms to [[ISO/IEC 4873]] (ECMA-43) level 1, a subset of ISO/IEC 2022 designed for 8-bit character encodings, and therefore reserves space for use by a C1 control code set such as ISO/IEC 6429. Unicode inherits its [[Basic Latin (Unicode block)|first]] and [[Latin-1 Supplement (Unicode block)|second]] blocks (comprising U+0000 through U+00FF) from ASCII and [[ISO/IEC 8859-1]], thus incorporating the C0 and C1 control code ranges (U+0000&ndash;U+001F, U+007F&ndash;U+009F).
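The ranges stated above can be checked against the Unicode data bundled with Python. This is a minimal sketch, assuming the `unicodedata` module reflects the current character database; the helper `contiguous_ranges` is a hypothetical name introduced for illustration.

```python
import unicodedata


def contiguous_ranges(code_points):
    """Collapse a sorted list of integers into (first, last) inclusive ranges."""
    ranges = []
    for cp in code_points:
        if ranges and cp == ranges[-1][1] + 1:
            # Extend the current range by one code point.
            ranges[-1] = (ranges[-1][0], cp)
        else:
            # Start a new range.
            ranges.append((cp, cp))
    return ranges


# Scan the whole code space for characters with general category Cc.
cc_code_points = [
    cp for cp in range(0x110000) if unicodedata.category(chr(cp)) == "Cc"
]

for first, last in contiguous_ranges(cc_code_points):
    print(f"U+{first:04X}-U+{last:04X}")
print(len(cc_code_points), "code points in category Cc")
```

The scan covers the entire code space rather than only the first two blocks, so it also confirms that no `Cc` characters occur outside U+0000 through U+00FF.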
 
Most of these characters play no explicit role in Unicode text handling, and are used only by higher-level protocols such as those used by [[terminal emulator]]s. The characters {{unichar|0000|note=NUL}}, {{unichar|0009|Horizontal tabulation|nlink=tab key|note=HT}}, {{unichar|000A|Line feed|nlink=newline|note=LF}}, {{unichar|000D|carriage return|note=CR}}, and {{unichar|0085|NEL|note=NEL}} are commonly used in text processing as formatting characters. Unicode specifies semantics only for U+0009&ndash;U+000D, U+001C&ndash;U+001F, and U+0085. The rest of the control characters are transparent to Unicode and their meanings are left to higher-level protocols, although interpretation as defined in ISO/IEC 6429 is suggested as a default.<ref name="unicode-23-1">{{cite book |url=https://www.unicode.org/versions/Unicode12.0.0/ch23.pdf#page=3 |title=23.1: Control Codes |work=The Unicode Standard |edition=12.0.0 |date=2019 |author=Unicode Consortium |author-link=Unicode Consortium |isbn=978-1-936213-22-1 |pages=868–870}}</ref> Furthermore, certain specialised higher-level protocols, such as transcoded [[Teletext]], may include a [[Teletext character set#Control characters|different interpretation]] of the entire C0 control code range.<ref>{{cite web |url=https://corp.unicode.org/pipermail/unicode/2020-October/009120.html |title=Teletext separated mosaic graphics |work=Unicode Mailing List Archive |last=Ewell |first=Doug |date=2020-10-16 |publisher=[[Unicode Consortium]] |quotation=I reiterate that it was UTC {{bracket|[[Unicode Technical Committee]]}} and Script Ad Hoc who provided the guidance to the group writing the [[Symbols for Legacy Computing]] proposal (and there is a second on the way) that 0x00 through 0x1F in the original teletext set should map to U+0000 through U+001F when converting to Unicode.}}</ref>
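One place where the specified semantics become visible is the bidirectional class recorded in the Unicode Character Database. The sketch below is an observation rather than a definition from the standard: it assumes the data bundled with Python's `unicodedata` module, and it uses "bidirectional class other than BN (Boundary Neutral)" as a rough proxy for "has Unicode-assigned semantics". The variable names are illustrative.

```python
import unicodedata

# All C0 and C1 control codes lie below U+0100.
cc_code_points = [
    cp for cp in range(0x100) if unicodedata.category(chr(cp)) == "Cc"
]

# Controls whose bidirectional class is a separator or whitespace class
# (B, S, or WS) rather than BN (Boundary Neutral).
with_bidi_role = [
    cp for cp in cc_code_points if unicodedata.bidirectional(chr(cp)) != "BN"
]

for cp in with_bidi_role:
    print(f"U+{cp:04X} bidi class {unicodedata.bidirectional(chr(cp))}")
```

With the data shipped in current Python releases, the resulting list coincides with U+0009 through U+000D, U+001C through U+001F, and U+0085; every other control code is classed as Boundary Neutral.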
 
== Unicode introduced separators ==