}}
 
'''UTF-EBCDIC''' is a [[character encoding]] capable of encoding all 1,112,064 valid character [[code point]]s in [[Unicode]] using one to five one-[[byte]] (8-bit) code units (in contrast to a maximum of four for [[UTF-8]]).<ref>{{Cite web|title=UTR #16: UTF-EBCDIC|url=https://www.unicode.org/reports/tr16/tr16-8.html|quote=You need to search at most five bytes (seven bytes, if the full range of 31 bits of ISO/IEC 10646 is considered) backwards|access-date=2021-02-23|website=www.unicode.org}}</ref> It is meant to be [[EBCDIC]]-friendly, so that legacy EBCDIC applications on [[Mainframe computer|mainframes]] can process the characters without much difficulty. Its advantages for existing EBCDIC-based systems are similar to UTF-8's advantages for existing [[ASCII]]-based systems. UTF-EBCDIC is defined in Unicode Technical Report #16.
 
To produce the UTF-EBCDIC encoded version of a series of Unicode code points, an encoding based on UTF-8 (known in the specification as UTF-8-Mod) is applied first, creating what the specification calls an I8 sequence. The main difference between this encoding and UTF-8 is that it allows Unicode code points U+0080 through U+009F (the [[C1 control code]]s) to be represented as a single byte, so that they can later be mapped to the corresponding EBCDIC control codes. To achieve this, UTF-8-Mod uses 101XXXXX instead of 10XXXXXX as the format for trailing bytes in a multi-byte sequence. As each trailing byte can hold only 5 bits rather than 6, the UTF-8-Mod encoding of a code point above U+03FF is never shorter, and is often longer, than its UTF-8 encoding (for example, U+0400 through U+07FF take three bytes rather than two).
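
The UTF-8-Mod step can be illustrated with a minimal Python sketch. The function name <code>utf8_mod_encode</code> is hypothetical, and the lead-byte prefixes (110, 1110, 11110, 111110) are an assumption that mirrors UTF-8's scheme; only the 101XXXXX trailing-byte format and the single-byte range up to U+009F come from the description above. The sketch covers only the first step and does not perform the later mapping of I8 bytes to EBCDIC byte values.

```python
def utf8_mod_encode(cp: int) -> bytes:
    """Encode one Unicode code point as an I8 (UTF-8-Mod) byte sequence.

    Hypothetical helper for illustration; the lead-byte prefixes are an
    assumption modelled on UTF-8, not taken from the text above.
    """
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a valid Unicode scalar value: {cp:#x}")

    # U+0000..U+009F (ASCII plus the C1 controls) are stored as one byte.
    if cp <= 0x9F:
        return bytes([cp])

    # (highest code point, lead-byte prefix, number of trailing bytes)
    layouts = (
        (0x3FF, 0xC0, 1),     # 110yyyyy + 1 trailing byte  (10 bits)
        (0x3FFF, 0xE0, 2),    # 1110zzzz + 2 trailing bytes (14 bits)
        (0x3FFFF, 0xF0, 3),   # 11110www + 3 trailing bytes (18 bits)
        (0x10FFFF, 0xF8, 4),  # 111110vv + 4 trailing bytes (22 bits)
    )
    for limit, prefix, trailing in layouts:
        if cp <= limit:
            lead = prefix | (cp >> (5 * trailing))
            # Each trailing byte is 101XXXXX: 0xA0 plus 5 payload bits.
            tail = [0xA0 | ((cp >> (5 * i)) & 0x1F)
                    for i in range(trailing - 1, -1, -1)]
            return bytes([lead] + tail)
    raise AssertionError("unreachable")


if __name__ == "__main__":
    for cp in (0x41, 0x9F, 0xA0, 0x400, 0x4000, 0x10FFFF):
        i8 = utf8_mod_encode(cp)
        utf8 = chr(cp).encode("utf-8")
        print(f"U+{cp:04X}: I8 {i8.hex(' ')} ({len(i8)} bytes), "
              f"UTF-8 {utf8.hex(' ')} ({len(utf8)} bytes)")
```

Running the script prints the I8 and UTF-8 forms side by side, which shows where the 5-bit trailing bytes make the I8 sequence longer and why the highest code points need five bytes.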