{{more citations needed|date=December 2009}}
{{Short description|Type of character encoding scheme}}{{Merge|Variable-length code
| date = February 2025
}}{{About|the storage of text in computers|the transmission of data across noisy channels|variable-length code}}
{{Use dmy dates|date=December 2023}}
A '''variable-width encoding''' is a type of [[character encoding]] scheme in which codes of differing lengths are used to encode a [[character set]] (a repertoire of symbols) for representation, usually in a [[computer]].<ref>{{Cite RFC|last=Crispin|first=M.|date=1 April 2005|title=UTF-9 and UTF-18 Efficient Transformation Formats of Unicode|doi=10.17487/rfc4042|doi-access=}}</ref>{{efn|The concept long precedes the advent of the electronic computer, however, as seen with [[Morse code]].}} The most common variable-width encodings are '''multibyte encodings''' (also known as '''MBCS''' – '''multi-byte character set'''), which use varying numbers of [[byte]]s ([[octet (computing)|octets]]) to encode different characters. (Some authors, notably in [[Microsoft]] documentation, use the term ''multibyte character set,'' which is a [[misnomer]], because representation size is an attribute of the encoding, not of the character set.)
Early variable
Multibyte encodings are usually the result of a need to increase the number of characters which can be encoded without breaking [[backward compatibility]] with an existing constraint. For example, with one byte (8
==General structure==
For example, the four-character string "[[I Love New York|I♥NY]]" is encoded in [[UTF-8]] like this (shown as [[hexadecimal]] byte values): {{mono|49 {{maroon|E2}} {{navy (color)|99}} {{navy (color)|A5}} 4E 59}}. Of the six units in that sequence, 49, 4E, and 59 are singletons (for ''I, N,'' and ''Y''), {{maroon|E2}} is a lead unit, and {{navy (color)|99}} and {{navy (color)|A5}} are trail units. The heart symbol is represented by the combination of the lead unit and the two trail units.
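The example above can be reproduced with a short Python sketch. It encodes the string as UTF-8 and labels each unit by its value range. The helper name <code>classify_utf8_unit</code> is hypothetical and used only for illustration, and the range check is deliberately simplified: it does not reject byte values that never occur in valid UTF-8.

```python
def classify_utf8_unit(byte: int) -> str:
    """Label one UTF-8 code unit by its value range (illustrative helper).

    Simplified: byte values that never occur in valid UTF-8
    (C0, C1, F5 to FF) are still reported as "lead" here.
    """
    if byte <= 0x7F:
        return "singleton"  # 0xxxxxxx: a complete one-byte character
    if byte <= 0xBF:
        return "trail"      # 10xxxxxx: continues a multibyte character
    return "lead"           # 11xxxxxx: starts a multibyte character


text = "I\u2665NY"  # the four-character string "I♥NY"
encoded = text.encode("utf-8")

# Print each unit as a hexadecimal byte value with its label.
for byte in encoded:
    print(f"{byte:02X} {classify_utf8_unit(byte)}")
```

Running the sketch should list six units: three singletons, one lead unit, and two trail units, matching the breakdown given above.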
UTF-8 makes it easy for a program to identify the three sorts of units, since they fall into separate value ranges. Older variable-width encodings are typically not as well
==CJK multibyte encodings==
* [[Lotus Multi-Byte Character Set]] (LMBCS)
* [[Triple-Byte Character Set]] (TBCS)
* [[Double-byte character set|Double-Byte Character Set]] (DBCS)
* [[SBCS|Single-Byte Character Set]] (SBCS)