Variable-width encoding: Difference between revisions

{{more citations needed|date=December 2009}}
{{Short description|Type of character encoding scheme}}{{Merge|Variable-length code
| date = February 2025
}}{{About|the storage of text in computers|the transmission of data across noisy channels|variable-length code}}
{{Use dmy dates|date=December 2023}}
A '''variable-width encoding''' is a type of [[character encoding]] scheme in which codes of differing lengths are used to encode a [[character set]] (a repertoire of symbols) for representation, usually in a [[computer]].<ref>{{Cite RFC|last=Crispin|first=M.|date=1 April 2005|title=UTF-9 and UTF-18 Efficient Transformation Formats of Unicode|doi=10.17487/rfc4042|doi-access=}}</ref>{{efn|The concept long precedes the advent of the electronic computer, however, as seen with [[Morse code]].}} Most common variable-width encodings are '''multibyte encodings''' (aka '''MBCS''' – '''multi-byte character set'''), which use varying numbers of [[byte]]s ([[octet (computing)|octets]]) to encode different characters. (Some authors, notably in [[Microsoft]] documentation, use the term ''multibyte character set,'' which is a [[misnomer]], because representation size is an attribute of the encoding, not of the character set.)
 
Early variable-width encodings using less than a byte per character were sometimes used to pack English text into fewer bytes in [[adventure game]]s for early [[microcomputer]]s. However, [[disk storage|disks]] (which, unlike tapes, allow random access, so text can be loaded on demand), increases in computer memory, and general-purpose [[compression algorithm]]s have rendered such tricks largely obsolete.
 
Multibyte encodings are usually the result of a need to increase the number of characters that can be encoded without breaking [[backward compatibility]] with an existing constraint. For example, with one byte (8&nbsp;bits) per character, one can encode 256 possible characters. To encode more than 256 characters, the obvious choice would be to use two or more bytes per encoding unit; two bytes (16&nbsp;bits) would allow 65,536 possible characters. Such a change, however, would break compatibility with existing systems and therefore might not be feasible at all.{{efn|As a real-life example of this, [[UTF-16]], which represents the most common characters in exactly the manner just described (and uses pairs of 16-bit code units for less-common characters), never gained traction as an encoding for text intended for interchange due to its incompatibility with the ubiquitous 7-/8-bit [[ASCII]] encoding, with its intended role instead being taken by [[UTF-8]], which ''does'' preserve ASCII compatibility.}}
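
The arithmetic and the compatibility trade-off described above can be illustrated with a minimal Python sketch. It relies only on the standard <code>str.encode</code> method; the helper name <code>encoded_lengths</code> and the sample string are assumptions chosen for illustration and do not come from any cited source.

```python
# Illustrative sketch: the helper name and the sample text are hypothetical.

# Number of distinct values available with one and with two bytes.
one_byte_values = 2 ** 8     # 256
two_byte_values = 2 ** 16    # 65,536


def encoded_lengths(text, encoding):
    """Return the number of bytes each character of `text` occupies in `encoding`."""
    return [len(ch.encode(encoding)) for ch in text]


# U+0041 (A), U+00E9 (e with acute), U+20AC (euro sign), U+1F600 (an emoji).
sample = "A\u00e9\u20ac\U0001F600"

# UTF-8 is variable-width: from one to four bytes per character.
utf8_lengths = encoded_lengths(sample, "utf-8")

# UTF-16 is also variable-width: one or two 16-bit code units per character.
utf16_lengths = encoded_lengths(sample, "utf-16-le")

# Backward compatibility: pure ASCII text is byte-for-byte identical in UTF-8,
# but not in UTF-16, where every ASCII character gains an extra zero byte.
ascii_bytes = "ASCII".encode("ascii")
utf8_is_ascii_compatible = "ASCII".encode("utf-8") == ascii_bytes
utf16_is_ascii_compatible = "ASCII".encode("utf-16-le") == ascii_bytes
```

In this sketch the four sample characters occupy 1, 2, 3 and 4 bytes in UTF-8 and 2, 2, 2 and 4 bytes in UTF-16, and only the UTF-8 output matches the original ASCII bytes.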
 
==General structure==
* [[Lotus Multi-Byte Character Set]] (LMBCS)
* [[Triple-Byte Character Set]] (TBCS)
* [[Double-byte character set|Double-Byte Character Set]] (DBCS)
* [[SBCS|Single-Byte Character Set]] (SBCS)