Variable-width encoding: Difference between revisions

Revision as of 18:09, 28 January 2024 by ReadOnlyAccount (m, →General structure). Latest revision as of 21:26, 14 February 2025 by Wqwt (merge suggestion; Tag: Visual edit).
(One intermediate revision by one other user not shown)
Line 1:

{{more citations needed|date=December 2009}}
{{Short description|Type of character encoding scheme}}
{{Merge|Variable-length code|date=February 2025}}
{{About|the storage of text in computers|the transmission of data across noisy channels|variable-length code}}
{{Use dmy dates|date=December 2023}}

A '''variable-width encoding''' is a type of [[character encoding]] scheme in which codes of differing lengths are used to encode a [[character set]] (a repertoire of symbols) for representation, usually in a [[computer]].<ref>{{Cite RFC|last=Crispin|first=M.|date=1 April 2005|title=UTF-9 and UTF-18 Efficient Transformation Formats of Unicode|doi=10.17487/rfc4042|doi-access=}}</ref>{{efn|The concept long precedes the advent of the electronic computer, however, as seen with [[Morse code]].}} The most common variable-width encodings are '''multibyte encodings''' (aka '''MBCS''', for '''multi-byte character set'''), which use varying numbers of [[byte]]s ([[octet (computing)|octets]]) to encode different characters. (Some authors, notably in [[Microsoft]] documentation, use the term ''multibyte character set'', which is a [[misnomer]], because representation size is an attribute of the encoding, not of the character set.)

Early variable-width encodings using less than a byte per character were sometimes used to pack English text into fewer bytes in [[adventure game]]s for early [[microcomputer]]s. However, [[disk storage|disks]] (which, unlike tapes, allow random access, so that text can be loaded on demand), increases in computer memory, and general-purpose [[compression algorithm]]s have rendered such tricks largely obsolete.
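As a rough illustration of "varying numbers of bytes", the sketch below measures how many bytes a few characters occupy under UTF-8, chosen here only as a familiar multibyte encoding. The helper name `utf8_lengths` and the sample characters are assumptions made for this example, not part of any standard API.

```python
# Sample characters: "A", "é", "€" and an emoji, written as escapes
# so the snippet does not depend on the source file's own encoding.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]


def utf8_lengths(chars):
    """Map each character to the number of bytes in its UTF-8 encoding."""
    return {ch: len(ch.encode("utf-8")) for ch in chars}


lengths = utf8_lengths(samples)
for ch, n in lengths.items():
    print(f"U+{ord(ch):04X} -> {n} byte(s)")
# U+0041 -> 1 byte(s)
# U+00E9 -> 2 byte(s)
# U+20AC -> 3 byte(s)
# U+1F600 -> 4 byte(s)
```

The same four characters would each occupy a fixed four bytes in a fixed-width encoding such as UTF-32, which is the contrast the definition above is drawing.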
Multibyte encodings are usually the result of a need to increase the number of characters that can be encoded without breaking [[backward compatibility]] with existing systems. For example, with one byte (8 bits) per character, one can encode 256 possible characters. To encode more than 256 characters, the obvious choice would be to use two or more bytes per encoding unit. Two bytes (16 bits) would allow 65,536 possible characters, but such a change would break compatibility with existing systems and therefore might not be feasible at all.{{efn|As a real-life example of this, [[UTF-16]], which represents the most common characters in exactly the manner just described (and uses pairs of 16-bit code units for less-common characters), never gained traction as an encoding for text intended for interchange due to its incompatibility with the ubiquitous 7-/8-bit [[ASCII]] encoding, with its intended role instead being taken by [[UTF-8]], which ''does'' preserve ASCII compatibility.}}

==General structure==

Line 33:

* [[Lotus Multi-Byte Character Set]] (LMBCS)
* [[Triple-Byte Character Set]] (TBCS)
* [[Double-byte character set|Double-Byte Character Set]] (DBCS)
* [[SBCS|Single-Byte Character Set]] (SBCS)
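Returning to the capacity and compatibility argument made in the lead, the following minimal sketch checks the arithmetic (256 and 65,536 code values) and shows why a fixed two-byte unit breaks ASCII compatibility while UTF-8 does not. The function name `code_space` and the sample string are assumptions for illustration; little-endian UTF-16 without a byte order mark is used only to keep the comparison simple.

```python
def code_space(num_bytes):
    """Number of distinct values representable in num_bytes 8-bit bytes."""
    return 2 ** (8 * num_bytes)


text = "cat"  # ASCII-only sample string
ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")  # two bytes per character here

print(code_space(1), code_space(2))  # 256 65536
print(utf8_bytes == ascii_bytes)     # True: ASCII text is unchanged in UTF-8
print(utf16_bytes == ascii_bytes)    # False: every character gains a zero byte
```

A program that expects one byte per ASCII character can read the UTF-8 output unmodified, but not the UTF-16 output.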