Variable-width encoding: Difference between revisions

Undid revision 163259609 by 217.122.17.115 (talk)
Line 8:
 
==General structure==
Since the aim of a multibyte encoding system is to minimise changes to existing application software, some characters must retain their pre-existing single-unit codes, even while other characters have multiple units in their codes. The result is that there are three sorts of units in a variable-width encoding: '''singletons''', which consist of a single unit; '''lead units''', which come first in a multiunit sequence; and '''trail units''', which come afterwards in a multiunit sequence. Input and display software needs to know the structure of the multibyte encoding scheme, but other software generally does not need to know whether a pair of bytes represents two separate characters or just one character.
 
For example, the four-character string "[[I Love New York|I{{unicode|♥}}NY]]" is encoded in [[UTF-8]] like this (shown as [[hexadecimal]] byte values): <span style="color:green">49</span> <span style="color:red">E2</span> <span style="color:blue">99</span> <span style="color:blue">A5</span> <span style="color:green">4E</span> <span style="color:green">59</span>. Of the six units in that sequence, <span style="color:green">49</span>, <span style="color:green">4E</span>, and <span style="color:green">59</span> are singletons (for ''I, N,'' and ''Y''), <span style="color:red">E2</span> is a lead unit, and <span style="color:blue">99</span> and <span style="color:blue">A5</span> are trail units. The heart symbol is represented by the combination of the lead unit and the two trail units.
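
The classification above can be sketched in a few lines of Python. This is an illustrative sketch only, not part of any standard library: the function name <code>classify_utf8_units</code> is hypothetical, and it assumes the input is already valid UTF-8, so it does not reject byte values that can never appear in UTF-8 (such as C0, C1, or F5 to FF). In UTF-8, bytes below 80 are singletons, bytes from 80 to BF are trail units, and the remaining bytes are lead units.

```python
def classify_utf8_units(data: bytes) -> list[tuple[str, str]]:
    """Label each byte of (assumed valid) UTF-8 data by its role.

    Hypothetical helper for illustration; it performs no validation.
    """
    labels = []
    for byte in data:
        if byte < 0x80:
            kind = "singleton"   # 0xxxxxxx: a complete one-unit character
        elif byte <= 0xBF:
            kind = "trail"       # 10xxxxxx: continues a multiunit sequence
        else:
            kind = "lead"        # 11xxxxxx: starts a multiunit sequence
        labels.append((f"{byte:02X}", kind))
    return labels


# The string from the example above; \u2665 is the heart symbol.
units = classify_utf8_units("I\u2665NY".encode("utf-8"))
for hex_value, kind in units:
    print(hex_value, kind)
```

Because the three ranges do not overlap, software can tell what role a byte plays without examining its neighbours; this property holds for UTF-8 but not for every variable-width encoding.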