Revision as of 13:23, 11 February 2020 by BabelStone: Clarify that UTF-16 (but not UTF-32) can resync on a surrogate pair (but if there are no surrogate pairs in the text stream it is true that the text will be garbled because there is nothing to sync on)
Revision as of 16:31, 11 February 2020 by Spitzak (→For communication and storage): Surrogate pairs have nothing to do with this, the problem is 1/2 of a code point is missing
Line 36:

UTF-16 and UTF-32 do not have [[endianness]] defined, so a byte order must be selected when receiving them over a byte-oriented network or reading them from a byte-oriented storage. This may be achieved by using a [[byte-order mark]] at the start of the text or assuming big-endian (RFC 2781). [[UTF-8]], [[UTF-16BE]], [[UTF-32BE]], [[UTF-16LE]] and [[UTF-32LE]] are standardised on a single byte order and do not have this problem.

If the byte stream is subject to [[data corruption|corruption]] then some encodings recover better than others. UTF-8 and UTF-EBCDIC are best in this regard as they can always resynchronize after a corrupt or missing byte at the start of the next code point; GB 18030 is unable to recover ~~after a corrupt or missing byte~~ until the next ASCII non-number. UTF-16 ~~will~~ can handle ~~corrupt (altered)~~ ''altered'' bytes ~~by resynchronizing on the next good code point~~, but not an odd number of ~~lost or spurious [[octet (computing)|byte (octet)]]s~~ ''missing'' bytes, which will garble all the following text ~~unless a surrogate pair is encountered, at which point it will be able to resynchronize. UTF-32 will handle corrupt (altered) bytes by resynchronizing on the next good code point, but one, two, or three lost or spurious [[octet (computing)|byte (octet)]]s will garble all the following text~~ (though it will produce uncommon and/or unassigned characters). If ''bits'' can be lost all of them will garble the following text, though UTF-8 can be resynchronized as incorrect byte boundaries will produce invalid UTF-8 in text longer than a few bytes.

== In detail ==
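The recovery behaviour described for line 36 can be illustrated with a small experiment. The following is a minimal Python sketch, not part of either revision: the sample string, the helper name `drop_byte`, and the choice of which byte to remove are all assumptions made for illustration. It removes a single byte from a UTF-8 stream and from a UTF-16BE stream and decodes each with the standard `errors="replace"` handler.

```python
# Illustrative sketch: simulate one lost byte and compare how the
# UTF-8 and UTF-16BE decoders cope with the damaged stream.
# The sample text and the helper below are hypothetical choices.

text = "naïve café, 日本語 text"


def drop_byte(data: bytes, index: int) -> bytes:
    """Return a copy of data with the byte at index removed."""
    return data[:index] + data[index + 1:]


# Remove the byte at offset 3 from each encoded form.
utf8_damaged = drop_byte(text.encode("utf-8"), 3)
utf16_damaged = drop_byte(text.encode("utf-16-be"), 3)

# Undecodable sequences become U+FFFD instead of raising an error.
utf8_result = utf8_damaged.decode("utf-8", errors="replace")
utf16_result = utf16_damaged.decode("utf-16-be", errors="replace")

print("UTF-8 :", repr(utf8_result))
print("UTF-16:", repr(utf16_result))
```

In this sketch the UTF-8 decoder loses only the damaged character and picks up again at the next lead byte, whereas in the UTF-16BE stream every 16-bit unit after the missing byte is read from the wrong boundary, so the remaining text decodes to unrelated characters.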

Comparison of Unicode encodings: Difference between revisions