Comparison of Unicode encodings: Difference between revisions

Clarify that UTF-16 (but not UTF-32) can resync on a surrogate pair (but if there are no surrogate pairs in the text stream it is true that the text will be garbled because there is nothing to sync on)
For communication and storage: Surrogate pairs have nothing to do with this, the problem is 1/2 of a code point is missing
Line 36:
UTF-16 and UTF-32 do not have [[endianness]] defined, so a byte order must be selected when receiving them over a byte-oriented network or reading them from byte-oriented storage. This may be achieved by using a [[byte-order mark]] at the start of the text or by assuming big-endian (RFC 2781). [[UTF-8]], [[UTF-16BE]], [[UTF-32BE]], [[UTF-16LE]] and [[UTF-32LE]] are standardised on a single byte order and do not have this problem.
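
The byte-order selection described above can be sketched in Python for the UTF-16 case. This is a minimal illustration, not a complete detector: the function names <code>detect_utf16_byte_order</code> and <code>decode_utf16</code> are hypothetical, the input is assumed to be known to be UTF-16 (a UTF-32LE byte-order mark begins with the same two bytes as the UTF-16LE one), and the fallback to big-endian follows the RFC 2781 assumption mentioned in the text.

```python
import codecs


def detect_utf16_byte_order(data: bytes) -> str:
    """Pick a fixed-endianness codec name for data assumed to be UTF-16."""
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return "utf-16-be"
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return "utf-16-le"
    # No byte-order mark: assume big-endian, as the text describes.
    return "utf-16-be"


def decode_utf16(data: bytes) -> str:
    """Decode UTF-16 bytes, removing a leading byte-order mark if present."""
    codec = detect_utf16_byte_order(data)
    if data[:2] in (codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE):
        data = data[2:]  # the mark itself is not part of the text
    return data.decode(codec)
```

The fixed-endianness codecs (<code>utf-16-be</code>, <code>utf-16-le</code>) correspond to the encodings that the text says are standardised on a single byte order, which is why no further guessing is needed once one of them has been chosen.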
 
If the byte stream is subject to [[data corruption|corruption]] then some encodings recover better than others. UTF-8 and UTF-EBCDIC are best in this regard, as they can always resynchronize at the start of the next code point after a corrupt or missing byte; GB 18030 is unable to recover after a corrupt or missing byte until the next ASCII character that is not a digit. UTF-16 can handle corrupt (''altered'') bytes by resynchronizing on the next good code point, but not an odd number of ''missing'' bytes, which will garble all the following text (though it will produce uncommon and/or unassigned characters). If ''bits'' can be lost, all of these encodings will garble the following text, though UTF-8 can be resynchronized, as incorrect byte boundaries will produce invalid UTF-8 in text longer than a few bytes.
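
The difference in recovery after a single missing byte can be observed with a short Python sketch. The sample string, the helper names <code>drop_byte</code> and <code>decode_after_loss</code>, and the byte positions chosen are illustrative assumptions; the decoder's <code>errors="replace"</code> mode substitutes U+FFFD for undecodable input.

```python
SAMPLE = "naïve café, 日本語 text"  # arbitrary sample text (an assumption)


def drop_byte(data: bytes, index: int) -> bytes:
    """Simulate a single missing byte at the given position."""
    return data[:index] + data[index + 1:]


def decode_after_loss(text: str, encoding: str, index: int) -> str:
    """Encode text, lose one byte, then decode leniently."""
    damaged = drop_byte(text.encode(encoding), index)
    return damaged.decode(encoding, errors="replace")


if __name__ == "__main__":
    # UTF-8: the second byte of "ï" is lost; decoding picks up again at
    # the next code point, so the rest of the text survives.
    print(repr(decode_after_loss(SAMPLE, "utf-8", 3)))
    # UTF-16BE: one lost byte shifts every following 16-bit unit, so the
    # remainder decodes to unrelated characters.
    print(repr(decode_after_loss(SAMPLE, "utf-16-be", 5)))
```

In the UTF-8 case only the damaged character is replaced, whereas in the UTF-16 case nothing after the lost byte is recovered, which matches the behaviour described for an odd number of missing bytes.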
 
== In detail ==