Comparison of Unicode encodings: Difference between revisions

Move to the communication section. For processing, the accepted standard is to use host endianness (the same as for any other set of integers), but byte order is an issue for communication and storage.
Iaoth (talk | contribs)
Line 18:
Some protocols and file formats may be limited to a specific set of encodings, but even when they are not, some encodings may offer better compatibility with existing implementations than others. The cost of converting between the processing format and the communication format should also be considered, both in terms of program size (e.g. GB18030 requires a huge mapping table) and run-time requirements.
 
UTF-16 and UTF-32 are also not byte-oriented, so a byte order must be selected when transmitting them over a byte-oriented network or storing them in a byte-oriented file. This may be achieved by standardising on a single byte order, by specifying the endianness as part of external metadata (for example, the MIME charset registry has distinct UTF-16BE and UTF-16LE registrations as well as the plain UTF-16 one), or by using a [[Byte Order Mark]] at the start of the text.
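
As a minimal sketch of the byte-order options described above, the following Python snippet encodes the same two-character string in both byte orders and then relies on a byte order mark to let the decoder pick the right one. The sample string and variable names are illustrative assumptions, not part of any standard.

```python
import codecs

text = "A\u20ac"  # 'A' followed by the euro sign (U+20AC)

# The same code units serialised in the two possible byte orders.
big = text.encode("utf-16-be")     # bytes 00 41 20 AC
little = text.encode("utf-16-le")  # bytes 41 00 AC 20

# Option: prefix a byte order mark so the receiver can detect the order.
with_bom_be = codecs.BOM_UTF16_BE + big
with_bom_le = codecs.BOM_UTF16_LE + little

# Python's generic "utf-16" decoder reads the mark, picks the byte order
# and strips the mark from the result.
decoded_be = with_bom_be.decode("utf-16")
decoded_le = with_bom_le.decode("utf-16")

# Without a mark or external metadata, guessing the wrong order silently
# produces different characters rather than an error.
wrong = big.decode("utf-16-le")
```

Here both `decoded_be` and `decoded_le` equal the original string, while `wrong` does not, which is why one of the three mechanisms above is needed whenever the text leaves the process.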
 
Finally, if the byte stream is subject to corruption, some encodings recover better than others. UTF-8 and UTF-EBCDIC are best in this regard, as they can always resynchronise at the start of the next good character. UTF-16 and UTF-32 handle corrupt bytes well (again recovering at the next good character), but a lost byte will garble all following text. GB18030 may be thrown out of sync by a corrupt or missing byte and has no designed-in recovery.
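
The difference in recovery behaviour can be illustrated with a short Python sketch that deletes a single byte from a UTF-8 stream and from a UTF-16 stream. The helper `drop_byte` and the sample text are hypothetical names chosen for this example, and decoding with `errors="replace"` is just one possible way for a receiver to handle damaged input.

```python
def drop_byte(data: bytes, index: int) -> bytes:
    """Simulate a lost byte by removing the byte at the given index."""
    return data[:index] + data[index + 1:]

text = "naïve café"

# UTF-8: lose the lead byte of the two-byte sequence for 'ï'.
utf8_damaged = drop_byte(text.encode("utf-8"), 2)
utf8_result = utf8_damaged.decode("utf-8", errors="replace")

# UTF-16 (little-endian): lose one byte of the code unit for 'ï'.
utf16_damaged = drop_byte(text.encode("utf-16-le"), 4)
utf16_result = utf16_damaged.decode("utf-16-le", errors="replace")

# Only the damaged character is replaced in the UTF-8 case.
print(utf8_result)
# Every code unit after the lost byte is misaligned in the UTF-16 case.
print(utf16_result)
```

In the UTF-8 case the decoder resynchronises on the next lead byte, so the rest of the text, including the later "é", survives. In the UTF-16 case every code unit after the lost byte is shifted by one byte and decodes to unrelated characters.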