Move to the "For communication and storage" section: for processing, the accepted standard is to use host endianness (the same as for any other set of integers), but byte order is an issue for communication/storage.
Fixed-size characters can be helpful, but even when there is a fixed width per code point (as in UTF-32), there is not a fixed width per displayed character, because of [[combining character]]s. If you are working heavily with a particular [[application programming interface|API]] that has standardised on a particular Unicode encoding, it is generally a good idea to use that encoding, which avoids the need to convert before every call to the API. Similarly, if you are writing server-side software, it may simplify matters to use the same format for processing that you use for communication.
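The point about combining characters can be illustrated with a minimal Python sketch. The helper name <code>count_base_characters</code> is hypothetical, and counting non-combining code points is only a rough proxy for displayed characters; full grapheme segmentation follows more involved rules.

```python
import unicodedata

def count_base_characters(text: str) -> int:
    """Count code points that are not combining marks.

    This is a rough approximation of the number of displayed characters,
    not a full grapheme cluster segmentation.
    """
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)

# 'e' followed by U+0301 COMBINING ACUTE ACCENT: displayed as a single "é".
text = "e\u0301"

code_points = len(text)                      # number of code points
utf32_bytes = len(text.encode("utf-32-le"))  # 4 bytes per code point, no BOM
base_chars = count_base_characters(text)     # approximate displayed characters
```

Here the text occupies two fixed-width UTF-32 units but is displayed as one character, so a fixed width per code point does not give a fixed width per displayed character.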
UTF-16 is popular because many APIs date from the time when Unicode was a 16-bit fixed-width encoding. Unfortunately, using UTF-16 makes characters outside the BMP a special case, which increases the risk of oversights in their handling.
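As a small illustration of that special case, the following Python sketch counts 16-bit code units for a character inside the BMP and one outside it. The helper name <code>utf16_code_units</code> and the sample characters are chosen for illustration only.

```python
def utf16_code_units(text: str) -> int:
    """Return the number of 16-bit code units needed to encode text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2

bmp_char = "\u00e9"         # U+00E9, inside the BMP
astral_char = "\U0001D11E"  # U+1D11E MUSICAL SYMBOL G CLEF, outside the BMP

bmp_units = utf16_code_units(bmp_char)        # one code unit
astral_units = utf16_code_units(astral_char)  # two code units (a surrogate pair)

# The surrogate pair for U+1D11E in big-endian byte order: D834 DD1E.
astral_bytes = astral_char.encode("utf-16-be")
```

Code that assumes one 16-bit unit per character (for example when indexing or truncating a string) will handle the first character correctly and may split the second in the middle of its surrogate pair.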
===For communication and storage===
Some protocols and file formats may be limited to a specific set of encodings, but even when they are not, some encodings may offer better compatibility with existing implementations than others. The cost of converting between your processing format and your communication format should also be considered, both in terms of program size (e.g. GB18030 requires a huge mapping table) and run-time requirements.
In addition, UTF-16 and UTF-32 are not byte orientated, so a byte order must be selected when transmitting them over a byte-orientated network or storing them in a byte-orientated file. This may be achieved by standardising on a single byte order, by specifying the endianness as part of external metadata (for example, the MIME charset registry has distinct UTF-16BE and UTF-16LE registrations as well as the plain UTF-16 one), or by using a [[byte order mark]] at the start of the text.
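The byte order options can be sketched in Python using the standard <code>codecs</code> module. The helper name <code>decode_with_bom</code> is hypothetical; the example assumes the data begins with a byte order mark.

```python
import codecs

text = "A"  # U+0041

# Explicit byte order, as would be declared in external metadata.
big = text.encode("utf-16-be")     # most significant byte first
little = text.encode("utf-16-le")  # least significant byte first

def decode_with_bom(data: bytes) -> str:
    """Decode UTF-16 data, letting the leading byte order mark select the byte order."""
    return data.decode("utf-16")

# Prefixing a byte order mark lets the receiver detect the byte order itself.
from_be = decode_with_bom(codecs.BOM_UTF16_BE + big)
from_le = decode_with_bom(codecs.BOM_UTF16_LE + little)
```

The same character yields different byte sequences in the two byte orders, and in both cases the byte order mark is enough for the decoder to recover the original text.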
Finally, if the byte stream is subject to corruption, some encodings recover better than others. UTF-8 and UTF-EBCDIC are best in this regard, as they can always resynchronise at the start of the next good character. UTF-16 and UTF-32 handle corrupted bytes well (again recovering at the next good character), but a lost byte will garble all following text. GB18030 may be thrown out of sync by a corrupt or missing byte and has no designed-in recovery.
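The difference in recovery after a lost byte can be seen in a short Python sketch. The helper name <code>drop_byte</code> and the sample text are assumptions made for illustration; decoding uses <code>errors="replace"</code> so that undecodable input becomes U+FFFD instead of raising an exception.

```python
def drop_byte(data: bytes, index: int) -> bytes:
    """Simulate a transmission error that loses the byte at the given index."""
    return data[:index] + data[index + 1:]

text = "a\u00e9bc"  # "aébc"

# UTF-8: lose the lead byte of the two-byte sequence for "é".
damaged_utf8 = drop_byte(text.encode("utf-8"), 1)
recovered_utf8 = damaged_utf8.decode("utf-8", errors="replace")

# UTF-16: lose one byte of the first code unit, shifting everything after it.
damaged_utf16 = drop_byte(text.encode("utf-16-le"), 1)
recovered_utf16 = damaged_utf16.decode("utf-16-le", errors="replace")
```

In the UTF-8 case only the damaged character is replaced and the trailing "bc" survives. In the UTF-16 case every later code unit is read from the wrong byte offset, so none of the following text is recovered.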
==In detail==