Comparison of Unicode encodings

==Considerations other than size==
===For processing===
For processing, a format should be easy to search, truncate, and generally process safely. All normal Unicode encodings use some form of fixed-size code unit. Depending on the format and the code point to be encoded, one or more of these code units will represent a Unicode code point. To allow easy searching and truncation, a sequence must not occur within a longer sequence or across the boundary of two other sequences. UTF-8, UTF-16 and UTF-32 have these important properties, but UTF-7 and GB18030 do not.
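The searching property can be illustrated with a minimal Python sketch. It assumes Python's built-in <code>gb18030</code> and <code>utf-8</code> codecs; the helper name <code>byte_search_hits</code> is hypothetical and chosen only for this example. The two-byte GB18030 sequence 0x81 0x40 ends in a byte that is also the ASCII code for "@", so a byte-level search reports a match that does not exist at the character level, whereas the same search on UTF-8 data does not.

```python
# Illustrative sketch only; byte_search_hits is a hypothetical helper name.
def byte_search_hits(needle: str, haystack: str, encoding: str) -> bool:
    """Return True if the encoded needle occurs anywhere in the encoded haystack."""
    return needle.encode(encoding) in haystack.encode(encoding)


# A single character whose GB18030 encoding is the two bytes 0x81 0x40.
# The trail byte 0x40 is also the ASCII code for "@".
two_byte_char = b"\x81\x40".decode("gb18030")

# Character-level search: the text does not contain "@".
found_as_text = "@" in two_byte_char

# Byte-level search on GB18030 data: false match inside the two-byte sequence.
found_in_gb18030 = byte_search_hits("@", two_byte_char, "gb18030")

# Byte-level search on UTF-8 data: every byte of a multi-byte sequence
# is 0x80 or above, so an ASCII byte can never match inside it.
found_in_utf8 = byte_search_hits("@", two_byte_char, "utf-8")

print(found_as_text, found_in_gb18030, found_in_utf8)
```

A byte-oriented search routine is therefore safe on UTF-8 data, while GB18030 data has to be scanned from a known character boundary to avoid such false matches.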
 
Fixed-size characters can be helpful, but it should be remembered that even if there is a fixed width per code point (as in UTF-32), there is not a fixed width per displayed character, due to [[combining character]]s. If you are working heavily with a particular [[application programming interface|API]] and that API has standardised on a particular Unicode encoding, it is generally a good idea to use the same encoding, to avoid the need to convert before every call to the API. Similarly, if you are writing a network daemon, it may simplify matters to use the same format for processing that you are communicating in.
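The point about combining characters can be shown with a short Python sketch. It uses the standard <code>unicodedata</code> module; the helper name <code>utf32_size</code> is hypothetical. The same displayed character "é" can be stored as one code point or as a base letter followed by a combining accent, so even UTF-32 does not give a fixed number of bytes per displayed character.

```python
import unicodedata

# "e" followed by U+0301 COMBINING ACUTE ACCENT: displayed as one character.
decomposed = "e\u0301"
# NFC normalisation composes the pair into the single code point U+00E9.
precomposed = unicodedata.normalize("NFC", decomposed)


def utf32_size(text: str) -> int:
    """Number of bytes in UTF-32 (little-endian, no byte order mark)."""
    return len(text.encode("utf-32-le"))


# Both strings display as "é", but they differ in code points and in bytes.
print(len(decomposed), utf32_size(decomposed))
print(len(precomposed), utf32_size(precomposed))
# A non-zero combining class marks U+0301 as a combining character.
print(unicodedata.combining("\u0301"))
```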
 
UTF-16 is popular because many APIs date to the time when Unicode was 16-bit fixed width. Unfortunately, using UTF-16 makes characters outside the BMP a special case, since each one requires a pair of surrogate code units, which increases the risk of oversights related to their handling.
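A typical oversight of this kind is truncating UTF-16 text at a fixed number of code units. The following Python sketch is illustrative only, and all three function names are hypothetical. It shows that a naive cut can split a surrogate pair and leave data that no longer decodes, while a cut that checks for a trailing high surrogate stays valid.

```python
def utf16_code_units(text: str) -> int:
    """Number of 16-bit code units needed to encode the text in UTF-16."""
    return len(text.encode("utf-16-le")) // 2


def naive_truncate_utf16(text: str, max_units: int) -> bytes:
    """Cut after max_units code units without checking surrogates (unsafe)."""
    return text.encode("utf-16-le")[: max_units * 2]


def safe_truncate_utf16(text: str, max_units: int) -> bytes:
    """Cut after at most max_units code units, never splitting a surrogate pair."""
    data = text.encode("utf-16-le")[: max_units * 2]
    if len(data) >= 2:
        last_unit = int.from_bytes(data[-2:], "little")
        # A high surrogate (0xD800-0xDBFF) must be followed by a low surrogate,
        # so a high surrogate left at the end of the cut is dropped.
        if 0xD800 <= last_unit <= 0xDBFF:
            data = data[:-2]
    return data


# "a" is in the BMP; U+1F600 lies outside it and needs two code units.
sample = "a\U0001F600"
print(utf16_code_units(sample))

try:
    naive_truncate_utf16(sample, 2).decode("utf-16-le")
    naive_ok = True
except UnicodeDecodeError:
    naive_ok = False  # the cut landed in the middle of the surrogate pair

safe_result = safe_truncate_utf16(sample, 2).decode("utf-16-le")
print(naive_ok, safe_result)
```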
 
===For communication===
Some protocols may be limited to a specific set of encodings, but even when they are not, some encodings may offer better compatibility with existing implementations than others. The cost of converting between the processing format and the communication format should also be considered, both in terms of program size (e.g. GB18030 requires a huge mapping table) and run-time requirements. It may simplify matters to use the same format for processing as for communication, especially for servers.
 
==In detail==