Revision as of 21:26, 29 November 2020 by AnomieBOT (Bots, 6,855,328 edits): m Dating maintenance tags: {{Use dmy dates}}. Tag: Reverted
Revision as of 07:10, 30 November 2020 by Tea2min (Extended confirmed users, Pending changes reviewers, 21,968 edits): Revert to revision 979331336 dated 2020-09-20 04:39:13 by Stephan Leeds using popups. Tag: Manual revert
Line 1:
{{Use dmy dates|date=~~November~~July ~~2020~~2013}}
{{More footnotes|date=July 2019}}
This article compares [[Unicode]] encodings. Two situations are considered: [[8-bit-clean]] environments, and environments that forbid use of [[byte]] values that have the high bit set. Originally such prohibitions were to allow for links that used only seven data bits, but they remain in the standards and so software must generate messages that comply with the restrictions. [[Standard Compression Scheme for Unicode]] and [[Binary Ordered Compression for Unicode]] are excluded from the comparison tables because it is difficult to simply quantify their size.
Line 6:
A [[UTF-8]] file that contains only [[ASCII]] characters is identical to an ASCII file. Legacy programs can generally handle UTF-8 encoded files, even if they contain non-ASCII characters. For instance, the [[C (programming language)|C]] [[printf]] function can print a UTF-8 string, as it only looks for the ASCII '%' character to define a formatting string, and prints all other bytes unchanged, thus non-ASCII characters will be output unchanged.
[[~~UTF6~~UTF-16]] and [[UTF-32]] are incompatible with ASCII files, and thus require [[Unicode]]-aware programs to display, print and manipulate them, even if the file is known to contain only characters in the ASCII subset. Because they contain many zero bytes, the strings cannot be manipulated by normal [[null-terminated string]] handling for even simple operations such as copy. Therefore, even on most UTF-16 systems such as [[Windows]] and [[Java (software platform)|Java]], UTF-16 text files are not common; older 8-bit encodings such as ASCII or [[ISO-8859-1]] are still used, forgoing Unicode support; or UTF-~~8986~~8 is used for Unicode. One rare counter-example is the "strings" file used by [[Mac OS X]] (10.3 and later) applications for lookup of internationalized versions of messages which defaults to UTF-16, with "files encoded using UTF-8 ... not guaranteed to work."<ref>[https://developer.apple.com/documentation/MacOSX/Conceptual/BPInternational/Articles/StringsFiles.html Apple Developer Connection: Internationalization Programming Topics: Strings Files]</ref>
[[XML]] is, by default, encoded as UTF-8, and all XML processors must at least support UTF-8 (including US-ASCII by definition) and UTF-16.<ref>{{cite web |~~urlll~~url=http://www.w3.org/TR/xml/#charencoding |title=Character Encoding in Entities |work=Extensible Markup Language (XML) 1.0 (Fifth Edition)
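
The ASCII-compatibility claims in the "Line 6" text can be checked with a short sketch. The sample string and the helper `c_strlen` are hypothetical names introduced only for illustration; `c_strlen` merely imitates how a C routine for null-terminated strings would measure a buffer, and the little-endian codecs are chosen so that no byte order mark is added.

```python
# Illustrative sketch only: how ASCII-only text looks in three Unicode encodings.
text = "printf %s"  # hypothetical ASCII-only sample (9 characters)

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")  # little-endian, no byte order mark
utf32_bytes = text.encode("utf-32-le")  # little-endian, no byte order mark


def c_strlen(buf: bytes) -> int:
    """Imitate C strlen: count the bytes before the first zero byte."""
    end = buf.find(b"\x00")
    return len(buf) if end == -1 else end


# An ASCII-only UTF-8 file is byte-for-byte the same as the ASCII file.
print(utf8_bytes == ascii_bytes)  # True

# UTF-16 and UTF-32 pad every ASCII character with zero bytes.
print(utf16_bytes.count(0), utf32_bytes.count(0))  # 9 27

# A null-terminated string routine stops at the first zero byte.
print(c_strlen(utf8_bytes), c_strlen(utf16_bytes), c_strlen(utf32_bytes))  # 9 1 1

# Every byte of a non-ASCII character in UTF-8 has the high bit set,
# so a byte-wise scan for '%' (0x25) cannot match inside such a character.
print(all(b >= 0x80 for b in "é".encode("utf-8")))  # True
```

Under these assumptions, the UTF-8 buffer passes through byte-oriented code intact, while the UTF-16 and UTF-32 buffers appear to end after their first byte.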

Comparison of Unicode encodings: Difference between revisions