Content deleted Content added
No edit summary Tags: Reverted Visual edit Mobile edit Mobile web edit |
|||
(20 intermediate revisions by 17 users not shown) | |||
Line 17:
}}
{{Contains special characters|special=uncommon Unicode characters}}
'''Unicode'''
Unicode has largely supplanted the previous environment of
The Unicode [[character repertoire]] is synchronized with [[Universal Coded Character Set|ISO/IEC 10646]], each being code-for-code identical with one another. However, ''The Unicode Standard'' is more than just a repertoire within which characters are assigned. To aid developers and designers, the standard also provides charts and reference data, as well as annexes explaining concepts germane to various scripts, providing guidance for their implementation. Topics covered by these annexes include [[Unicode equivalence#Normalization|character normalization]], [[Combining character|character composition]] and decomposition, [[Unicode collation algorithm|collation]], and [[Bidirectional text#Unicode bidi support|directionality]].<ref>{{Cite web |title=The Unicode Standard: A Technical Introduction |url=https://www.unicode.org/standard/principles.html |date=22 August 2019 |access-date=11 September 2024}}</ref>
Line 39:
=== {{anchor|Unicode 88}}History ===
The origins of Unicode can be traced back to the 1980s, to a group of individuals with connections to [[Xerox]]'s [[Xerox Character Code Standard|Character Code Standard]] (XCCS).<ref name="unicode-88" /> In 1987, Xerox employee [[Joe Becker (Unicode)|Joe Becker]], along with [[Apple Inc.|Apple]] employees [[Lee Collins (Unicode)|Lee Collins]] and [[Mark Davis (Unicode)|Mark Davis]], started investigating the practicalities of creating a universal character set.<ref>{{Cite web |title=Summary Narrative |url=https://www.unicode.org/history/summary.html |website=Unicode |date=August 31, 2006 |access-date=15 March 2010}}</ref> With additional input from Peter Fenwick and [[Dave Opstad]],<ref name="unicode-88" /> Becker published a draft proposal for an "international/multilingual text character encoding system in August 1988, tentatively called Unicode". He explained that "the name 'Unicode' is intended to suggest a unique, unified, universal encoding".<ref name="unicode-88">{{Cite web |last=Becker |first=Joseph D. |author-link=Joseph D. Becker |date=10 September 1998 |title=Unicode 88 |url=https://unicode.org/history/unicode88.pdf |url-status=live |archive-url=https://web.archive.org/web/20161125224409/https://unicode.org/history/unicode88.pdf |archive-date=25 November 2016 |access-date=25 October 2016 |publisher=[[Unicode Consortium]] |quote=In 1978, the initial proposal for a set of "Universal Signs" was made by [[Bob Belleville]] at [[Xerox PARC]]. Many persons contributed ideas to the development of a new encoding design. Beginning in 1980, these efforts evolved into the [[Xerox Character Code Standard]] (XCCS) by the present author, a multilingual encoding that has been maintained by Xerox as an internal corporate standard since 1982, through the efforts of Ed Smura, Ron Pellar, and others.<br />Unicode arose as the result of eight years of working experience with XCCS. Its fundamental differences from XCCS were proposed by Peter Fenwick and Dave Opstad (pure 16-bit codes) and by [[Lee Collins (Unicode)|Lee Collins]] (ideographic character unification). Unicode retains the many features of XCCS whose utility has been proved over the years in an international line of communication multilingual system products. |orig-year=1988-08-29}}</ref>
In this document, entitled ''Unicode 88'', Becker outlined a scheme using [[16-bit computing|16-bit]] characters:<ref name="unicode-88" />
<blockquote>
Unicode is intended to address the need for a workable, reliable world text encoding. Unicode could be roughly described as "wide-body [[ASCII]]" that has been stretched to 16 bits to encompass the characters of all the world's living languages. In a properly engineered design, 16 bits per character are more than sufficient for this purpose.
</blockquote>
Line 53:
</blockquote>
In early 1989, the Unicode working group expanded to include Ken Whistler and Mike Kernaghan of Metaphor, Karen Smith-Yoshimura and Joan Aliprand of [[Research Libraries Group]], and Glenn Wright of [[Sun Microsystems]]. In 1990, Michel Suignard and Asmus Freytag of [[Microsoft]] and [[NeXT]]'s Rick McGowan had also joined the group. By the end of 1990, most of the work of remapping existing standards had been completed, and a final review draft of Unicode was ready.
The [[Unicode Consortium]] was incorporated in California on 3 January 1991,<ref>{{Cite web |title=History of Unicode Release and Publication Dates |url=https://unicode.org/history/publicationdates.html |access-date=20 March 2023 |website=Unicode}}</ref> and the first volume of ''The Unicode Standard'' was published that October. The second volume, now adding Han ideographs, was published in June 1992.
In 1996, a surrogate character mechanism was implemented in Unicode 2.0, so that Unicode was no longer restricted to 16 bits. This increased the Unicode codespace to over a million code points, which allowed for the encoding of many historic scripts, such as [[Egyptian hieroglyphs]], and thousands of rarely used or obsolete characters that had not been anticipated for inclusion in the standard. Among these characters are various rarely used [[CJK
Version 1.0 of Microsoft's TrueType specification, published in 1992, used the name "Apple Unicode" instead of "Unicode" for the Platform ID in the naming table.
Line 75:
[[File:Unicode sample.png|class=skin-invert-image|thumb|right|200px|Many modern applications can render a substantial subset of the many [[scripts in Unicode]], as demonstrated by this screenshot from the [[OpenOffice.org]] application.]]<!-- screenshot fair use rationale: this screenshot is used specifically to illustrate the Unicode-related capabilities of modern desktop applications and the breadth of supported Unicode scripts -->
{{As of|September 2024}}, a total of 168<ref>{{Cite web |title=Supported Scripts |url=https://www.unicode.org/standard/supported.html |access-date=16 September 2022 |website=Unicode}}</ref>
=== Proposals for adding scripts ===
The Unicode Roadmap Committee ([[Michael Everson]], Rick McGowan, Ken Whistler, V.S. Umamaheswaran)<ref>{{Cite web |title=Roadmap to the BMP |url=https://www.unicode.org/roadmaps/bmp/ |access-date=30 July 2018 |publisher=[[Unicode Consortium]]}}</ref> maintain the list of scripts that are candidates or potential candidates for encoding and their tentative code block assignments on the Unicode Roadmap<ref>{{Cite web|url=https://www.unicode.org/roadmaps/|title=Roadmaps to Unicode|website=Unicode |url-status=live |archive-url= https://web.archive.org/web/20231208091250/http://www.unicode.org/roadmaps/ |archive-date= Dec 8, 2023 }}</ref> page of the [[Unicode Consortium]] website. For some scripts on the Roadmap, such as [[Jurchen script|Jurchen]] and [[Khitan large script]], encoding proposals have been made and they are working their way through the approval process. For other scripts, such as [[Numidian language|Numidian]] and [[Rongorongo]], no proposal has yet been made, and they await agreement on character repertoire and other details from the user communities involved.
Line 85:
There is also a [[Medieval Unicode Font Initiative]] focused on special Latin medieval characters. Part of these proposals has been already included in Unicode.
The Script Encoding Initiative (SEI),<ref>{{Cite web|url=https://
=== Versions ===
Line 402:
''The Unicode Standard'' defines a ''codespace'':<ref name="Glossary">{{Cite web |title=Glossary of Unicode Terms |url=https://unicode.org/glossary/ |access-date=16 March 2010}}</ref> a sequence of integers called ''[[code point]]s''<ref name=":0">{{Cite book |url=https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-2/#G25564 |title=The Unicode Standard Version 16.0 – Core Specification |year=2024 |chapter=2.4 Code Points and Characters}}</ref> in the range from 0 to {{val|1114111}}, notated according to the standard as {{tt|U+0000}}–{{tt|U+10FFFF}}.<ref>{{Cite book |url=https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-3/#G2212 |title=The Unicode Standard, Version 16.0 |year=2024 |chapter=3.4 Characters and Encoding}}</ref> The codespace is a systematic, architecture-independent representation of ''The Unicode Standard''; actual text is processed as binary data via one of several Unicode encodings, such as [[UTF-8]].
In this normative notation, the two-character prefix <code>U+</code> always precedes a written code point, and the code points themselves are written as [[hexadecimal]] numbers.{{Refn|The two-character prefix <code>U+</code> was chosen as an ASCII approximation of {{unichar|U+228E}}.<ref>{{Cite mailing list |url=https://unicode.org/mail-arch/unicode-ml/y2005-m11/0060.html |title=Re: Origin of the U+nnnn notation |date=8 November 2005 |mailing-list=Unicode Mail List Archive}}</ref>
There are a total of {{val|1112064}} valid code points within the codespace.<ref>{{cite book |title=The Unicode Standard |publisher=[[The Unicode Consortium]] |isbn=978-1-936213-01-6 |edition=6.0 |___location=Mountain View, California, US |at=3.9 Unicode Encoding Forms |chapter=Conformance |quote=Each encoding form maps the Unicode code points U+0000..U+D7FF and U+E000..U+10FFFF |chapter-url=https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-3/#G7404}}</ref> This number arises from the limitations of the [[UTF-16]] character encoding, which can encode the 2<sup>16</sup> code points in the range {{tt|U+0000}} through {{tt|U+FFFF}} except for the 2<sup>11</sup> code points in the range {{tt|U+D800}} through {{tt|U+DFFF}}, which are used as surrogate pairs to encode the 2<sup>20</sup> code points in the range {{tt|U+10000}} through {{tt|U+10FFFF}}.
Line 846:
=== Security<span class="anchor" id="Security issues"></span> ===
Unicode has a large number of [[homoglyphs]], many of which look very similar or identical to ASCII letters. Substitution of these can make an identifier or URL that looks correct, but directs to a different ___location than expected.<ref>{{Cite web |title=UTR #36: Unicode Security Considerations |url=https://unicode.org/reports/tr36/ |website=Unicode}}</ref> Additionally, homoglyphs can also be used for manipulating the output of [[NLP (computer science)|natural language processing (NLP)]] systems.<ref>{{Cite book |last1=Boucher |first1=Nicholas |last2=Shumailov |first2=Ilia |last3=Anderson |first3=Ross |last4=Papernot |first4=Nicolas |title=2022 IEEE Symposium on Security and Privacy (SP) |chapter=Bad Characters: Imperceptible NLP Attacks |year=2022
A security advisory was released in 2021 by two researchers, one from the [[University of Cambridge]] and the other from the [[University of Edinburgh]], in which they assert that the [[Bidirectional Text#Bidirectional text#Explicit formatting|BiDi marks]] can be used to make large sections of code do something different from what they appear to do. The problem was named "[[Trojan Source]]".<ref>{{Cite web |first1=Nicholas |last1=Boucher |first2=Ross |last2=Anderson |title=Trojan Source: Invisible Vulnerabilities |url=https://www.trojansource.codes/trojan-source.pdf |access-date=2 November 2021}}</ref> In response, code editors started highlighting marks to indicate forced text-direction changes.<ref>{{Cite web |title=Visual Studio Code October 2021 |url=https://code.visualstudio.com/updates/v1_62#_unicode-directional-formatting-characters |access-date=11 November 2021 |website=code.visualstudio.com |language=en}}</ref>
|