(11 intermediate revisions by 8 users not shown)
Line 75:
[[File:Unicode sample.png|class=skin-invert-image|thumb|right|200px|Many modern applications can render a substantial subset of the many [[scripts in Unicode]], as demonstrated by this screenshot from the [[OpenOffice.org]] application.]]<!-- screenshot fair use rationale: this screenshot is used specifically to illustrate the Unicode-related capabilities of modern desktop applications and the breadth of supported Unicode scripts -->
{{As of|September 2024}}, a total of 168<ref>{{Cite web |title=Supported Scripts |url=https://www.unicode.org/standard/supported.html |access-date=16 September 2022 |website=Unicode}}</ref>
=== Proposals for adding scripts ===
The Unicode Roadmap Committee ([[Michael Everson]], Rick McGowan, Ken Whistler, V.S. Umamaheswaran)<ref>{{Cite web |title=Roadmap to the BMP |url=https://www.unicode.org/roadmaps/bmp/ |access-date=30 July 2018 |publisher=[[Unicode Consortium]]}}</ref> maintain the list of scripts that are candidates or potential candidates for encoding and their tentative code block assignments on the Unicode Roadmap<ref>{{Cite web|url=https://www.unicode.org/roadmaps/|title=Roadmaps to Unicode|website=Unicode |url-status=live |archive-url= https://web.archive.org/web/20231208091250/http://www.unicode.org/roadmaps/ |archive-date= Dec 8, 2023 }}</ref> page of the [[Unicode Consortium]] website. For some scripts on the Roadmap, such as [[Jurchen script|Jurchen]] and [[Khitan large script]], encoding proposals have been made and they are working their way through the approval process. For other scripts, such as [[Numidian language|Numidian]] and [[Rongorongo]], no proposal has yet been made, and they await agreement on character repertoire and other details from the user communities involved.
Line 85:
There is also a [[Medieval Unicode Font Initiative]] focused on special Latin medieval characters. Some of these proposals have already been included in Unicode.
The Script Encoding Initiative (SEI),<ref>{{Cite web|url=https://
=== Versions ===
Line 402:
''The Unicode Standard'' defines a ''codespace'':<ref name="Glossary">{{Cite web |title=Glossary of Unicode Terms |url=https://unicode.org/glossary/ |access-date=16 March 2010}}</ref> a sequence of integers called ''[[code point]]s''<ref name=":0">{{Cite book |url=https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-2/#G25564 |title=The Unicode Standard Version 16.0 – Core Specification |year=2024 |chapter=2.4 Code Points and Characters}}</ref> in the range from 0 to {{val|1114111}}, notated according to the standard as {{tt|U+0000}}–{{tt|U+10FFFF}}.<ref>{{Cite book |url=https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-3/#G2212 |title=The Unicode Standard, Version 16.0 |year=2024 |chapter=3.4 Characters and Encoding}}</ref> The codespace is a systematic, architecture-independent representation of ''The Unicode Standard''; actual text is processed as binary data via one of several Unicode encodings, such as [[UTF-8]].
In this normative notation, the two-character prefix <code>U+</code> always precedes a written code point, and the code points themselves are written as [[hexadecimal]] numbers.{{Refn|The two-character prefix <code>U+</code> was chosen as an ASCII approximation of {{unichar|U+228E}}.<ref>{{Cite mailing list |url=https://unicode.org/mail-arch/unicode-ml/y2005-m11/0060.html |title=Re: Origin of the U+nnnn notation |date=8 November 2005 |mailing-list=Unicode Mail List Archive}}</ref>
There are a total of {{val|1112064}} valid code points within the codespace.<ref>{{cite book |title=The Unicode Standard |publisher=[[The Unicode Consortium]] |isbn=978-1-936213-01-6 |edition=6.0 |location=Mountain View, California, US |at=3.9 Unicode Encoding Forms |chapter=Conformance |quote=Each encoding form maps the Unicode code points U+0000..U+D7FF and U+E000..U+10FFFF |chapter-url=https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-3/#G7404}}</ref> This number arises from the limitations of the [[UTF-16]] character encoding, which can directly encode the 2<sup>16</sup> code points in the range {{tt|U+0000}} through {{tt|U+FFFF}} except for the 2<sup>11</sup> code points in the range {{tt|U+D800}} through {{tt|U+DFFF}}. These are reserved as surrogates, which are used in pairs to encode the 2<sup>20</sup> code points in the range {{tt|U+10000}} through {{tt|U+10FFFF}}. The total is therefore 2<sup>16</sup> − 2<sup>11</sup> + 2<sup>20</sup> = {{val|1112064}}.
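The arithmetic and notation described above can be checked with a short Python sketch. It is illustrative only: the names <code>u_plus</code> and <code>to_surrogate_pair</code> are hypothetical helpers introduced here, not part of the standard or of any library.

```python
# Sketch: reproduce the count of valid code points and show how a
# UTF-16 surrogate pair represents a code point above U+FFFF.

CODESPACE_SIZE = 0x10FFFF + 1            # code points U+0000..U+10FFFF
SURROGATES = range(0xD800, 0xDFFF + 1)   # 2**11 code points reserved for UTF-16

# Every code point except the surrogates is valid for encoding.
valid_count = CODESPACE_SIZE - len(SURROGATES)


def u_plus(code_point: int) -> str:
    """Format a code point in U+ notation (hex, padded to four digits)."""
    return f"U+{code_point:04X}"


def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a code point in U+10000..U+10FFFF into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF use a surrogate pair")
    offset = code_point - 0x10000        # fits in 20 bits
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


print(valid_count)                                        # 1112064
print(u_plus(0x10FFFF))                                   # U+10FFFF
print([u_plus(u) for u in to_surrogate_pair(0x1F600)])    # ['U+D83D', 'U+DE00']
```

The 10-bit halves of the two surrogates together address 2<sup>20</sup> code points, which is why the surrogate range of 2<sup>11</sup> code points is enough to cover {{tt|U+10000}} through {{tt|U+10FFFF}}.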
Line 846:
=== Security<span class="anchor" id="Security issues"></span> ===
Unicode has a large number of [[homoglyphs]], many of which look very similar or identical to ASCII letters. Substitution of these can make an identifier or URL that looks correct, but directs to a different location than expected.<ref>{{Cite web |title=UTR #36: Unicode Security Considerations |url=https://unicode.org/reports/tr36/ |website=Unicode}}</ref> Homoglyphs can also be used for manipulating the output of [[NLP (computer science)|natural language processing (NLP)]] systems.<ref>{{Cite book |last1=Boucher |first1=Nicholas |last2=Shumailov |first2=Ilia |last3=Anderson |first3=Ross |last4=Papernot |first4=Nicolas |title=2022 IEEE Symposium on Security and Privacy (SP) |chapter=Bad Characters: Imperceptible NLP Attacks |year=2022
In 2021, two researchers, one from the [[University of Cambridge]] and the other from the [[University of Edinburgh]], released a security advisory asserting that the [[Bidirectional text#Explicit formatting|BiDi marks]] can be used to make large sections of code do something different from what they appear to do. The problem was named "[[Trojan Source]]".<ref>{{Cite web |first1=Nicholas |last1=Boucher |first2=Ross |last2=Anderson |title=Trojan Source: Invisible Vulnerabilities |url=https://www.trojansource.codes/trojan-source.pdf |access-date=2 November 2021}}</ref> In response, code editors started highlighting these marks to indicate forced text-direction changes.<ref>{{Cite web |title=Visual Studio Code October 2021 |url=https://code.visualstudio.com/updates/v1_62#_unicode-directional-formatting-characters |access-date=11 November 2021 |website=code.visualstudio.com |language=en}}</ref>
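As a rough illustration of the kind of check an editor or linter might perform, the following Python sketch reports explicit directional formatting characters in a string. It is a minimal example under stated assumptions, not the method used by any particular editor: the function name <code>find_bidi_controls</code> and the sample string are hypothetical, and the character set is limited to the embedding, override, and isolate controls.

```python
import unicodedata

# Explicit directional formatting characters: embeddings, overrides, isolates.
BIDI_CONTROLS = {
    "\u202A", "\u202B", "\u202C", "\u202D", "\u202E",  # LRE, RLE, PDF, LRO, RLO
    "\u2066", "\u2067", "\u2068", "\u2069",            # LRI, RLI, FSI, PDI
}


def find_bidi_controls(text):
    """Return (index, U+ notation, character name) for each control found."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch))
        for i, ch in enumerate(text)
        if ch in BIDI_CONTROLS
    ]


# Hypothetical source line containing invisible direction changes.
sample = "if access_level != 'user\u202E \u2066// check admin\u2069 \u2066':"

for hit in find_bidi_controls(sample):
    print(hit)
```

A real tool would also need to decide which occurrences are legitimate, since these characters have valid uses in right-to-left text.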