Restored revision 1287006593 by 49.204.99.13 (talk): Actually I suppose this is equally the case here
(44 intermediate revisions by 32 users not shown)
Line 17:
}}
{{Contains special characters|special=uncommon Unicode characters}}
'''Unicode'''
The Unicode [[character repertoire]] is synchronized with [[Universal Coded Character Set|ISO/IEC 10646]], with the two being code-for-code identical. However, ''The Unicode Standard'' is more than just a repertoire within which characters are assigned. To aid developers and designers, the standard also provides charts and reference data, as well as annexes explaining concepts germane to various scripts, providing guidance for their implementation. Topics covered by these annexes include [[Unicode equivalence#Normalization|character normalization]], [[Combining character|character composition]] and decomposition, [[Unicode collation algorithm|collation]], and [[Bidirectional text#Unicode bidi support|directionality]].<ref>{{Cite web |title=The Unicode Standard: A Technical Introduction |url=https://www.unicode.org/standard/principles.html |date=22 August 2019 |access-date=11 September 2024}}</ref>
Unicode encodes 3,790 [[emoji]], whose continued development is conducted by the Consortium as part of the standard.<ref>{{Cite web |title=Emoji Counts, v16.0 |url=https://www.unicode.org/emoji/charts-16.0/emoji-counts.html |access-date=10 September 2024 |publisher=The Unicode Consortium}}</ref> The widespread adoption of Unicode was in large part responsible for the initial popularization of emoji outside of Japan.{{citation needed|date=June 2025}}
Unicode text is processed and stored as binary data [[comparison of Unicode encodings|using one of several encodings]], which define how to translate the standard's abstracted codes for characters into sequences of bytes. ''The Unicode Standard'' itself defines three encodings: [[UTF-8]], [[UTF-16]], and [[UTF-32]], though several others exist. Of these, UTF-8 is the most widely used by a large margin, in part due to its backwards-compatibility with [[ASCII]].
Unicode text is processed and stored as binary data [[comparison of Unicode encodings|using one of several encodings]], which define how to translate the standard's abstracted codes for characters into sequences of bytes. ''The Unicode Standard'' itself defines three encodings: [[UTF-8]], [[UTF-16]],{{efn|A large amount of documentation for Windows incorrectly uses the term "Unicode" to mean ''only'' the UTF-16 encoding.}} and [[UTF-32]], though several others exist.
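As a rough illustration of how the three encodings turn the same code points into different byte sequences, the following Python sketch encodes a short sample string with each of them. The sample string, the constant names, and the helper <code>encoded_lengths</code> are chosen here for illustration only; the big-endian codec variants are used so that no byte order mark is added.

```python
# Sample text: U+0061 (a), U+00E9, U+20AC, U+1F600, written as escapes.
SAMPLE = "a\u00e9\u20ac\U0001F600"

# Big-endian variants are used so that Python does not prepend a byte order mark.
ENCODINGS = ("utf-8", "utf-16-be", "utf-32-be")


def encoded_lengths(text):
    """Return the number of bytes the text occupies in each encoding."""
    return {name: len(text.encode(name)) for name in ENCODINGS}


# Show the byte sequence of every code point in each encoding.
for ch in SAMPLE:
    columns = "  ".join(f"{name}={ch.encode(name).hex(' ')}" for name in ENCODINGS)
    print(f"U+{ord(ch):04X}  {columns}")

print(encoded_lengths(SAMPLE))

# For text that is pure ASCII, the UTF-8 bytes are identical to the ASCII bytes.
print("Unicode".encode("utf-8") == "Unicode".encode("ascii"))
```

In UTF-8 the four sample characters take one, two, three and four bytes respectively, while UTF-32 always uses four bytes per code point.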
== Origin and development ==
Line 36:
The first 256 code points mirror the [[ISO/IEC 8859-1]] standard, with the intent of trivializing the conversion of text already written in Western European scripts. To preserve the distinctions made by different legacy encodings, therefore allowing for conversion between them and Unicode without any loss of information, many [[duplicate characters in Unicode|characters nearly identical to others]], in both appearance and intended function, were given distinct code points. For example, the [[Halfwidth and Fullwidth Forms (Unicode block)|Halfwidth and Fullwidth Forms]] block encompasses a full semantic duplicate of the Latin alphabet, because legacy [[CJK characters|CJK encodings]] contained both "fullwidth" (matching the width of CJK characters) and "halfwidth" (matching ordinary Latin script) characters.
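Both points can be checked with a short Python sketch. It assumes Python's <code>latin-1</code> codec as a stand-in for ISO/IEC 8859-1, and it uses NFKC compatibility normalization (a choice made here for illustration) to fold fullwidth Latin letters onto their ordinary counterparts; the helper name <code>fold_width</code> is hypothetical.

```python
import unicodedata

# Decoding every possible byte as Latin-1 yields the code point with the same
# numeric value, so the conversion is a direct one-to-one mapping.
latin1_text = bytes(range(256)).decode("latin-1")
print([ord(ch) for ch in latin1_text] == list(range(256)))

# FULLWIDTH LATIN CAPITAL LETTER A, B, C (U+FF21..U+FF23).
FULLWIDTH_ABC = "\uff21\uff22\uff23"


def fold_width(text):
    """Apply NFKC normalization, which replaces fullwidth forms
    with their ordinary-width compatibility equivalents."""
    return unicodedata.normalize("NFKC", text)


# The fullwidth letters are distinct code points from ASCII "ABC" ...
print(FULLWIDTH_ABC == "ABC")
# ... but compare equal once compatibility normalization is applied.
print(fold_width(FULLWIDTH_ABC) == "ABC")
```

Because the fullwidth letters keep their own code points, text converted from a legacy CJK encoding can be converted back without losing the width distinction; normalization is an optional step applied only when that distinction is not wanted.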
The Unicode Bulldog Award is given to people deemed to be influential in Unicode's development, with recipients including [[Tatsuo Kobayashi]], Thomas Milo,
=== {{anchor|Unicode 88}}History ===
Line 64:
{{Main|Unicode Consortium}}
The Unicode Consortium is a
Over the years several countries or government agencies have been members of the Unicode Consortium.<ref name="members" />
Line 75:
[[File:Unicode sample.png|class=skin-invert-image|thumb|right|200px|Many modern applications can render a substantial subset of the many [[scripts in Unicode]], as demonstrated by this screenshot from the [[OpenOffice.org]] application.]]<!-- screenshot fair use rationale: this screenshot is used specifically to illustrate the Unicode-related capabilities of modern desktop applications and the breadth of supported Unicode scripts -->
{{As of|September 2024}}, a total of 168<ref>{{Cite web |title=Supported Scripts |url=https://www.unicode.org/standard/supported.html |access-date=16 September 2022 |website=Unicode}}</ref>
=== Proposals for adding scripts ===
The Unicode Roadmap Committee ([[Michael Everson]], Rick McGowan, Ken Whistler, V.S. Umamaheswaran)<ref>{{Cite web |title=Roadmap to the BMP |url=https://www.unicode.org/roadmaps/bmp/ |access-date=30 July 2018 |publisher=[[Unicode Consortium]]}}</ref> maintains the list of scripts that are candidates or potential candidates for encoding and their tentative code block assignments on the Unicode Roadmap<ref>{{Cite web|url=https://www.unicode.org/roadmaps/|title=Roadmaps to Unicode|website=Unicode |url-status=live |archive-url= https://web.archive.org/web/20231208091250/http://www.unicode.org/roadmaps/ |archive-date= Dec 8, 2023 }}</ref> page of the [[Unicode Consortium]] website. For some scripts on the Roadmap, such as [[Jurchen script|Jurchen]] and [[Khitan large script]], encoding proposals have been made and they are working their way through the approval process. For other scripts, such as [[Numidian language|Numidian]] and [[Rongorongo]], no proposal has yet been made, and they await agreement on character repertoire and other details from the user communities involved.
Line 85:
There is also a [[Medieval Unicode Font Initiative]] focused on special Latin medieval characters. Some of these proposals have already been included in Unicode.
The Script Encoding Initiative (SEI),<ref>{{Cite web|url=https://
=== Versions ===
Line 149:
| rowspan="2" | 25
| {{val|38,885}}{{su|p={{val|+11,373}}|b={{val|−6,656}}}}
| style="text-align:left" | Original set of Hangul syllables removed, new set of 11,172 Hangul syllables added at new location, Tibetan added back in a new location and with a different character repertoire, Surrogate character mechanism defined, Plane 15 and Plane 16
-->
Line 394:
=== Projected versions ===
The Unicode Consortium normally releases a new version of ''The Unicode Standard'' once a year. Version 17.0, the next major version, is projected to include 4,301 new unified [[CJK characters]], forming CJK Unified Ideographs Extension J.<ref>{{Cite web|url=https://unicode.org/alloc/Pipeline.html|title=Proposed New Characters: The Pipeline|date=September 10, 2024|website=Unicode|accessdate=September 13, 2024}}</ref><ref>{{Cite web|url=https://emojipedia.org/unicode-16.0|title=Unicode Version 16.0|website=emojipedia.org|accessdate=September 13, 2023}}</ref>
== Architecture and terminology ==
Line 402:
''The Unicode Standard'' defines a ''codespace'':<ref name="Glossary">{{Cite web |title=Glossary of Unicode Terms |url=https://unicode.org/glossary/ |access-date=16 March 2010}}</ref> a sequence of integers called ''[[code point]]s''<ref name=":0">{{Cite book |url=https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-2/#G25564 |title=The Unicode Standard Version 16.0 – Core Specification |year=2024 |chapter=2.4 Code Points and Characters}}</ref> in the range from 0 to {{val|1114111}}, notated according to the standard as {{tt|U+0000}}–{{tt|U+10FFFF}}.<ref>{{Cite book |url=https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-3/#G2212 |title=The Unicode Standard, Version 16.0 |year=2024 |chapter=3.4 Characters and Encoding}}</ref> The codespace is a systematic, architecture-independent representation of ''The Unicode Standard''; actual text is processed as binary data via one of several Unicode encodings, such as [[UTF-8]].
In this normative notation, the two-character prefix <code>U+</code> always precedes a written code point, and the code points themselves are written as [[hexadecimal]] numbers.{{Refn|The two-character prefix <code>U+</code> was chosen as an ASCII approximation of {{unichar|U+228E}}.<ref>{{Cite mailing list |url=https://unicode.org/mail-arch/unicode-ml/y2005-m11/0060.html |title=Re: Origin of the U+nnnn notation |date=8 November 2005 |mailing-list=Unicode Mail List Archive}}</ref>
There are a total of {{val|1112064}} valid code points within the codespace.<ref>{{cite book |title=The Unicode Standard |publisher=[[The Unicode Consortium]] |isbn=978-1-936213-01-6 |edition=6.0 |location=Mountain View, California, US |at=3.9 Unicode Encoding Forms |chapter=Conformance |quote=Each encoding form maps the Unicode code points U+0000..U+D7FF and U+E000..U+10FFFF |chapter-url=https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-3/#G7404}}</ref> This number arises from the limitations of the [[UTF-16]] character encoding, which can encode the 2<sup>16</sup> code points in the range {{tt|U+0000}} through {{tt|U+FFFF}} except for the 2<sup>11</sup> code points in the range {{tt|U+D800}} through {{tt|U+DFFF}}, which are used as surrogate pairs to encode the 2<sup>20</sup> code points in the range {{tt|U+10000}} through {{tt|U+10FFFF}}.
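The count of valid code points, the <code>U+</code> notation, and the surrogate mechanism can be worked through in a short Python sketch. The function names <code>format_code_point</code> and <code>to_surrogate_pair</code> are hypothetical and used only for illustration; the pair computation follows the usual UTF-16 arithmetic of splitting a 20-bit offset into two 10-bit halves.

```python
# Sizes of the ranges named in the text.
BMP_SIZE = 2**16             # U+0000 through U+FFFF
SURROGATE_COUNT = 2**11      # U+D800 through U+DFFF
SUPPLEMENTARY_SIZE = 2**20   # U+10000 through U+10FFFF

# 65,536 - 2,048 + 1,048,576 = 1,112,064 valid code points.
VALID_CODE_POINTS = BMP_SIZE - SURROGATE_COUNT + SUPPLEMENTARY_SIZE
print(VALID_CODE_POINTS)


def format_code_point(cp):
    """Write a code point in U+ notation: hexadecimal, at least four digits."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("outside the Unicode codespace")
    return f"U+{cp:04X}"


def to_surrogate_pair(cp):
    """Split a code point above U+FFFF into its UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF use surrogate pairs")
    offset = cp - 0x10000               # a 20-bit value
    high = 0xD800 + (offset >> 10)      # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # lower 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F600)
print(format_code_point(0x1F600), "->", format_code_point(high), format_code_point(low))
```

Each surrogate carries 10 bits, so a pair addresses 2<sup>10</sup> × 2<sup>10</sup> = 2<sup>20</sup> code points, which is exactly the size of the range above {{tt|U+FFFF}}.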
Line 787:
| url = http://cajun.cs.nott.ac.uk/wiley/journals/epobetan/pdf/volume6/issue3/bigelow.pdf
| journal = Electronic Publishing
|
| volume = 6
| issue = 3
Line 846:
=== Security<span class="anchor" id="Security issues"></span> ===
Unicode has a large number of [[homoglyphs]], many of which look very similar or identical to ASCII letters. Substitution of these can make an identifier or URL that looks correct, but directs to a different location than expected.<ref>{{Cite web |title=UTR #36: Unicode Security Considerations |url=https://unicode.org/reports/tr36/ |website=Unicode}}</ref> Homoglyphs can also be used to manipulate the output of [[NLP (computer science)|natural language processing (NLP)]] systems.<ref>{{Cite book |last1=Boucher |first1=Nicholas |last2=Shumailov |first2=Ilia |last3=Anderson |first3=Ross |last4=Papernot |first4=Nicolas |title=2022 IEEE Symposium on Security and Privacy (SP) |chapter=Bad Characters: Imperceptible NLP Attacks |year=2022
A security advisory was released in 2021 by two researchers, one from the [[University of Cambridge]] and the other from the [[University of Edinburgh]], in which they assert that the [[Bidirectional text#Explicit formatting|BiDi marks]] can be used to make large sections of code do something different from what they appear to do. The problem was named "[[Trojan Source]]".<ref>{{Cite web |first1=Nicholas |last1=Boucher |first2=Ross |last2=Anderson |title=Trojan Source: Invisible Vulnerabilities |url=https://www.trojansource.codes/trojan-source.pdf |access-date=2 November 2021}}</ref> In response, code editors started highlighting marks to indicate forced text-direction changes.<ref>{{Cite web |title=Visual Studio Code October 2021 |url=https://code.visualstudio.com/updates/v1_62#_unicode-directional-formatting-characters |access-date=11 November 2021 |website=code.visualstudio.com |language=en}}</ref>
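A minimal Python sketch of the kind of check involved is given below. It is not the method used by any particular editor: the set <code>BIDI_CONTROLS</code> (the embedding, override and isolate characters) and the function names are assumptions made for this example, and the homoglyph check simply reports every non-ASCII character rather than detecting lookalikes as such.

```python
import unicodedata

# Explicit directional formatting characters (an assumed, illustrative set).
BIDI_CONTROLS = {
    "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",  # LRE, RLE, PDF, LRO, RLO
    "\u2066", "\u2067", "\u2068", "\u2069",            # LRI, RLI, FSI, PDI
}


def describe(index, ch):
    """Return (position, U+ notation, character name) for one character."""
    return (index, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))


def find_bidi_controls(text):
    """List every explicit directional formatting character in the text."""
    return [describe(i, ch) for i, ch in enumerate(text) if ch in BIDI_CONTROLS]


def find_non_ascii(text):
    """List every character outside ASCII, where homoglyphs of ASCII letters hide."""
    return [describe(i, ch) for i, ch in enumerate(text) if ord(ch) > 0x7F]


# A right-to-left override hidden inside otherwise ordinary text.
print(find_bidi_controls("if access\u202e == 1"))

# "paypal" spelled with CYRILLIC SMALL LETTER A (U+0430) as the second letter.
spoofed = "p\u0430ypal"
print(spoofed == "paypal")
print(find_non_ascii(spoofed))
```

A check like this only surfaces suspicious characters for review; legitimate right-to-left or non-Latin text will also contain them, so flagged results need human judgment.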