Unicode: Difference between revisions

 
(18 intermediate revisions by 15 users not shown)
Line 17:
}}
{{Contains special characters|special=uncommon Unicode characters}}
'''Unicode''' (also known as '''''The Unicode Standard''''' and '''TUS'''<ref>{{Cite web|date=27 March 2002 |title=Unicode Technical Report #28: Unicode 3.2 |url=https://www.unicode.org/reports/tr28/tr28-3.html#errata |access-date=23 June 2022 |website=Unicode Consortium}}</ref><ref>{{Cite web |last=Jenkins |first=John H. |date=26 August 2021 |title=Unicode Standard Annex #45: U-source Ideographs |url=https://www.unicode.org/reports/tr45/tr45-25.html |access-date=23 June 2022 |website=Unicode Consortium |at=§2.2 The Source Field}}</ref>) is a [[character encoding]] standard maintained by the [[Unicode Consortium]] designed to support the use of text in all of the world's [[writing system]]s that can be digitized. Version 16.0{{efn-ua|name=standard-latest}} defines 154,998 [[Character (computing)|characters]] and 168 [[script (Unicode)|scripts]]<ref>{{multiref |<!-- Graphic + Format count is used here -->{{Cite web|url=https://www.unicode.org/versions/stats/charcountv16_0.html|title=Unicode Character Count V16.0 |date=10 September 2024 |publisher=The Unicode Consortium}} | {{Cite web|title=Unicode 16.0 Versioned Charts Index|url=https://www.unicode.org/charts/PDF/Unicode-16.0/ |publisher=The Unicode Consortium |date=10 September 2024}} | {{Cite web |title=Supported Scripts |url=https://www.unicode.org/standard/supported.html |access-date=11 September 2024 |date=10 September 2024 |publisher=The Unicode Consortium}} }}</ref> used in various ordinary, literary, academic, and technical contexts.
 
Unicode has largely supplanted the previous environment of a myriad of incompatible [[character sets]] used within different locales and on different computer architectures. The entire repertoire of these sets, plus many additional characters, was merged into the single Unicode set. Unicode is used to encode the vast majority of text on the Internet, including most [[web pages]], and relevant Unicode support has become a common consideration in contemporary software development. Unicode is ultimately capable of encoding more than 1.1 million characters.
 
The Unicode [[character repertoire]] is synchronized with [[Universal Coded Character Set|ISO/IEC 10646]], the two being code-for-code identical. However, ''The Unicode Standard'' is more than just a repertoire within which characters are assigned. To aid developers and designers, the standard also provides charts and reference data, as well as annexes that explain concepts germane to various scripts and provide guidance for their implementation. Topics covered by these annexes include [[Unicode equivalence#Normalization|character normalization]], [[Combining character|character composition]] and decomposition, [[Unicode collation algorithm|collation]], and [[Bidirectional text#Unicode bidi support|directionality]].<ref>{{Cite web |title=The Unicode Standard: A Technical Introduction |url=https://www.unicode.org/standard/principles.html |date=22 August 2019 |access-date=11 September 2024}}</ref>
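
As a minimal illustration of composition, decomposition, and normalization, the following Python sketch uses the standard library's <code>unicodedata</code> module. The sample character (é) is an assumption chosen for illustration and is not taken from the annexes themselves.

```python
import unicodedata

composed = "\u00e9"     # LATIN SMALL LETTER E WITH ACUTE as one code point
decomposed = "e\u0301"  # "e" followed by COMBINING ACUTE ACCENT

# The two strings display alike but are different code point sequences.
assert composed != decomposed

# NFC composes where possible; NFD decomposes into base plus combining marks.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)

print(nfc == composed)    # True
print(nfd == decomposed)  # True
```

Comparing strings only after normalizing both to the same form is a common way to treat such canonically equivalent sequences as equal.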
Line 75:
[[File:Unicode sample.png|class=skin-invert-image|thumb|right|200px|Many modern applications can render a substantial subset of the many [[scripts in Unicode]], as demonstrated by this screenshot from the [[OpenOffice.org]] application.]]<!-- screenshot fair use rationale: this screenshot is used specifically to illustrate the Unicode-related capabilities of modern desktop applications and the breadth of supported Unicode scripts -->
 
{{As of|September 2024}}, a total of 168<ref>{{Cite web |title=Supported Scripts |url=https://www.unicode.org/standard/supported.html |access-date=16 September 2022 |website=Unicode}}</ref> [[Script (Unicode)|scripts]] ([[alphabet]]s, [[abugida]]s and [[syllabary|syllabaries]]) are included in Unicode, covering most major [[writing system]]s in use today.<ref>{{Cite book |last=Otung |first=Ifiok |url=https://books.google.com/books?id=4OMXEAAAQBAJ&q=unicode+covers+almost+all+characters |title=Communication Engineering Principles |date=2021-01-28 |publisher=John Wiley & Sons |isbn=978-1-119-27407-0 |language=en|page=12}}</ref><ref>{{Cite web |title=Unicode FAQ |url=https://home.unicode.org/basic-info/faq/ |access-date=2 April 2020}}</ref> There are still scripts that are not yet encoded, particularly those mainly used in historical, liturgical, and academic contexts. Further additions of characters to the already encoded scripts, as well as [[Symbol|symbols]], in particular for mathematics and [[musical notation|music]], also occur.
 
=== Proposals for adding scripts ===
{{As of|2024}}, a total of 168 [[Script (Unicode)|scripts]]<ref>{{Cite web |title=Supported Scripts |url=https://www.unicode.org/standard/supported.html |access-date=16 September 2022 |website=Unicode}}</ref> are included in the latest version of Unicode (covering [[alphabet]]s, [[abugida]]s and [[syllabary|syllabaries]]), although there are still scripts that are not yet encoded, particularly those mainly used in historical, liturgical, and academic contexts. Further additions of characters to the already encoded scripts, as well as symbols, in particular for mathematics and [[musical notation|music]] (in the form of notes and rhythmic symbols), also occur.
 
The Unicode Roadmap Committee ([[Michael Everson]], Rick McGowan, Ken Whistler, V.S. Umamaheswaran)<ref>{{Cite web |title=Roadmap to the BMP |url=https://www.unicode.org/roadmaps/bmp/ |access-date=30 July 2018 |publisher=[[Unicode Consortium]]}}</ref> maintains the list of scripts that are candidates or potential candidates for encoding, together with their tentative code block assignments, on the Unicode Roadmap<ref>{{Cite web|url=https://www.unicode.org/roadmaps/|title=Roadmaps to Unicode|website=Unicode |url-status=live |archive-url= https://web.archive.org/web/20231208091250/http://www.unicode.org/roadmaps/ |archive-date= Dec 8, 2023 }}</ref> page of the [[Unicode Consortium]] website. For some scripts on the Roadmap, such as [[Jurchen script|Jurchen]] and [[Khitan large script]], encoding proposals have been made and are working their way through the approval process. For other scripts, such as [[Numidian language|Numidian]] and [[Rongorongo]], no proposal has yet been made, and they await agreement on character repertoire and other details from the user communities involved.
Line 85:
There is also a [[Medieval Unicode Font Initiative]] focused on special Latin medieval characters. Some of these proposals have already been included in Unicode.
 
=== {{anchor|Script Encoding Initiative}} Script Encoding Initiative ===
The Script Encoding Initiative (SEI),<ref>{{Cite web|url=https://linguisticssei.berkeley.edu/sei/ |title=Script Encoding Initiative|website=Script Encoding Initiative |url-status=live |archive-url=https://web.archive.org/web/20230325131114/https://linguistics.berkeley.edu/sei/ |archive-date= Mar 25, 2023 }}</ref> a project created by Deborah Anderson at the [[University of California, Berkeley]], was founded in 2002 with the goal of funding proposals for scripts not yet encoded in the standard. Now run by Anushah Hossain, SEI has become a major source of proposed additions to the standard in recent years.<ref>{{Cite web |title=About The Script Encoding Initiative |url=https://www.unicode.org/pending/about-sei.html |access-date=4 June 2012 |publisher=The Unicode Consortium}}</ref> Although SEI collaborates with the Unicode Consortium and the ISO/IEC 10646 standards process, it operates independently, supporting the technical, linguistic, and historical research needed to prepare formal proposals. SEI maintains a database of scripts that have yet to be encoded in the Unicode Standard on the project's website.<ref>{{Cite web |title=Scripts to Encode |url=https://sei.berkeley.edu/scripts-to-encode/ }}</ref>
 
=== Versions ===
Line 402:
''The Unicode Standard'' defines a ''codespace'':<ref name="Glossary">{{Cite web |title=Glossary of Unicode Terms |url=https://unicode.org/glossary/ |access-date=16 March 2010}}</ref> a sequence of integers called ''[[code point]]s''<ref name=":0">{{Cite book |url=https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-2/#G25564 |title=The Unicode Standard Version 16.0 – Core Specification |year=2024 |chapter=2.4 Code Points and Characters}}</ref> in the range from 0 to {{val|1114111}}, notated according to the standard as {{tt|U+0000}}–{{tt|U+10FFFF}}.<ref>{{Cite book |url=https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-3/#G2212 |title=The Unicode Standard, Version 16.0 |year=2024 |chapter=3.4 Characters and Encoding}}</ref> The codespace is a systematic, architecture-independent representation of ''The Unicode Standard''; actual text is processed as binary data via one of several Unicode encodings, such as [[UTF-8]].
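
The distinction between an abstract code point and its binary representation can be sketched in Python, whose <code>str</code> type is a sequence of code points. The sample character (the euro sign) is an assumption chosen for illustration.

```python
MAX_CODE_POINT = 0x10FFFF  # 1114111, the upper end of the codespace

ch = "\u20ac"              # EURO SIGN
code_point = ord(ch)       # the abstract integer, 0x20AC

# The same code point serialized by three different Unicode encodings.
utf8 = ch.encode("utf-8")       # three bytes: E2 82 AC
utf16 = ch.encode("utf-16-be")  # two bytes:   20 AC
utf32 = ch.encode("utf-32-be")  # four bytes:  00 00 20 AC

print(hex(code_point), utf8.hex(), utf16.hex(), utf32.hex())

# Integers beyond the codespace are not code points.
try:
    chr(MAX_CODE_POINT + 1)
except ValueError:
    print("0x110000 is outside the codespace")
```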
 
In this normative notation, the two-character prefix <code>U+</code> always precedes a written code point, and the code points themselves are written as [[hexadecimal]] numbers.{{Refn|The two-character prefix <code>U+</code> was chosen as an ASCII approximation of {{unichar|U+228E}}.<ref>{{Cite mailing list |url=https://unicode.org/mail-arch/unicode-ml/y2005-m11/0060.html |title=Re: Origin of the U+nnnn notation |date=8 November 2005 |mailing-list=Unicode Mail List Archive}}</ref>|group=note}} At least four hexadecimal digits are always written, with [[leading zero]]s prepended as needed. For example, the code point {{unichar|F7|Division sign}} is padded with two leading zeros, but {{unichar|13254|Egyptian hieroglyph O004}} ([[File:Hiero O4.png|class=skin-invert-image|text-bottom|15px]]) is not padded.<ref>{{Cite web |date=September 2024 |title=Appendix A: Notational Conventions |url=https://www.unicode.org/versions/Unicode16.0.0/core-spec/appendix-a/ |website=The Unicode Standard |publisher=Unicode Consortium}}</ref>
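
The padding rule can be sketched as a short Python function. The name <code>format_code_point</code> is hypothetical and introduced here only for illustration.

```python
def format_code_point(cp: int) -> str:
    """Render an integer in U+ notation with at least four hex digits."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError(f"{cp!r} is outside the Unicode codespace")
    # "04X" pads with leading zeros to four digits and never truncates.
    return f"U+{cp:04X}"

print(format_code_point(0xF7))     # U+00F7  (padded with two leading zeros)
print(format_code_point(0x13254))  # U+13254 (five digits, no padding)
```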
 
There are a total of {{val|1112064}} valid code points within the codespace.<ref>{{cite book |title=The Unicode Standard |publisher=[[The Unicode Consortium]] |isbn=978-1-936213-01-6 |edition=6.0 |location=Mountain View, California, US |at=3.9 Unicode Encoding Forms |chapter=Conformance |quote=Each encoding form maps the Unicode code points U+0000..U+D7FF and U+E000..U+10FFFF |chapter-url=https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-3/#G7404}}</ref> This number arises from the limitations of the [[UTF-16]] character encoding, which can encode the 2<sup>16</sup> code points in the range {{tt|U+0000}} through {{tt|U+FFFF}} except for the 2<sup>11</sup> code points in the range {{tt|U+D800}} through {{tt|U+DFFF}}, which are used as surrogate pairs to encode the 2<sup>20</sup> code points in the range {{tt|U+10000}} through {{tt|U+10FFFF}}.
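
The arithmetic behind this count, and the way a surrogate pair carries a 20-bit offset, can be checked with a short Python sketch. The function name <code>to_surrogate_pair</code> is hypothetical; the bit layout follows the usual description of UTF-16.

```python
# 2**16 - 2**11 + 2**20 valid code points (surrogates excluded).
VALID_CODE_POINTS = 2**16 - 2**11 + 2**20
assert VALID_CODE_POINTS == 1_112_064
assert VALID_CODE_POINTS == (0x10FFFF + 1) - (0xDFFF - 0xD800 + 1)

def to_surrogate_pair(cp: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into UTF-16 high and low surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points U+10000..U+10FFFF use surrogates")
    offset = cp - 0x10000              # a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low

high, low = to_surrogate_pair(0x13254)
print(f"U+{high:04X} U+{low:04X}")     # U+D80C U+DE54
```

Each half of the pair carries 10 bits, which is why the 2<sup>11</sup> reserved surrogate code points are enough to address 2<sup>20</sup> supplementary code points.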
Line 846:
 
=== Security<span class="anchor" id="Security issues"></span> ===
Unicode has a large number of [[homoglyphs]], many of which look very similar or identical to ASCII letters. Substituting these can produce an identifier or URL that looks correct but directs to a different location than expected.<ref>{{Cite web |title=UTR #36: Unicode Security Considerations |url=https://unicode.org/reports/tr36/ |website=Unicode}}</ref> Homoglyphs can also be used to manipulate the output of [[NLP (computer science)|natural language processing (NLP)]] systems.<ref>{{Cite book |last1=Boucher |first1=Nicholas |last2=Shumailov |first2=Ilia |last3=Anderson |first3=Ross |last4=Papernot |first4=Nicolas |title=2022 IEEE Symposium on Security and Privacy (SP) |chapter=Bad Characters: Imperceptible NLP Attacks |year=2022 |chapter-url=https://ieeexplore.ieee.org/document/9833641 |location=San Francisco, CA, US |publisher=IEEE |pages=1987–2004 |arxiv=2106.09898 |doi=10.1109/SP46214.2022.9833641 |isbn=978-1-66541-316-9 |s2cid=235485405}}</ref> Mitigation requires disallowing these characters, displaying them differently, or requiring that they resolve to the same identifier;<ref>{{Cite web |last=Engineering |first=Spotify |date=2013-06-18 |title=Creative usernames and Spotify account hijacking |url=https://engineering.atspotify.com/2013/06/creative-usernames/ |access-date=2023-04-15 |website=Spotify Engineering |language=en-US}}</ref> all of this is complicated by the huge and constantly changing set of characters.<ref>{{cite tech report | last=Wheeler | first=David A. | title=Initial Analysis of Underhanded Source Code | year=2020 | jstor=resrep25332.7 | url=http://www.jstor.org/stable/resrep25332.7 | page=4–1–4–10}}</ref><ref>{{Cite web |title=UTR #36: Unicode Security Considerations |url=https://unicode.org/reports/tr36/ |access-date=27 June 2022 |website=Unicode}}</ref>
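
A homoglyph substitution can be illustrated with a small Python sketch. The strings and the helper <code>describe_non_ascii</code> are hypothetical examples, and the check shown is only a first-pass diagnostic, not a complete mitigation.

```python
import unicodedata

genuine = "paypal"
spoofed = "p\u0430yp\u0430l"  # U+0430 CYRILLIC SMALL LETTER A replaces Latin "a"

# The strings look alike in many fonts but compare unequal.
assert genuine != spoofed

def describe_non_ascii(text: str) -> list[tuple[int, str, str]]:
    """List (index, U+ notation, character name) for each non-ASCII character."""
    return [
        (i, f"U+{ord(c):04X}", unicodedata.name(c, "<unnamed>"))
        for i, c in enumerate(text)
        if ord(c) > 0x7F
    ]

for hit in describe_non_ascii(spoofed):
    print(hit)

# Compatibility normalization does not help here: Cyrillic "а" has no
# decomposition to Latin "a", so NFKC leaves the spoofed string unchanged.
assert unicodedata.normalize("NFKC", spoofed) == spoofed
```

Because normalization alone does not fold such look-alikes together, practical checks generally rely on dedicated confusable-character data or on script-mixing restrictions.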
 
A security advisory was released in 2021 by two researchers, one from the [[University of Cambridge]] and the other from the [[University of Edinburgh]], in which they assert that [[Bidirectional text#Explicit formatting|BiDi marks]] can be used to make large sections of code do something different from what they appear to do. The problem was named "[[Trojan Source]]".<ref>{{Cite web |first1=Nicholas |last1=Boucher |first2=Ross |last2=Anderson |title=Trojan Source: Invisible Vulnerabilities |url=https://www.trojansource.codes/trojan-source.pdf |access-date=2 November 2021}}</ref> In response, code editors started highlighting marks to indicate forced text-direction changes.<ref>{{Cite web |title=Visual Studio Code October 2021 |url=https://code.visualstudio.com/updates/v1_62#_unicode-directional-formatting-characters |access-date=11 November 2021 |website=code.visualstudio.com |language=en}}</ref>
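
A scan of the kind such highlighting performs might be sketched in Python as follows. The function <code>find_bidi_controls</code> and the sample source text are hypothetical, and the table covers only the explicit directional formatting characters (embeddings, overrides, and isolates); it is not a description of how any particular editor implements the feature.

```python
# Explicit bidirectional formatting characters and their usual abbreviations.
BIDI_CONTROLS = {
    "\u202a": "LRE", "\u202b": "RLE", "\u202c": "PDF",
    "\u202d": "LRO", "\u202e": "RLO",
    "\u2066": "LRI", "\u2067": "RLI", "\u2068": "FSI", "\u2069": "PDI",
}

def find_bidi_controls(source: str) -> list[tuple[int, int, str]]:
    """Return (line, column, abbreviation) for each bidi control, 1-based."""
    hits = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in BIDI_CONTROLS:
                hits.append((line_no, col, BIDI_CONTROLS[ch]))
    return hits

# Hypothetical snippet with an override and an isolate hidden in a comment.
sample = "x = 1\nif admin:  # \u202e check \u2066\n    pass\n"
for line_no, col, name in find_bidi_controls(sample):
    print(f"line {line_no}, column {col}: {name}")
```

A review tool could reject or flag any source file for which this function returns a non-empty list.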