Revision as of 17:33, 9 August 2025 by Warudo (→Codespace and code points: Added the origin of U+. The cited source mentions it.)
Revision as of 04:19, 13 August 2025 by Netjeff (→Codespace and code points: Move reason for "U+" into a {{refn}} note)
''The Unicode Standard'' defines a ''codespace'':<ref name="Glossary">{{Cite web \|title=Glossary of Unicode Terms \|url=https://unicode.org/glossary/ \|access-date=16 March 2010}}</ref> a sequence of integers called ''[[code point]]s''<ref name=":0">{{Cite book \|url=https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-2/#G25564 \|title=The Unicode Standard Version 16.0 – Core Specification \|year=2024 \|chapter=2.4 Code Points and Characters}}</ref> in the range from 0 to {{val\|1114111}}, notated according to the standard as {{tt\|U+0000}}–{{tt\|U+10FFFF}}.<ref>{{Cite book \|url=https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-3/#G2212 \|title=The Unicode Standard, Version 16.0 \|year=2024 \|chapter=3.4 Characters and Encoding}}</ref> The codespace is a systematic, architecture-independent representation of ''The Unicode Standard''; actual text is processed as binary data via one of several Unicode encodings, such as [[UTF-8]]. In this normative notation, the two-character prefix <code>U+</code> always precedes a written code point, and the code points themselves are written as [[hexadecimal]] numbers.{{Refn\|The two-character prefix <code>U+</code> was chosen as an ASCII approximation of {{unichar\|U+228E}}.<ref>{{Cite mailing list \|url=https://unicode.org/mail-arch/unicode-ml/y2005-m11/0060.html \|title=Re: Origin of the U+nnnn notation \|date=8 November 2005 \|mailing-list=Unicode Mail List Archive}}</ref>\|group=note}} At least four hexadecimal digits are always written, with [[leading zero]]s prepended as needed.
For example, the code point {{unichar\|F7\|Division sign}} is padded with two leading zeros, but {{unichar\|13254\|Egyptian hieroglyph O004}} ([[File:Hiero O4.png\|class=skin-invert-image\|text-bottom\|15px]]) is not padded.<ref>{{Cite web \|date=September 2024 \|title=Appendix A: Notational Conventions \|url=https://www.unicode.org/versions/Unicode16.0.0/core-spec/appendix-a/ \|website=The Unicode Standard \|publisher=Unicode Consortium}}</ref> There are a total of {{val\|1112064}} valid code points within the codespace.<ref>{{cite book \|title=The Unicode Standard \|publisher=[[The Unicode Consortium]] \|isbn=978-1-936213-01-6 \|edition=6.0 \|location=Mountain View, California, US \|at=3.9 Unicode Encoding Forms \|chapter=Conformance \|quote=Each encoding form maps the Unicode code points U+0000..U+D7FF and U+E000..U+10FFFF \|chapter-url=https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-3/#G7404}}</ref> This number arises from the limitations of the [[UTF-16]] character encoding, which can encode the 2<sup>16</sup> code points in the range {{tt\|U+0000}} through {{tt\|U+FFFF}} except for the 2<sup>11</sup> code points in the range {{tt\|U+D800}} through {{tt\|U+DFFF}}, which are used as surrogate pairs to encode the 2<sup>20</sup> code points in the range {{tt\|U+10000}} through {{tt\|U+10FFFF}}.
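
The padding rule described above can be illustrated with a minimal Python sketch. The helper name <code>format_code_point</code> is hypothetical and chosen only for this illustration; it is not part of the standard or of any library.

```python
def format_code_point(cp: int) -> str:
    """Return the U+ notation for a code point (hypothetical helper).

    At least four hexadecimal digits are written; shorter values are
    padded with leading zeros, longer values are left as they are.
    """
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError(f"{cp!r} is outside the Unicode codespace")
    return f"U+{cp:04X}"


print(format_code_point(0xF7))      # U+00F7 (padded with two leading zeros)
print(format_code_point(0x13254))   # U+13254 (five digits, no padding)
print(format_code_point(0x10FFFF))  # U+10FFFF (last code point in the codespace)
```

The format specifier <code>04X</code> sets a minimum width of four uppercase hexadecimal digits, so it pads short values without truncating long ones.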

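The count of {{val\|1112064}} can be checked with a short Python sketch, which also shows how a surrogate pair covers a code point above {{tt\|U+FFFF}}. The names <code>valid_code_points</code> and <code>to_surrogate_pair</code> are hypothetical, introduced only for this illustration. The split into a high surrogate starting at {{tt\|U+D800}} and a low surrogate starting at {{tt\|U+DC00}} follows the UTF-16 encoding form.

```python
CODESPACE_SIZE = 0x10FFFF + 1          # code points 0 through 1114111
SURROGATE_COUNT = 0xDFFF - 0xD800 + 1  # 2**11 code points reserved for UTF-16

# Every code point except the surrogates.
valid_code_points = CODESPACE_SIZE - SURROGATE_COUNT

# The same total, built the way the paragraph describes it: the 16-bit range
# minus the surrogates, plus the 2**20 code points reached via surrogate pairs.
assert valid_code_points == 2**16 - 2**11 + 2**20
print(valid_code_points)  # 1112064


def to_surrogate_pair(cp: int) -> tuple[int, int]:
    """Split a code point in U+10000..U+10FFFF into (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF use surrogate pairs")
    offset = cp - 0x10000              # a 20-bit value
    high = 0xD800 + (offset >> 10)     # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # lower 10 bits
    return high, low


high, low = to_surrogate_pair(0x13254)
print(f"{high:04X} {low:04X}")  # D80C DE54
```

Each surrogate carries 10 bits, which is why exactly 2<sup>20</sup> additional code points are reachable and why the codespace ends at {{tt\|U+10FFFF}}.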
Unicode: Difference between revisions