Script (Unicode): Difference between revisions

Content deleted Content added
No edit summary
External links: Updated URL for SEI website
 
(13 intermediate revisions by 8 users not shown)
Line 5:
{{ISO 15924/unicode-script-illustration}}
 
In [[Unicode]], a '''script''' is a collection of [[Letter (alphabet)|letter]]s and other written signs used to represent textual information in one or more [[writing system]]s.<ref>{{cite web|url=http://unicode.org/glossary/|title=Glossary|website=unicode.org}}</ref> Some scripts support only one writing system and [[Written language|language]], for example, [[Armenian language|Armenian]]. Other scripts support many different writing systems; for example, the [[Latin script in Unicode|Latin script]] supports [[English alphabet|English]], [[French alphabet|French]], [[German alphabet|German]], [[Italian alphabet|Italian]], [[Vietnamese language|Vietnamese]], [[Latin alphabet|Latin]] itself, and several other languages. Some languages have used more than one writing system and therefore more than one script; for example, [[Turkish language|Turkish]] was written in the [[Ottoman Turkish alphabet|Arabic]] script before the 20th century and switched to the Latin script in the early 20th century. More or less complementary to scripts are [[Unicode symbols|symbols]] and Unicode [[control character]]s.
 
The unified [[Combining Diacritical Marks for Symbols|diacritical character]]s and unified [[General Punctuation|punctuation characters]] frequently have the "common" or "inherited" script property. However, the individual scripts often have their own [[punctuation]] and [[diacritic]]s, so that many scripts include not only letters but also diacritic and other marks, punctuation, numerals and even their own idiosyncratic symbols and [[Space (punctuation)|space]] characters.
 
Unicode {{Unicode version|version=16.0}} defines 168 separate scripts, including 99 modern scripts and 69 ancient or historic scripts.<ref>{{cite web|url=https://www.unicode.org/Public/UNIDATA/Scripts.txt|title=Unicode Character Database: Scripts|website=unicode.org}}</ref><ref>{{cite book | title = The Unicode Standard, Version 15.0 | chapter = Chapter 14: Additional Ancient and Historic Scripts | publisher = Unicode, Inc | date = September 2022 | location = Mountain View, CA | url = https://www.unicode.org/versions/Unicode15.0.0/ch14.pdf | isbn = 978-1-936213-32-0 }}</ref> More scripts are in the process of being encoded or have been tentatively allocated for encoding in roadmaps.<ref>[https://www.unicode.org/roadmaps/ Roadmaps to Unicode]</ref>
 
== Definition and classification ==
Line 30:
 
== Character categories within scripts ==
Unicode provides a general category property for each character, so in addition to belonging to a script, every character also has a general category. Scripts typically include several kinds of letter characters: uppercase letters, lowercase letters and modifier letters. A few [[Precomposed character|precomposed]] ligatures, such as Dz (U+01F2), are categorized as titlecase letters. Such titlecase ligatures are all in the Latin and Greek scripts and are all [[Unicode compatibility characters|compatibility characters]], and therefore Unicode discourages their use by authors. It is unlikely that new titlecase letters will be added in the future.
 
Most writing systems do not differentiate between uppercase and lowercase letters. For those scripts, all letters are categorized as "other letter" or "modifier letter". Ideographs, such as Unihan ideographs, are also categorized as "other letter". A few scripts do differentiate between uppercase and lowercase, however: Latin, Cyrillic, Greek, Armenian, Georgian, and Deseret. Even in these scripts, some letters are neither uppercase nor lowercase.
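
The letter categories described above can be inspected with a short Python sketch. It relies only on the standard library's <code>unicodedata.category</code>, which returns the two-letter general category code (<code>Lu</code> uppercase, <code>Ll</code> lowercase, <code>Lt</code> titlecase, <code>Lm</code> modifier, <code>Lo</code> other letter). The helper name <code>general_category</code> and the choice of sample characters are illustrative assumptions, not part of any Unicode API.

```python
import unicodedata


def general_category(ch: str) -> str:
    """Return the two-letter general category code of one character."""
    return unicodedata.category(ch)


# Hypothetical sample characters, chosen only for illustration.
SAMPLES = {
    "LATIN CAPITAL LETTER A": "\u0041",
    "LATIN SMALL LETTER A": "\u0061",
    "titlecase ligature U+01F2": "\u01F2",
    "MODIFIER LETTER SMALL H": "\u02B0",
    "LATIN LETTER DENTAL CLICK": "\u01C0",  # a Latin letter with no case
    "HEBREW LETTER ALEF": "\u05D0",         # Hebrew has no case distinction
    "CJK ideograph U+4E2D": "\u4E2D",
    "DESERET CAPITAL LETTER LONG I": "\U00010400",
}

if __name__ == "__main__":
    for label, ch in SAMPLES.items():
        print(f"U+{ord(ch):04X}  {general_category(ch)}  {label}")
```

The reported codes depend on the version of the Unicode Character Database bundled with the Python interpreter, available as <code>unicodedata.unidata_version</code>.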
Line 36:
Scripts can also contain any other general category character such as '''marks''' (diacritic and otherwise), '''numbers''' (numerals), '''punctuation''', '''separators''' (word separators such as spaces), '''symbols''' and non-graphical '''format''' characters. These are included in a particular script when they are unique to that script. Other such characters are generally unified and included in the punctuation or diacritic blocks. However, the bulk of characters in any script (other than the common and inherited scripts) are letters.
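
As a rough illustration of how letters dominate a script's repertoire, the sketch below tallies the major general category classes over a range of code points. Python's standard library does not expose the script property, so the Armenian ''block'' (U+0530 to U+058F) is used here as a stand-in for the Armenian script; a block and a script are not the same thing, and this substitution is an assumption made only for the example. The function name <code>tally_major_classes</code> is likewise hypothetical.

```python
import unicodedata
from collections import Counter

# The first letter of a general category code identifies its major class.
MAJOR_CLASS = {
    "L": "letter",
    "M": "mark",
    "N": "number",
    "P": "punctuation",
    "Z": "separator",
    "S": "symbol",
    "C": "other",  # control, format, private use, surrogate
}


def tally_major_classes(first: int, last: int) -> Counter:
    """Count assigned code points in [first, last] by major class."""
    counts = Counter()
    for cp in range(first, last + 1):
        category = unicodedata.category(chr(cp))
        if category == "Cn":  # unassigned code point, skip it
            continue
        counts[MAJOR_CLASS[category[0]]] += 1
    return counts


if __name__ == "__main__":
    # Assumption: the Armenian block approximates the Armenian script.
    armenian = tally_major_classes(0x0530, 0x058F)
    for name, count in armenian.most_common():
        print(f"{name:12} {count}")
```

Exact counts vary with the Unicode version that the interpreter ships, so none are quoted here; the expected pattern is that letters far outnumber the punctuation and symbol characters in the range.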
 
== <span class="anchor" id="List of scripts in Unicode"></span> List of encoded scripts ==
{{As of|September 2024|alt=As of version 16.0}}, Unicode defines 168 scripts (called "Alias" or "Property value alias") based on the ISO 15924 list. In addition, Unicode assigns the name "Common" to ISO 15924's {{code|Zyyy}} code for undetermined scripts, "Inherited" to ISO 15924's {{code|Zinh}} code for inherited scripts, and "Unknown" to ISO 15924's {{code|Zzzz}} code for uncoded scripts. Some script codes defined by ISO 15924 are not used in Unicode, including {{code|Zsym}} (Symbols) and {{code|Zmth}} (Mathematical notation).
Unicode defines over a hundred script names (called "Alias" or "Property value alias"), based on the ISO 15924 list.
Unicode uses the "Common" script name for ISO 15924's Zyyy (code for undetermined script), "Inherited" for ISO 15924's Zinh (code for inherited script), and "Unknown" for ISO 15924's Zzzz (code for uncoded script). Among the ISO 15924 script codes that are not used are Zsym (Symbols) and Zmth (Mathematical notation). These are not considered to be scripts in the Unicode sense.
{{ISO 15924 script codes and related Unicode data|state=uncollapsed}}
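
The script assignments themselves are published in the Unicode Character Database file <code>Scripts.txt</code>, cited above. The following Python sketch shows one way such data could be looked up. It parses a small, abbreviated excerpt written in that file's line format (a code point or range, a semicolon, then the script name) and falls back to "Unknown" for code points that are not listed. The excerpt, the function names and the lookup strategy are assumptions for illustration; a real program would load the complete file.

```python
import bisect
import re

# Abbreviated, illustrative excerpt in the Scripts.txt line format.
SAMPLE_SCRIPTS_TXT = """\
0020          ; Common # Zs       SPACE
0041..005A    ; Latin # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
0061..007A    ; Latin # L&  [26] LATIN SMALL LETTER A..LATIN SMALL LETTER Z
0300..036F    ; Inherited # Mn [112] COMBINING GRAVE ACCENT..COMBINING LATIN SMALL LETTER X
0531..0556    ; Armenian # L&  [38] ARMENIAN CAPITAL LETTER AYB..ARMENIAN CAPITAL LETTER FEH
"""

# ISO 15924 codes that Unicode maps to special script names (from the text).
SPECIAL_CODES = {"Zyyy": "Common", "Zinh": "Inherited", "Zzzz": "Unknown"}

LINE = re.compile(r"^([0-9A-F]{4,6})(?:\.\.([0-9A-F]{4,6}))?\s*;\s*(\w+)")


def parse_scripts(text):
    """Return a sorted list of (start, end, script) tuples."""
    ranges = []
    for line in text.splitlines():
        match = LINE.match(line)
        if match:  # comment and blank lines do not match
            start = int(match.group(1), 16)
            end = int(match.group(2) or match.group(1), 16)
            ranges.append((start, end, match.group(3)))
    ranges.sort()
    return ranges


def script_of(ch, ranges):
    """Look up the script of one character; unlisted code points are 'Unknown'."""
    cp = ord(ch)
    starts = [start for start, _, _ in ranges]
    i = bisect.bisect_right(starts, cp) - 1  # last range starting at or before cp
    if i >= 0 and ranges[i][1] >= cp:
        return ranges[i][2]
    return SPECIAL_CODES["Zzzz"]


if __name__ == "__main__":
    table = parse_scripts(SAMPLE_SCRIPTS_TXT)
    for ch in ["A", " ", "\u0301", "\u0531", "\u0378"]:
        print(f"U+{ord(ch):04X} -> {script_of(ch, table)}")
```

Because the excerpt is deliberately tiny, any character outside the five listed ranges is reported as "Unknown" here, even if the full file assigns it to a script.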
 
== Missing scripts in Unicode ==
The Missing Scripts project, with contributors from the [[Mainz University of Applied Sciences]], the Atelier national de recherche typographique (ANRT) in [[Nancy, France|Nancy]], and the [[University of California, Berkeley]], has compiled a list of 131 scripts that have not yet been encoded in ''The Unicode Standard'', out of a total of 294 recognized scripts according to the current state of research.<ref>{{Cite web |title=The World's Writing Systems |url=https://www.worldswritingsystems.org/ |access-date=2024-10-04 |website=www.worldswritingsystems.org}}</ref>
{{norefs|section|date=April 2024}}
With each new version of Unicode, new writing systems are added to the international character code. According to a statement by linguist Dr Deborah Anderson of UC Berkeley, there are over 100 writing systems that have not yet been included in Unicode.
 
According to a list of the project Missing Scripts by the University of Applied Sciences Mainz, Germany, the ANRT Nancy, France and UC Berkeley, USA, there are 294 known writing systems of mankind according to the current state of research (January 2022). 131 of them have not yet been encoded in Unicode, i.e. cannot yet be used on a computer or mobile phone.
 
==See also==
Line 57 ⟶ 53:
 
==External links==
* [https://linguisticssei.berkeley.edu/sei/index.html Script Encoding Initiative], A project at UC Berkeley, USA, working to get more scripts included in the Unicode standard.
* [https://www.worldswritingsystems.org The World’s Writing Systems], an overview of all 294 known writing systems, each with a typographic reference glyph and its Unicode status.