Script (Unicode)

In Unicode, a script is a collection of letters and other written signs used to represent textual information in one or more writing systems.^[1] Some scripts support one and only one writing system and language, for example Armenian. Other scripts support many different writing systems: the Latin script, for example, supports the alphabets of English, French, German, Italian, Vietnamese, Latin and many others. Some languages have also been written in more than one script. Turkish, for example, used the Arabic script before the 20th century and transitioned to the Latin script in the early part of the 20th century. For a list of languages supported by each script, see the list of languages by writing system.

When multiple languages make use of the same script, there are frequently some differences, particularly in diacritics and other marks. For example, Swedish and English both use the Latin script. However, Swedish includes the character ‘å’ (sometimes called a “Swedish O”), while English has no such character, nor does English use the diacritic combining ring above (U+030A) with any letter. In general, languages sharing the same script share many of the same characters. Despite these peripheral differences, the Swedish and English writing systems are both said to use the Latin script. The Unicode abstraction of scripts is therefore a basic organizing technique: the differences between alphabets or writing systems remain, and are supported through Unicode’s flexible scripts, combining marks and collation algorithms.
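The relationship between ‘å’ and its diacritic can be inspected with Python's standard `unicodedata` module. The following is a minimal sketch, not part of the Unicode specification: it decomposes the precomposed letter (canonical decomposition, NFD) and lists the names of the resulting code points. The helper name `decompose` is an assumption made for this illustration.

```python
import unicodedata

def decompose(ch):
    """Return the Unicode names of the code points in the NFD form of ch."""
    return [unicodedata.name(c) for c in unicodedata.normalize("NFD", ch)]

swedish_a_ring = "\u00e5"  # å as a single precomposed code point
parts = decompose(swedish_a_ring)
print(parts)  # ['LATIN SMALL LETTER A', 'COMBINING RING ABOVE']
```

The base letter is an ordinary Latin letter that English also uses; only the combining mark is specific to writing systems such as Swedish.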

While every character has a script property, many characters, such as symbols, have the value “common” or “inherited” for that property. The unified diacritical characters and unified punctuation characters frequently have the “common” or “inherited” script property. However, individual scripts often have their own punctuation and diacritics. As a result, many scripts include not only letters, but also diacritics and other marks, punctuation, numerals and even their own idiosyncratic symbols and space characters.

Unicode 5.2 includes 90 modern and historic scripts supporting hundreds or even thousands of languages throughout the world. Many more scripts are being worked on, as indicated by the Unicode roadmap.

Writing system

Writing system is sometimes treated as a synonym for script. However, the term can also refer to the specific, concrete writing system supported by a script. For example, the Vietnamese writing system is supported by the Latin script. A writing system may also cover more than one script: the Japanese writing system, for example, makes use of the Han, Hiragana and Katakana scripts.
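The Japanese case can be illustrated with a short Python sketch. Python's standard library does not expose the Unicode Script property, so the hypothetical helper `rough_script_label` below falls back on a heuristic: it takes the first word of each character's name. This is an assumption that happens to work for the sample text and should not be mistaken for a real script lookup.

```python
import unicodedata
from collections import Counter

def rough_script_label(ch):
    """Heuristic script hint taken from the first word of the character name.

    This is NOT the Unicode Script property; it is only adequate for the
    simple sample below.
    """
    first_word = unicodedata.name(ch).split()[0]
    # Han ideographs are named "CJK UNIFIED IDEOGRAPH-XXXX".
    return "Han" if first_word == "CJK" else first_word.capitalize()

sample = "日本語のテキスト"  # Han + Hiragana + Katakana in one short phrase
counts = Counter(rough_script_label(ch) for ch in sample)
print(counts)
```

A single phrase in one writing system thus draws on three different scripts.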

Most writing systems can be broadly divided into several categories: logographic, syllabic, alphabetic (or segmental), abugida, abjad and featural; however, all features of any of these may be found in any given writing system in varying proportions, often making it difficult to purely categorize a system. The term complex system is sometimes used to describe those where the admixture makes classification problematic.

Type of writing system	What each symbol represents	Example
Logographic	morpheme	Chinese characters
Syllabic	syllable	Japanese kana
Alphabetic	phoneme (consonant or vowel)	Latin alphabet
Abugida	phoneme (consonant+vowel)	Indian Devanāgarī
Abjad	phoneme (consonant)	Arabic alphabet
Featural	phonetic feature	Korean hangul

Unicode supports all of these types of writing systems through its numerous scripts. Unicode also adds further properties to characters to help differentiate the various characters and the ways they behave within Unicode text processing algorithms.

Table of Unicode scripts

The following table lists the 90 scripts that are defined in Unicode 5.2.^[2]

Unicode script name	Relevant Wikipedia article(s)	ISO 15924 code^[3]	Number of characters (as of Unicode 5.2)	Version of Unicode first encoded
Common		Zyyy	5,395
Inherited		Qaai	523
Arabic	Arabic alphabet	Arab	1,030	1.0
Armenian	Armenian alphabet	Armn	90	1.0
Avestan	Avestan alphabet	Avst	61	5.2
Balinese	Balinese script	Bali	121	5.0
Bamum	Bamum language	Bamu	88	5.2
Bengali	Bengali script	Beng	92	1.0
Bopomofo	Zhuyin	Bopo	65	1.0
Braille	Braille	Brai	256	3.0
Buginese	Lontara script	Bugi	30	4.1
Buhid	Buhid script	Buhd	20	3.2
Canadian Aboriginal	Canadian Aboriginal syllabics	Cans	710	3.0
Carian	Carian script	Cari	49	5.1
Cham	Cham alphabet	Cham	83	5.1
Cherokee	Cherokee syllabary	Cher	85	3.0
Coptic	Coptic alphabet	Copt	135	1.0 (disunified from Greek in 4.1)
Cuneiform	Cuneiform script	Xsux	982	5.0
Cypriot	Cypriot syllabary	Cprt	55	4.0
Cyrillic	Cyrillic alphabet	Cyrl	404	1.0
Deseret	Deseret alphabet	Dsrt	80	3.1
Devanagari	Devanagari script	Deva	140	1.0
Egyptian Hieroglyphs	Egyptian hieroglyphs	Egyp	1,071	5.2
Ethiopic	Ge'ez alphabet	Ethi	461	3.0
Georgian	Georgian alphabet	Geor	120	1.0
Glagolitic	Glagolitic alphabet	Glag	94	4.1
Gothic	Gothic alphabet	Goth	27	3.1
Greek	Greek alphabet	Grek	511	1.0
Gujarati	Gujarati script	Gujr	83	1.0
Gurmukhi	Gurmukhi script	Guru	79	1.0
Han	Chinese character, Kanji, Hanja, Hán tự	Hani	75,738	1.0
Hangul	Hangul	Hang	11,737	1.0 (Hangul syllables relocated in 2.0)
Hanunoo	Hanunó'o script	Hano	21	3.2
Hebrew	Hebrew alphabet	Hebr	133	1.0
Hiragana	Hiragana	Hira	90	1.0
Imperial Aramaic	Aramaic language	Armi	31	5.2
Inscriptional Pahlavi	Pahlavi scripts	Phli	27	5.2
Inscriptional Parthian	Parthian language	Prti	30	5.2
Javanese	Javanese script	Java	91	5.2
Kaithi	Kaithi	Kthi	66	5.2
Kannada	Kannada script	Knda	84	1.0
Katakana	Katakana	Kana	299	1.0
Kayah Li	Kayah Li script	Kali	48	5.1
Kharoshthi	Kharoṣṭhī	Khar	65	4.1
Khmer	Khmer script	Khmr	146	3.0
Lao	Lao script	Laoo	65	1.0
Latin	Latin alphabet	Latn	1,244	1.0
Lepcha	Lepcha script	Lepc	74	5.1
Limbu	Limbu script	Limb	66	4.0
Linear B	Linear B	Linb	211	4.0
Lisu	Fraser alphabet	Lisu	48	5.2
Lycian	Lycian script	Lyci	29	5.1
Lydian	Lydian script	Lydi	27	5.1
Malayalam	Malayalam script	Mlym	95	1.0
Meetei Mayek	Meitei Mayek script	Mtei	56	5.2
Mongolian	Mongolian script, Clear script, Manchu alphabet	Mong	153	3.0
Myanmar	Burmese script	Mymr	188	3.0
N'Ko	N'Ko	Nkoo	59	5.0
New Tai Lue	New Tai Lue	Talu	83	4.1
Ogham	Ogham	Ogam	29	3.0
Ol Chiki	Ol Chiki script	Olck	48	5.1
Old Italic	Old Italic alphabet	Ital	35	3.1
Old Persian	Old Persian cuneiform script	Xpeo	50	4.1
Old South Arabian	South Arabian alphabet	Sarb	32	5.2
Old Turkic	Old Turkic script	Orkh	73	5.2
Oriya	Oriya script	Orya	84	1.0
Osmanya	Osmanya script	Osma	40	4.0
Phags-pa	'Phags-pa script	Phag	56	5.0
Phoenician	Phoenician alphabet	Phnx	29	5.0
Rejang	Rejang script	Rjng	37	5.1
Runic	Runic alphabet	Runr	78	3.0
Samaritan	Samaritan script	Samr	61	5.2
Saurashtra	Saurashtra script	Saur	81	5.1
Shavian	Shavian alphabet	Shaw	48	4.0
Sinhala	Sinhala script	Sinh	80	3.0
Sundanese	Sundanese script	Sund	55	5.1
Syloti Nagri	Sylheti Nagari	Sylo	44	4.1
Syriac	Syriac alphabet	Syrc	77	3.0
Tagalog	Baybayin	Tglg	20	3.2
Tagbanwa	Tagbanwa script	Tagb	18	3.2
Tai Le	Tai Nüa language	Tale	35	4.0
Tai Tham	Tai Tham script	Lana	127	5.2
Tai Viet	Tai Viet script	Tavt	72	5.2
Tamil	Tamil script	Taml	72	1.0
Telugu	Telugu script	Telu	93	1.0
Thaana	Tāna	Thaa	50	3.0
Thai	Thai alphabet	Thai	86	1.0
Tibetan	Tibetan script	Tibt	201	1.0 (removed in 1.1 and reintroduced in 2.0)
Tifinagh	Tifinagh	Tfng	55	4.1
Ugaritic	Ugaritic alphabet	Ugar	31	4.0
Vai	Vai syllabary	Vaii	300	5.1
Yi	Yi script	Yiii	1,220	3.0
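The script assignments summarized in this table are published in the Unicode Character Database file Scripts.txt, whose data lines have the form `XXXX[..YYYY] ; Script # comment`. The sketch below parses lines in that format and looks up the script of a character. The `SAMPLE` text is a tiny illustrative excerpt written for this example, not the real file, and the names `parse_scripts` and `script_of` are assumptions.

```python
from bisect import bisect_right

SAMPLE = """\
# Illustrative excerpt in the Scripts.txt line format (not the full file).
0041..005A    ; Latin
0061..007A    ; Latin
0308          ; Inherited
0391..03A1    ; Greek
0410..044F    ; Cyrillic
"""

def parse_scripts(text):
    """Parse 'XXXX[..YYYY] ; Script # comment' lines into sorted ranges."""
    ranges = []
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not data:
            continue
        span, script = (field.strip() for field in data.split(";"))
        first, _, last = span.partition("..")
        # A single code point is stored as a range of length one.
        ranges.append((int(first, 16), int(last or first, 16), script))
    return sorted(ranges)

def script_of(ch, ranges, default="Unknown"):
    """Find the script of one character by binary search over the ranges."""
    cp = ord(ch)
    # Index of the last range whose first code point is <= cp.
    i = bisect_right(ranges, (cp, 0x110000, "")) - 1
    if i >= 0 and ranges[i][0] <= cp <= ranges[i][1]:
        return ranges[i][2]
    return default

RANGES = parse_scripts(SAMPLE)
print(script_of("A", RANGES), script_of("\u0435", RANGES))  # Latin Cyrillic
```

Because the sample covers only a few ranges, most characters fall through to the default here; with the complete file the same lookup would cover every assigned character.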

Common and inherited scripts

Unicode assigns every character in the UCS to a single script only. However, many characters may be used in more than one script, either because they are not part of a formal natural language writing system or because they are unified across many writing systems. Examples include most symbols (music notation, currency signs, etc.), as well as some numerals and many punctuation marks. Unicode defines these characters as belonging to the common script.

In addition, many diacritics and non-spacing combining characters may be applied to characters from more than one script. Unicode assigns these to the inherited script, which means that they take the same script class as the base character with which they combine, so in different contexts they may be treated as belonging to different scripts. For example, U+0308 Combining Diaeresis may combine with either U+0065 Latin Small Letter E (ë) or U+0435 Cyrillic Small Letter IE (ё). In the former case it inherits the Latin script of the preceding base character, whereas in the latter case it inherits the Cyrillic script of the preceding base character.
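The inheritance rule can be sketched in a few lines of Python. The `SCRIPT` table below is a hypothetical, hand-written mapping that covers only the three characters of the example; a real implementation would load the full script data instead. The function name `resolve_inherited` is likewise an assumption.

```python
# Hypothetical, hand-written script table covering only this example.
SCRIPT = {
    "\u0065": "Latin",      # e, Latin small letter e
    "\u0435": "Cyrillic",   # е, Cyrillic small letter ie
    "\u0308": "Inherited",  # combining diaeresis
}

def resolve_inherited(text, table=SCRIPT):
    """Replace 'Inherited' with the script of the preceding base character."""
    resolved = []
    current = "Unknown"  # script of the most recent non-inherited character
    for ch in text:
        script = table.get(ch, "Unknown")
        if script == "Inherited":
            script = current   # take over the base character's script
        else:
            current = script   # remember this base for following marks
        resolved.append(script)
    return resolved

latin = resolve_inherited("e\u0308")          # ë as base + combining mark
cyrillic = resolve_inherited("\u0435\u0308")  # ё as base + combining mark
print(latin, cyrillic)  # ['Latin', 'Latin'] ['Cyrillic', 'Cyrillic']
```

The same code point, U+0308, is reported as Latin in the first string and as Cyrillic in the second.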

Character categories within scripts

Unicode provides a general category property for each character, so in addition to belonging to a script, every character also has a general category. Scripts typically include letter characters: uppercase letters, lowercase letters and modifier letters. A few precomposed ligatures, such as ǲ (U+01F2), are categorized as titlecase letters. Such titlecase ligatures are all in the Latin and Greek scripts and are all compatibility characters, and Unicode therefore discourages their use by authors. It is unlikely that new titlecase letters will be added in the future.

Most writing systems do not differentiate between uppercase and lowercase letters. For those scripts, all letters are categorized as “other letter” or “modifier letter”. Ideographs such as Unihan ideographs are also categorized as “other letter”. A few scripts do differentiate between uppercase and lowercase, however: Latin, Cyrillic, Greek, Armenian, Georgian, and Deseret. Even in these scripts there are some letters that are neither uppercase nor lowercase.

Scripts can also contain characters of any other general category, such as marks (diacritic and otherwise), numbers (numerals), punctuation, separators (word separators such as spaces), symbols and non-graphical format characters. These are included in a particular script when they are unique to that script. Other such characters are generally unified and included in the punctuation or diacritic blocks. However, the bulk of characters in any script (other than the common and inherited scripts) are letters.
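The general category of a character can be queried with the `unicodedata.category` function in Python's standard library, which returns the two-letter abbreviation used by Unicode. The sample characters below were chosen for this illustration, and the helper name `categories` is an assumption.

```python
import unicodedata

def categories(text):
    """Return the two-letter general category of each character in text."""
    return [unicodedata.category(ch) for ch in text]

letters = categories("Aa\u01f2\u02b0\u4e2d")
# A = Lu (uppercase), a = Ll (lowercase), ǲ = Lt (titlecase),
# ʰ = Lm (modifier letter), 中 = Lo (other letter)

others = categories("\u0308\u0663\u055e ")
# U+0308 combining diaeresis = Mn (nonspacing mark),
# U+0663 Arabic-Indic digit three = Nd (decimal number),
# U+055E Armenian question mark = Po (other punctuation),
# space = Zs (space separator)

print(letters, others)
```

The Arabic-Indic digit and the Armenian question mark are examples of numerals and punctuation that belong to a particular script rather than to the unified blocks.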

References

[1] Glossary of Unicode Terms

[2] Unicode Character Database : Scripts

[3] ISO 15924 Registration Authority
