Script (Unicode)

This is an old revision of this page, as edited by BabelStone (talk | contribs) at 17:59, 1 October 2009 (Updating for Unicode 5.2 which has just been released). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

In Unicode, a script is a collection of letters and other written signs used to represent textual information in one or more writing systems.[1] Some scripts support one and only one writing system and language, for example Armenian. Other scripts support many different writing systems: the Latin script, for example, supports the alphabets of English, French, German, Italian, Vietnamese, Latin and many other languages. Some languages have also been written in more than one script over time. Turkish, for example, used the Arabic script before the 20th century and transitioned to the Latin script in the early part of the 20th century. For a list of languages supported by each script, see the list of languages by writing system.

When multiple languages make use of the same script, there are frequently some differences, particularly in diacritics and other marks. For example, Swedish and English both use the Latin script. However, Swedish includes the character ‘å’ (sometimes called a “Swedish O”), while English has no such character, nor does English use the diacritic combining ring above (U+030A) for any character. In general, languages sharing the same script share many of the same characters. Despite these peripheral differences, the Swedish and English writing systems are said to use the same Latin script. The Unicode abstraction of scripts is thus a basic organizing technique: the differences between alphabets or writing systems remain, and are supported through Unicode’s flexible scripts, combining marks and collation algorithms.
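The relationship between ‘å’ and its combining mark can be inspected with Python's standard `unicodedata` module. The following is a minimal sketch; the variable names are illustrative and do not come from the Unicode Standard.

```python
import unicodedata

precomposed = "\u00e5"  # å stored as a single code point

# Canonical decomposition (NFD) splits å into a base letter plus a combining mark.
decomposed = unicodedata.normalize("NFD", precomposed)
parts = [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in decomposed]

# Canonical composition (NFC) reassembles the pair into the single code point.
recomposed = unicodedata.normalize("NFC", decomposed)

print(parts)
print(recomposed == precomposed)
```

The decomposed form consists of the Latin letter a followed by U+030A, whose formal character name is COMBINING RING ABOVE.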

Every character has a script property, but for many characters, such as symbols, the value of that property is “common” or “inherited”. The unified diacritical characters and unified punctuation characters frequently have one of these two values. However, individual scripts often have their own punctuation and diacritics, so many scripts include not only letters but also diacritics and other marks, punctuation, numerals and even their own idiosyncratic symbols and space characters.

Unicode 5.2 includes 90 modern and historic scripts supporting hundreds or even thousands of languages throughout the world. The Unicode Consortium is actively working on many more, as indicated by its roadmap.

Writing system

Writing system is sometimes treated as a synonym for script. However, it can also refer to the specific concrete writing system supported by a script. For example, the Vietnamese writing system is supported by the Latin script. A writing system may also use more than one script: the Japanese writing system, for example, makes use of the Han, Hiragana and Katakana scripts.
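As a rough illustration of one writing system drawing on several scripts, the sketch below guesses a script label from the first word of each character's name. Python's standard library does not expose the Unicode script property, so the `NAME_PREFIX_TO_SCRIPT` table and the `scripts_used` function are hypothetical stand-ins for it and are only adequate for simple samples like these.

```python
import unicodedata

# Hypothetical mapping from the first word of a character name to a script label.
# This is a heuristic for illustration, not the Unicode script property itself.
NAME_PREFIX_TO_SCRIPT = {
    "CJK": "Han",
    "HIRAGANA": "Hiragana",
    "KATAKANA": "Katakana",
    "LATIN": "Latin",
}

def scripts_used(text):
    """Return the set of script labels guessed for the characters in text."""
    found = set()
    for ch in text:
        prefix = unicodedata.name(ch, "").split(" ")[0]
        # Anything the table does not recognize is lumped together as "Common".
        found.add(NAME_PREFIX_TO_SCRIPT.get(prefix, "Common"))
    return found

print(sorted(scripts_used("日本語のテキスト")))  # a Japanese sample
print(sorted(scripts_used("Vi\u1ec7t")))          # a Vietnamese sample
```

The Japanese sample mixes three scripts, while the Vietnamese sample uses only Latin. Production code would normally read script values from the Unicode Character Database rather than from character names.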

Most writing systems can be broadly divided into several categories: logographic, syllabic, alphabetic (or segmental), abugida, abjad and featural. However, features of any of these may be found in any given writing system in varying proportions, often making it difficult to place a system in a single category. The term complex system is sometimes used to describe those where the admixture makes classification problematic.

Type of writing system | What each symbol represents | Example
Logographic | morpheme | Chinese characters
Syllabic | syllable | Japanese kana
Alphabetic | phoneme (consonant or vowel) | Latin alphabet
Abugida | phoneme (consonant+vowel) | Indian Devanāgarī
Abjad | phoneme (consonant) | Arabic alphabet
Featural | phonetic feature | Korean hangul

See also: phonemic and phonetic orthography.

Unicode supports all of these types of writing systems through its numerous scripts. Unicode also adds further properties to characters to help differentiate the various characters and the ways they behave within Unicode text processing algorithms.

Table of Unicode scripts

The following table lists the 90 scripts that are defined in Unicode 5.2, preceded by the special Common and Inherited script values.[2]

Unicode script name | Relevant Wikipedia article(s) | ISO 15924 code[3] | Number of characters (as of Unicode 5.2) | Version of Unicode first encoded
Common | | Zyyy | 5,395 |
Inherited | | Qaai | 523 |
Arabic | Arabic alphabet | Arab | 1,030 | 1.0
Armenian | Armenian alphabet | Armn | 90 | 1.0
Avestan | Avestan alphabet | Avst | 61 | 5.2
Balinese | Balinese script | Bali | 121 | 5.0
Bamum | Bamum language | Bamu | 88 | 5.2
Bengali | Bengali script | Beng | 92 | 1.0
Bopomofo | Zhuyin | Bopo | 65 | 1.0
Braille | Braille | Brai | 256 | 3.0
Buginese | Lontara script | Bugi | 30 | 4.1
Buhid | Buhid script | Buhd | 20 | 3.2
Canadian Aboriginal | Canadian Aboriginal syllabics | Cans | 710 | 3.0
Carian | Carian script | Cari | 49 | 5.1
Cham | Cham alphabet | Cham | 83 | 5.1
Cherokee | Cherokee syllabary | Cher | 85 | 3.0
Coptic | Coptic alphabet | Copt | 135 | 1.0 (disunified from Greek in 4.1)
Cuneiform | Cuneiform script | Xsux | 982 | 5.0
Cypriot | Cypriot syllabary | Cprt | 55 | 4.0
Cyrillic | Cyrillic alphabet | Cyrl | 404 | 1.0
Deseret | Deseret alphabet | Dsrt | 80 | 3.1
Devanagari | Devanagari script | Deva | 140 | 1.0
Egyptian Hieroglyphs | Egyptian hieroglyphs | Egyp | 1,071 | 5.2
Ethiopic | Ge'ez alphabet | Ethi | 461 | 3.0
Georgian | Georgian alphabet | Geor | 120 | 1.0
Glagolitic | Glagolitic alphabet | Glag | 94 | 4.1
Gothic | Gothic alphabet | Goth | 27 | 3.1
Greek | Greek alphabet | Grek | 511 | 1.0
Gujarati | Gujarati script | Gujr | 83 | 1.0
Gurmukhi | Gurmukhi script | Guru | 79 | 1.0
Han | Chinese character, Kanji, Hanja, Hán tự | Hani | 75,738 | 1.0
Hangul | Hangul | Hang | 11,737 | 1.0 (Hangul syllables relocated in 2.0)
Hanunoo | Hanunó'o script | Hano | 21 | 3.2
Hebrew | Hebrew alphabet | Hebr | 133 | 1.0
Hiragana | Hiragana | Hira | 90 | 1.0
Imperial Aramaic | Aramaic language | Armi | 31 | 5.2
Inscriptional Pahlavi | Pahlavi scripts | Phli | 27 | 5.2
Inscriptional Parthian | Parthian language | Prti | 30 | 5.2
Javanese | Javanese script | Java | 91 | 5.2
Kaithi | Kaithi | Kthi | 66 | 5.2
Kannada | Kannada script | Knda | 84 | 1.0
Katakana | Katakana | Kana | 299 | 1.0
Kayah Li | Kayah Li script | Kali | 48 | 5.1
Kharoshthi | Kharoṣṭhī | Khar | 65 | 4.1
Khmer | Khmer script | Khmr | 146 | 3.0
Lao | Lao script | Laoo | 65 | 1.0
Latin | Latin alphabet | Latn | 1,244 | 1.0
Lepcha | Lepcha script | Lepc | 74 | 5.1
Limbu | Limbu script | Limb | 66 | 4.0
Linear B | Linear B | Linb | 211 | 4.0
Lisu | Fraser alphabet | Lisu | 48 | 5.2
Lycian | Lycian script | Lyci | 29 | 5.1
Lydian | Lydian script | Lydi | 27 | 5.1
Malayalam | Malayalam script | Mlym | 95 | 1.0
Meetei Mayek | Meitei Mayek script | Mtei | 56 | 5.2
Mongolian | Mongolian script, Clear script, Manchu alphabet | Mong | 153 | 3.0
Myanmar | Burmese script | Mymr | 188 | 3.0
N'Ko | N'Ko | Nkoo | 59 | 5.0
New Tai Lue | New Tai Lue | Talu | 83 | 4.1
Ogham | Ogham | Ogam | 29 | 3.0
Ol Chiki | Ol Chiki script | Olck | 48 | 5.1
Old Italic | Old Italic alphabet | Ital | 35 | 3.1
Old Persian | Old Persian cuneiform script | Xpeo | 50 | 4.1
Old South Arabian | South Arabian alphabet | Sarb | 32 | 5.2
Old Turkic | Old Turkic script | Orkh | 73 | 5.2
Oriya | Oriya script | Orya | 84 | 1.0
Osmanya | Osmanya script | Osma | 40 | 4.0
Phags-pa | 'Phags-pa script | Phag | 56 | 5.0
Phoenician | Phoenician alphabet | Phnx | 29 | 5.0
Rejang | Rejang script | Rjng | 37 | 5.1
Runic | Runic alphabet | Runr | 78 | 3.0
Samaritan | Samaritan script | Samr | 61 | 5.2
Saurashtra | Saurashtra script | Saur | 81 | 5.1
Shavian | Shavian alphabet | Shaw | 48 | 4.0
Sinhala | Sinhala script | Sinh | 80 | 3.0
Sundanese | Sundanese script | Sund | 55 | 5.1
Syloti Nagri | Sylheti Nagari | Sylo | 44 | 4.1
Syriac | Syriac alphabet | Syrc | 77 | 3.0
Tagalog | Baybayin | Tglg | 20 | 3.2
Tagbanwa | Tagbanwa script | Tagb | 18 | 3.2
Tai Le | Tai Nüa language | Tale | 35 | 4.0
Tai Tham | Tai Tham script | Lana | 127 | 5.2
Tai Viet | Tai Viet script | Tavt | 72 | 5.2
Tamil | Tamil script | Taml | 72 | 1.0
Telugu | Telugu script | Telu | 93 | 1.0
Thaana | Tāna | Thaa | 50 | 3.0
Thai | Thai alphabet | Thai | 86 | 1.0
Tibetan | Tibetan script | Tibt | 201 | 1.0 (removed in 1.1 and reintroduced in 2.0)
Tifinagh | Tifinagh | Tfng | 55 | 4.1
Ugaritic | Ugaritic alphabet | Ugar | 31 | 4.0
Vai | Vai syllabary | Vaii | 300 | 5.1
Yi | Yi script | Yiii | 1,220 | 3.0

Common and inherited scripts

Unicode assigns every character in the UCS to a single script only. However, many characters — those that are not part of a formal natural language writing system or are unified across many writing systems (e.g. most symbols including music notation, currency signs, etc., as well as some numerals and many punctuation marks) — may be used in more than one script. In these cases Unicode defines them as belonging to the common script.

In addition, many diacritics and non-spacing combining characters may be applied to characters from more than one script. Unicode assigns these to the inherited script, which means that they take the script class of the base character with which they combine, so in different contexts they may be treated as belonging to different scripts. For example, U+0308 Combining Diaeresis may combine with either U+0065 Latin Small Letter E (ë) or U+0435 Cyrillic Small Letter IE (ё). In the former case it inherits the Latin script of the preceding base character; in the latter case it inherits the Cyrillic script.
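The inheritance rule can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a full implementation: `base_script` and `resolve_inherited` are hypothetical helpers, only the Combining Diacritical Marks range U+0300..U+036F is treated as inherited, and the script of a base character is guessed from its character name because the standard library does not expose the script property.

```python
import unicodedata

def base_script(ch):
    """Guess a script label; a simplified stand-in for the Unicode script property."""
    # Assumption: only U+0300..U+036F is treated as Inherited in this sketch.
    if 0x0300 <= ord(ch) <= 0x036F:
        return "Inherited"
    prefix = unicodedata.name(ch, "").split(" ")[0]
    return {"LATIN": "Latin", "CYRILLIC": "Cyrillic"}.get(prefix, "Common")

def resolve_inherited(text):
    """Replace each Inherited value with the script of the preceding character."""
    resolved = []
    previous = "Common"  # fallback when a combining mark has no base character
    for ch in text:
        script = base_script(ch)
        if script == "Inherited":
            script = previous
        resolved.append(script)
        previous = script
    return resolved

print(resolve_inherited("e\u0308"))       # Latin e + combining diaeresis
print(resolve_inherited("\u0435\u0308"))  # Cyrillic ie + combining diaeresis
```

The same code point U+0308 resolves to Latin in the first string and to Cyrillic in the second, because it takes its script from the base character before it.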

Character categories within scripts

Unicode provides a general category property for each character, so in addition to belonging to a script, every character also has a general category. Scripts typically include letter characters: uppercase letters, lowercase letters and modifier letters. A few precomposed ligatures, such as ǲ (U+01F2), are categorized as titlecase letters. Such titlecase ligatures are all in the Latin and Greek scripts and are all compatibility characters, so Unicode discourages their use by authors. It is unlikely that new titlecase letters will be added in the future.

Most writing systems do not differentiate between uppercase and lowercase letters. For those scripts, all letters are categorized as “other letter” or “modifier letter”. Ideographs such as Unihan ideographs are also categorized as “other letter”. A few scripts do differentiate between uppercase and lowercase, however: Latin, Cyrillic, Greek, Armenian, Georgian, and Deseret. Even in these scripts there are some letters that are neither uppercase nor lowercase.
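The letter categories described above can be looked up with `unicodedata.category` from Python's standard library, which returns the two-letter general category abbreviation. In this short sketch the sample characters and their descriptions are chosen for illustration only.

```python
import unicodedata

# Sample characters chosen for illustration, one per letter category.
samples = {
    "A": "uppercase letter",
    "a": "lowercase letter",
    "\u01f2": "titlecase letter (U+01F2)",
    "\u02b0": "modifier letter (U+02B0)",
    "\u4e2d": "Han ideograph (U+4E2D)",
}

# Look up the general category abbreviation of each sample character.
categories = {ch: unicodedata.category(ch) for ch in samples}

for ch, description in samples.items():
    print(f"U+{ord(ch):04X} {categories[ch]} {description}")
```

The abbreviations Lu, Ll, Lt, Lm and Lo stand for uppercase, lowercase, titlecase, modifier and other letter respectively.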

Scripts can also contain characters of any other general category, such as marks (diacritic and otherwise), numbers (numerals), punctuation, separators (word separators such as spaces), symbols and non-graphical format characters. These are included in a particular script when they are unique to that script. Other such characters are generally unified and included in the punctuation or diacritic blocks. However, the bulk of characters in any script (other than the common and inherited scripts) are letters.

See also

References