Japanese language and computers
 
==Character encodings==
There are several standard methods to [[character encoding|encode]] Japanese characters for use on a computer, including [[JIS encoding|JIS]], [[Shift-JIS]], [[Extended Unix Code|EUC]], and [[Unicode]]. While mapping the set of [[kana]] is a simple matter, [[kanji]] has proven more difficult. Despite efforts, none of the encoding schemes became the de facto standard, and multiple encoding standards were in use by the 2000s. As of 2017, the usage share of [[UTF-8]] on the Internet had expanded to over 90% worldwide, while Shift-JIS and EUC-JP accounted for the remaining 1.2%; however, a few popular websites, including [[2channel]] and [[kakaku.com]], still use Shift-JIS.<ref>{{Cite web|url=https://internet.watch.impress.co.jp/docs/yajiuma/1086378.html|title=【やじうまWatch】 ウェブサイトにおける文字コードの割合、UTF-8が90%超え。Shift_JISやEUC-JPは? - INTERNET Watch|date=2017-10-17|website=INTERNET Watch|access-date=2019-05-11}}</ref>
 
For example, most Japanese [[email]]s were in [[ISO-2022-JP]] ("JIS encoding") and [[web page]]s in [[Shift-JIS]], while mobile phones in Japan usually used some form of [[Extended Unix Code]]. If a program fails to determine the encoding scheme employed, it can cause {{Nihongo3|"misconverted garbled/garbage characters"|文字化け|''[[mojibake]]''|literally "transformed characters"}} and thus unreadable text on computers.
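The mismatch that produces mojibake can be sketched with Python's built-in codecs (a minimal illustration, not taken from the article's sources; the sample string is arbitrary):

```python
# Mojibake: bytes written under one encoding, read under another.
text = "文字化け"                     # "mojibake" written in kanji and kana
raw = text.encode("shift_jis")        # encode under Shift-JIS

ok = raw.decode("shift_jis")          # the correct codec round-trips cleanly
# Decoding the same bytes as EUC-JP garbles the text; errors="replace"
# substitutes U+FFFD for byte sequences that are invalid in EUC-JP.
bad = raw.decode("euc_jp", errors="replace")

print(ok)        # 文字化け
print(ok == text, bad == text)
```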
[[File:PC-9801F Kanji ROM board.jpg|thumb|Kanji [[Read-only memory|ROM]] card installed in a [[PC-9800 series|PC-98]], which stored about 3,000 glyphs and enabled them to be displayed quickly. It also had [[Random-access memory|RAM]] to store gaiji.]]
[[File:Control panel of public background music system.jpg|thumb|Display of background music system using [[half-width kana]]]]
The first encoding to become widely used was [[JIS X 0201]], a [[ISO 646|single-byte encoding]] that covers only the standard 7-bit ASCII characters plus [[Half-width kana|half-width katakana]] extensions. It was widely used in systems that had neither the processing power nor the storage to handle kanji (including old embedded equipment such as cash registers), because kanji input required a complicated process, and kanji display demanded considerable memory and high resolution. This means that only [[katakana]], not kanji, was supported by this technique. Some embedded displays still have this limitation.
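The single-byte nature of half-width katakana can be demonstrated with Python's standard codecs (an illustrative sketch; Shift JIS preserves the JIS X 0201 katakana range, so the `shift_jis` codec shows it):

```python
# JIS X 0201 assigns half-width katakana to single bytes 0xA1-0xDF.
# Shift JIS keeps that range intact, so each of these encodes to one byte.
for ch in "ｱｲｳ":                       # half-width katakana a, i, u
    b = ch.encode("shift_jis")
    print(ch, hex(b[0]), len(b))       # ｱ 0xb1 1 / ｲ 0xb2 1 / ｳ 0xb3 1

# A kanji, by contrast, needs two bytes and has no JIS X 0201 code at all:
print(len("漢".encode("shift_jis")))   # 2
```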
 
The development of kanji encodings was the beginning of the split. [[Shift JIS]] supports kanji and was developed to be completely backward compatible with [[JIS X 0201]], and is thus found in much embedded electronic equipment.
 
However, [[Shift JIS]] has the unfortunate property that it often breaks any parser (software that reads the coded text) that is not specifically designed to handle it. For example, a text search method can get false hits if it is not designed for Shift JIS. [[Extended Unix Code|EUC]], on the other hand, is handled much better by parsers that have been written for 7-bit ASCII (and thus [[Extended Unix Code|EUC]] encodings are used on UNIX, where much of the file-handling code was historically only written for English encodings). But EUC is not backwards compatible with JIS X 0201, the first main Japanese encoding. Further complications arise because the original Internet e-mail standards only support 7-bit transfer protocols. Thus {{IETF RFC|1468}} ("[[ISO-2022-JP]]", often simply called [[JIS encoding]]) was developed for sending and receiving e-mails.
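The parser-breaking property can be made concrete with Python's codecs (an illustrative sketch; the kanji chosen is a well-known case, not one cited by the article):

```python
# In Shift JIS, the second byte of a two-byte character may fall in the
# ASCII range. The kanji 表 (U+8868) encodes as 0x95 0x5C, and 0x5C is
# the ASCII backslash, so a naive byte-oriented parser that treats every
# 0x5C as an escape character misreads the text.
raw = "表".encode("shift_jis")
print(raw)                          # b'\x95\\'
print(0x5C in raw)                  # True: looks like "\" to an ASCII parser

# EUC-JP keeps every byte of a multi-byte character above 0x7F, which is
# why ASCII-oriented parsers handle it more gracefully:
print(all(b > 0x7F for b in "表".encode("euc_jp")))  # True
```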
[[File:Japanese TV closed caption using gaiji.jpg|thumb|[[Gaiji]] used in the closed captions of Japanese TV broadcasting]]
In [[character set]] standards such as [[JIS X 0208|JIS]], not all required characters are included, so [[gaiji]] ({{lang|ja|外字}} "external characters") are sometimes used to supplement the character set. Gaiji may come in the form of external font packs, where normal characters have been replaced with new characters, or the new characters have been added to unused character positions. However, gaiji are not practical in [[Internet]] environments, since the font set must be transferred along with the text in order to use them. As a result, such characters are replaced with similar or simpler characters, or the text may need to be encoded using a larger character set (such as Unicode) that supports the required character.
 
[[Unicode]] was intended to solve all encoding problems over all languages. The [[UTF-8]] encoding used to encode Unicode in web pages does not have the disadvantages that Shift-JIS has. Unicode is supported by international software, and it eliminates the need for gaiji. There are still controversies, however. For Japanese, the kanji characters have been [[Han unification|unified]] with Chinese; that is, a character considered to be the same in both Japanese and Chinese is given a single number, even if the appearance is actually somewhat different, with the precise appearance left to a locale-appropriate font. This process, called [[Han unification]], has caused controversy. The earlier encodings in Japan, the [[Free area of the Republic of China|Taiwan Area]], [[Mainland China]] and [[Korea]] each handled only one language, whereas Unicode should handle all of them. The handling of kanji/Chinese characters has, however, been designed by a committee composed of representatives from all four countries/areas. As of 2011, the use of Unicode was growing slowly because it was better supported by software from outside Japan, but most Japanese web pages still used Shift-JIS. The [[Japanese Wikipedia]] uses Unicode.
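Why UTF-8 avoids the Shift-JIS pitfall can be shown with Python's codecs (an illustrative sketch of the byte-level property, not drawn from the article's sources):

```python
# In UTF-8, every byte of a multi-byte character has its high bit set,
# so no byte of a kanji can be mistaken for an ASCII character such as
# the backslash (0x5C) that trips up naive Shift JIS parsers.
for ch in "表文字":
    raw = ch.encode("utf-8")
    print(ch, raw.hex(), all(b >= 0x80 for b in raw))  # all True

# Plain ASCII encodes to the same bytes, so UTF-8 remains backward
# compatible with ASCII-only software:
print("abc".encode("utf-8") == b"abc")  # True
```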
 
== Text input ==