Code page 932 (Microsoft Windows): Difference between revisions

{{Short description|Windows character set for Japanese}}
{{About|Microsoft's Code Page 932 and IBM's Code Page 943|IBM's Code Page 932|Code page 932 (IBM)}}
{{Redirect|Windows-31J|the operating system version|Windows 3.1J}}
{{Infobox character encoding
| name = Windows Code page 932
| mime = Windows-31J
| alias = CP943C
| standard = [[WHATWG Encoding Standard]] (as "Shift_JIS")<ref name="encoding_rs">{{cite web |url=https://docs.rs/encoding_rs/latest/encoding_rs/#notable-differences-from-iana-naming |title=Notable Differences from IANA Naming |work=Crate encoding_rs |publisher=docs.rs |author=Mozilla Foundation |author-link=Mozilla Foundation}}</ref>
| lang = [[Japanese language|Japanese]]
| status =
| prev =
| next =
| classification = [[Extended ASCII]],{{efn|Not in the strictest sense of the term, as ASCII bytes can appear as trail bytes.}} [[variable-width encoding]], [[CJK characters|CJK encoding]]
| extra = <div style="text-align: left;">{{notelist}}</div>
}}
 
'''Microsoft Windows code page 932''' (abbreviated '''MS932''',<ref>{{cite web | url=https://www.w3.org/Bugs/Public/show_bug.cgi?id=27851 | title=Bug 27851 - Add MS932 as a label of Shift_JIS | work=w3.org Bug Tracker | author=Sivonen, Henri}}</ref><ref name="icuwindows31j" /> '''Windows-932'''<ref name="icuwindows31j">{{cite web | url=http://demo.icu-project.org/icu-bin/convexp?conv=ibm-943_P15A-2003&s=UTR22&s=IBM&s=WINDOWS&s=JAVA&s=IANA&s=MIME&s=- | title=Converter Explorer: ibm-943_P15A-2003 (alias windows-31j) | work=International Components for Unicode: ICU Demonstration}}</ref> or ambiguously '''CP932'''<ref>{{cite web|url=https://www.debian.org/doc/manuals/debian-reference/ch11.en.html|title=Chapter 11. Data conversion|work=Debian Reference|last=Aoki|first=Osamu|publisher=Debian}}</ref>), also called '''Windows-31J''' amongst other names (see [[#Terminology|§ Terminology]] below), is the [[Microsoft Windows]] [[code page]] for the [[Japanese language]], which is an extended variant of the [[Shift JIS]] Japanese [[character encoding]]. It contains standard 7-bit [[ASCII]] codes, and Japanese characters are indicated by the high bit of the first byte being set to 1. Some code points in this page require a second byte, so characters use either 8 or 16 bits for encoding.
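
The one-or-two-byte structure described above can be sketched with Python's standard <code>cp932</code> codec. This is only an illustration: the sample string is arbitrary, and the codec is Python's implementation of this code page rather than Microsoft's own converter.

```python
# Encode a mixed ASCII/Japanese string with Python's built-in cp932 codec.
text = "Aあ"
encoded = text.encode("cp932")

# "A" stays a single ASCII byte; "あ" takes two bytes,
# and the first of them has its high bit set.
for ch in text:
    raw = ch.encode("cp932")
    print(ch, raw.hex(), len(raw) * 8, "bits")
```

Note that only the first byte of a two-byte code is guaranteed to have the high bit set; the second byte may fall in the ASCII range.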
 
IBM offer the same extended double-byte codes in their '''[[code page]] 943''' ('''IBM-943''' or '''CP943'''),<ref name="ibm932v943">{{cite web | url=https://www.ibm.com/support/knowledgecenter/en/ssw_aix_71/com.ibm.aix.nlsgdrf/ibm-943_ibm-932.htm | title=IBM-943 and IBM-932 | publisher=IBM | work=IBM Knowledge Center}}</ref> which is a combination of the single-byte [[Code page 897]] and the double-byte '''Code page 941'''.<ref name="ibm943">{{cite web | url=http://www-01.ibm.com/software/globalization/ccsid/ccsid943.html | title=Coded character set identifiers - CCSID 943 | publisher=IBM | work=IBM Globalization | archive-url=https://web.archive.org/web/20160315110642/http://www-01.ibm.com/software/globalization/ccsid/ccsid943.html | archive-date=2016-03-15}}</ref>
 
Windows-31J is the most used non-[[UTF-8]]/Unicode Japanese encoding on the web. However, many people and software packages, including Microsoft libraries,<ref name="msdnlabels"/> declare the {{nowrap|[[Shift JIS]]}} encoding for Windows-31J data, although it includes some additional characters, and some of the existing characters are mapped to [[Unicode]] differently. This has led the WHATWG HTML standard to treat the encoding labels {{code|shift_jis}} and {{code|windows-31j}} interchangeably, and use the Windows variant for its "Shift_JIS" encoder and decoder.<ref name="encoding_rs"/><!-- Per W3C / WHATWG standards, the labels Shift_JIS and Windows-31J are treated the same; the W3C/WHATWG spec uses the Shift JIS name, but its definition actually matches Windows-31J (not JIS X 0208 Appendix 1). -->
 
== Terminology ==
Microsoft's Shift JIS variant is known simply as "Code page 932" on Microsoft Windows; however, this is ambiguous, as [[IBM-932|IBM's code page 932]], while also a Shift JIS variant, lacks the NEC and NEC-selected double-byte vendor extensions which are present in Microsoft's variant (although both include the IBM extensions) and preserves the 1978 ordering of JIS X 0208.<ref name="ibm932v943" />
 
IBM's code page 943 (or "IBM-943") includes the same double-byte codes as Windows code page 932.<ref name="ibm932v943" /> Microsoft's version corresponds closely to the encoding referred to as '''ibm-943_P15A-2003''' (with aliases including '''CP943C''' and '''Windows-932''')<ref name="icuwindows31j" /> in [[International Components for Unicode]] (ICU). There is also a second ICU encoding named '''ibm-943_P130-1999''',<ref name="icuibm943" /> whose single-byte mappings more closely match IBM's code page definitions. (See [[#Single-byte character differences|§ Single-byte character differences]] below for details.)
 
Windows code page 932 is registered with the [[Internet Assigned Numbers Authority|IANA]] as '''Windows-31J'''.<ref name="iana31j">{{cite web | url=https://www.iana.org/assignments/character-sets/character-sets.xhtml | publisher=IANA | title=Character Sets}}</ref> The "Windows-31J" label is IANA's and not recognized by Microsoft, which has historically used "shift_jis" instead.<ref name="msdnlabels">{{cite web|url=https://msdn.microsoft.com/en-us/library/system.text.encoding.windowscodepage(v=vs.110).aspx |title=Encoding.WindowsCodePage Property - .NET Framework (current version) |work=MSDN |publisher=Microsoft}}</ref> The [[W3C]]/[[WHATWG]] encoding standard used by [[HTML5]] treats the label "'''shift_jis'''" interchangeably with "windows-31j" with the intent of being "compatible with deployed content"<ref>{{cite web | url=https://encoding.spec.whatwg.org/#names-and-labels | title=4.2. Names and labels | publisher=WHATWG | work=Encoding Standard |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref> and matches Windows code page 932<ref name="encoding_rs"/> (including the "formerly proprietary extensions from IBM and NEC").<ref>{{cite web | url=https://encoding.spec.whatwg.org/#index-jis0208 | title=5. Indexes (§ Index jis0208) | publisher=WHATWG | work=Encoding Standard |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref>
 
Windows code page 932 is also called '''MS_Kanji''',<ref name="icuwindows31j" /><ref name="python">{{cite web | url=https://docs.python.org/3.6/library/codecs.html#standard-encodings | title=7.2.3. Standard Encodings | publisher=Python Software Foundation | work=Python 3.6 Documentation | access-date=19 September 2017}}</ref> although IANA treat MS_Kanji as an alias for standard Shift JIS.<ref name="iana31j"/> [[Python (programming language)|Python]], for example, uses the label <code>MS-Kanji</code> (or <code>cp932</code>) for Windows-932 and the label <code>Shift_JIS</code> (or <code>sjis</code>) for JIS X 0208-defined Shift JIS, without recognising the <code>Windows-31J</code> label.<ref name="python" />
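
The Python labels mentioned above can be checked against the interpreter's codec registry. The following is a minimal sketch using <code>codecs.lookup</code> from the standard library. The statement about <code>Windows-31J</code> is based on the Python 3.6 documentation, so the sketch probes that label instead of assuming how a given Python version treats it.

```python
import codecs

# Labels that Python's codec registry resolves to each codec.
windows_labels = ["cp932", "ms-kanji"]
jis_labels = ["shift_jis", "sjis"]

resolved = {label: codecs.lookup(label).name
            for label in windows_labels + jis_labels}
print(resolved)

# Whether "windows-31j" resolves may depend on the Python version,
# so probe it rather than assuming either outcome.
try:
    print("windows-31j ->", codecs.lookup("windows-31j").name)
except LookupError:
    print("windows-31j is not a known label in this Python version")
```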
 
In Japanese editions of Windows, this code page is [[Windows code page#ANSI code page|referred to as "ANSI"]], since it is the operating system's default 8-bit encoding, even though [[ANSI]] was not involved in its definition.
 
== Differences from standard Shift JIS ==
Windows-31J is often mistaken for standard Shift JIS (as defined in [[JIS X 0208]]:1997 Appendix 1): while similar, the distinction is significant for computer programmers wishing to avoid [[mojibake]].
 
=== Double-byte character differences ===
[[File:Euler diag for jp charsets.svg|thumb|[[Euler diagram]] comparing repertoires of [[JIS X 0208]], [[JIS X 0212]], [[JIS X 0213]], Windows-31J, the Microsoft standard repertoire and [[Unicode]] ]]
In addition to the standard [[JIS X 0201]]:1997 and [[JIS X 0208]]:1997 characters, Windows-31J includes several JIS X 0208 extensions, namely "[[JIS X 0208#0x2D|NEC special characters]] (Row 13), NEC selection of IBM extensions (Rows 89 to 92), and IBM extensions (Rows 115 to 119)",<ref name="iana31j" /> in addition to setting some encoding space aside for [[Private Use Areas#Private-use characters in other character sets|end user definition]].<ref>{{cite web | url=http://archives.miloush.net/michkap/archive/2007/05/26/2901371.html | title=The PUA outside of Unicode | author=Kaplan, Michael S | work=Sorting it all out | date=2007-05-26}}</ref> This also differs from [[Code page 932 (IBM)|IBM-932]], which does not include the NEC extensions or NEC selection.<ref name="ibm932v943"/>
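
To relate the row numbers quoted above to byte values, the usual Shift JIS row/cell transformation can be applied. The sketch below is illustrative only: <code>kuten_to_sjis</code> is a hypothetical helper name, and it assumes the standard transformation simply continues past row 94 for the extension rows.

```python
def kuten_to_sjis(row: int, cell: int) -> bytes:
    """Convert a 1-based row/cell pair to Shift JIS bytes (hypothetical helper)."""
    if not (1 <= row <= 120 and 1 <= cell <= 94):
        raise ValueError("row must be 1-120 and cell 1-94")
    # Two rows share each lead byte; lead bytes skip the 0xA0-0xDF range.
    lead = (row + 1) // 2 + (0x80 if row <= 62 else 0xC0)
    if row % 2:
        # Odd rows use trail bytes 0x40-0x9E, skipping 0x7F.
        trail = cell + 0x3F + (1 if cell >= 64 else 0)
    else:
        # Even rows use trail bytes 0x9F-0xFC.
        trail = cell + 0x9E
    return bytes([lead, trail])


# First and last rows of each extension block named in the IANA description.
for row in (13, 89, 92, 115, 119):
    first = kuten_to_sjis(row, 1)
    print(f"row {row}: lead byte 0x{first[0]:02X}")
```

Under these assumptions, the NEC special characters use lead byte 0x87, the NEC selection uses 0xED and 0xEE, and the IBM extensions use 0xFA to 0xFC.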
 
The IBM extensions were designed to encode characters from the [[Japanese language in EBCDIC#Double-byte codes|IBM Japanese DBCS-Host]] repertoire which were initially absent in JIS X 0208; the [[because sign|'because' sign]] ∵ and [[not sign|'not' sign]] ¬ were later added to JIS X 0208 itself in 1983, and Microsoft includes them at extension locations as well as their 1983 locations.<ref name="lundeE">{{citation|mode=cs1 |title=Appendix E: Vendor Character Set Standards |work=CJKV Information Processing: Chinese, Japanese, Korean & Vietnamese Computing |last=Lunde |first=Ken |author-link=Ken Lunde |year=2009 |edition=2nd |publisher=[[O'Reilly Media|O'Reilly]] |___location=[[Sebastopol, CA]] |isbn=978-0-596-51447-1 |url=https://resources.oreilly.com/examples/9780596514471/blob/master/cjkvip2e-appE.pdf}}</ref> The NEC extensions also encode the entirety of the IBM repertoire, but in a separate extension within the 94×94 JIS X 0208 grid (in rows 89–92, besides the characters already included in [[JIS X 0208#0x2D|NEC row 13]]), rather than using Shift JIS codes beyond the JIS X 0208 range; Windows code page 932 includes these 388 characters in both locations.<ref name="lundeE"/> As a result, the 'because' and 'not' signs are encoded three times.
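
The triple encoding can be observed with Python's <code>cp932</code> codec. The byte values below are assumptions taken from Microsoft's published CP932 mapping table rather than from the text above, and the sketch only demonstrates decoding.

```python
# 'Because' sign: JIS X 0208-1983 location, NEC row 13, IBM extension.
because_codes = [b"\x81\xe6", b"\x87\x9a", b"\xfa\x5b"]
# 'Not' sign: JIS X 0208-1983 location, NEC selection, IBM extension.
not_codes = [b"\x81\xca", b"\xee\xf9", b"\xfa\x54"]

for codes in (because_codes, not_codes):
    # Each group of three byte sequences should collapse to one character.
    decoded = {raw.decode("cp932") for raw in codes}
    print([raw.hex() for raw in codes], "->", decoded)

# Standard Shift JIS has no NEC row 13, so Python's shift_jis codec rejects it.
try:
    b"\x87\x9a".decode("shift_jis")
    nec_row_in_standard = True
except UnicodeDecodeError:
    nec_row_in_standard = False
print("NEC row 13 decodable as shift_jis:", nec_row_in_standard)
```

Because three byte sequences decode to one character, a decode followed by an encode cannot preserve all of them.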
 
Some of these representations were subsequently used for different characters by [[JIS X 0213]] and [[Shift JIS-2004]]. For example, compare row 89 in JIS X 0213 (beginning 硃, 硎, 硏…)<ref>{{cite web | url=https://www.itscj.ipsj.or.jp/iso-ir/233.pdf | title=233: Japanese Graphic Character Set for Information Interchange, Plane 1 | publisher=IPSJ | date=2004-04-13}}</ref> to row 89 as used by JIS X 0208 with IBM/NEC extensions (beginning 纊, 褜, 鍈…).<ref>{{cite web | url=https://encoding.spec.whatwg.org/jis0208.html | title=Index jis0208 visualization | publisher=WHATWG | work=Encoding Standard |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref> Consequently, Shift JIS-2004 is not compatible with Windows-31J.
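
The incompatibility can be illustrated by decoding the same bytes under both encodings. This sketch assumes that 0xED40 is the Shift JIS form of row 89, cell 1, and relies on Python's <code>cp932</code> and <code>shift_jis_2004</code> codecs as stand-ins for the two encodings.

```python
raw = b"\xed\x40"  # assumed Shift JIS form of row 89, cell 1

as_windows = raw.decode("cp932")            # NEC selection of IBM extensions
as_sjis2004 = raw.decode("shift_jis_2004")  # JIS X 0213 plane 1

print(f"cp932:          {as_windows} (U+{ord(as_windows):04X})")
print(f"shift_jis_2004: {as_sjis2004} (U+{ord(as_sjis2004):04X})")
```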
 
In addition to the above, Microsoft uses different (but visually similar) Unicode mappings for several double-byte punctuation characters compared to standard Shift JIS, such as the [[wave dash]] being [[Tilde#Unicode and Shift JIS encoding of wave dash|mapped to U+FF5E]] rather than U+301C,<ref name="w3cjpprof">{{cite web | url = https://www.w3.org/TR/japanese-xml/#ambiguity_of_yen | title = Ambiguities in conversion from Shift-JIS to Unicode (Non-Normative) | work = XML Japanese Profile | publisher=W3C}}</ref> which is followed by ibm-943_P15A-2003<ref>{{cite web | url=http://demo.icu-project.org/icu-bin/convexp?conv=ibm-943_P15A-2003&b=81&s=ALL#layout | title=Converter Explorer: ibm-943_P15A-2003: start byte 0x81 | publisher=International Components for Unicode | work=ICU Demonstration}}</ref> but not ibm-943_P130-1999,<ref>{{cite web | url=http://demo.icu-project.org/icu-bin/convexp?conv=ibm-943_P130-1999&b=81&s=ALL#layout | title=Converter Explorer: ibm-943_P130-1999: start byte 0x81 | publisher=International Components for Unicode | work=ICU Demonstration}}</ref> and a different mapping for the double-byte backslash.<ref name="w3cjpprof" />
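
The wave dash difference is a common source of mojibake and can be reproduced as follows. The byte value 0x8160 for the double-byte wave dash is an assumption drawn from the published mapping tables, and Python's <code>shift_jis</code> codec stands in for standard Shift JIS.

```python
raw = b"\x81\x60"  # double-byte wave dash (assumed byte value)

windows_char = raw.decode("cp932")
standard_char = raw.decode("shift_jis")

print(f"cp932:     U+{ord(windows_char):04X}")
print(f"shift_jis: U+{ord(standard_char):04X}")
```

Text decoded with one codec and re-encoded with the other may therefore fail or change, even though the two characters look alike.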
 
=== Single-byte character differences ===
Windows-932 includes standard 7-bit [[ASCII]] mappings for single-byte sequences with the high bit set to 0. Hence, codes 0x5C and 0x7E are mapped to Unicode as U+005C REVERSE SOLIDUS (<code>\</code>, the [[backslash]]) and U+007E [[tilde|TILDE]] (<code>~</code>) respectively,<ref name="msmapping">{{cite web | url=https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT | title=CP932.TXT | publisher=Unicode Consortium}}</ref><ref name="msrefrender">{{cite web | url=https://msdn.microsoft.com/en-us/library/cc194889.aspx | title=Lead byte NULL — Code page 932 | publisher=Microsoft}}</ref><ref name="w3cjpprof"/> as they are in ASCII ([[ISO 646|ISO-646]]-US). This is likewise done by the W3C/WHATWG encoding standard.<ref>{{cite web | url=https://encoding.spec.whatwg.org/#shift_jis-decoder | title=12.3.1. Shift_JIS decoder | publisher=WHATWG | work=Encoding Standard | quotation=If byte is an ASCII byte or 0x80, return a code point whose value is byte. |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref> By contrast, 0x5C is mapped to U+00A5 [[Yen sign|YEN SIGN]] (<code>¥</code>) in [[Code page 895|ISO-646-JP]] and consequently [[JIS X 0201]], of which standard [[Shift JIS]] is an extension. Correspondingly, Windows-31J avoids duplicate encoding of the backslash by mapping the double byte 0x815F to U+FF3C FULLWIDTH REVERSE SOLIDUS, whereas standard Shift JIS maps it to U+005C.<ref name="w3cjpprof" />
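
The Windows-31J side of these mappings can be checked with Python's <code>cp932</code> codec, as in the short sketch below. It deliberately makes no claim about Python's <code>shift_jis</code> codec, which may not follow the strict JIS X 0201 mapping for these bytes.

```python
# Single bytes 0x5C and 0x7E decode as their ASCII counterparts.
single = bytes([0x5C, 0x7E]).decode("cp932")

# The double-byte code 0x815F decodes to the fullwidth backslash,
# so it does not collide with the single-byte one.
double = b"\x81\x5f".decode("cp932")

print([f"U+{ord(ch):04X}" for ch in single + double])
```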
 
However, 0x5C in Windows-932 is nonetheless considered a Yen sign in certain contexts.<ref name="kaplan">{{cite web | title=When is a backslash not a backslash? | date=2005-09-17 | author=Kaplan, Michael S. | url=http://archives.miloush.net/michkap/archive/2005/09/17/469941.html | work=Sorting it all out}}</ref> For this reason, in many Japanese fonts, U+005C is displayed as a Yen symbol, which would normally be represented as U+00A5, rather than as a backslash per Unicode's suggested rendering. U+00A5 is one-way best-fit mapped onto 0x5C in Windows-932. However, code 0x5C in Windows-932 behaves as a reverse solidus (backslash) in all respects (e.g. in [[filename|file paths]] on Windows systems) other than how it is displayed by some fonts,<ref name="kaplan" /> and Microsoft's documentation for Windows-932 displays 0x5C as a backslash.<ref name="msrefrender" /> This mapping<ref name="msmapping" /> corresponds to the encoding named "ibm-943_P15A-2003" in [[International Components for Unicode]] (ICU),<ref name="icuwindows31j" /> except for minor reordering of a few [[C0 control characters]].
 
[[Code page 437|IBM-943]], like [[Code page 932 (IBM)|IBM-932]],<ref name="ibm932v943"/> is a superset of the single-byte [[Code page 897]],<ref name="ibm943"/> which maps 0x5C to the Yen symbol (<code>¥</code>) and 0x7E to the overline (<code>‾</code>);<ref name="cp00897txt">{{cite web | url=https://public.dhe.ibm.com/software/globalization/gcoc/attachments/CP00897.txt | title=CP00897.txt | publisher=IBM}}</ref> this is followed by the encoding named "ibm-943_P130-1999" in ICU.<ref name="icuibm943">{{cite web | url=http://demo.icu-project.org/icu-bin/convexp?conv=ibm-943 | work=International Components for Unicode: ICU Demonstration | title=Converter Explorer: ibm-943_P130-1999}}</ref> Code page 897 (and therefore also IBM-943 and IBM-932) also adds single-byte box-drawing characters replacing certain [[C0 control characters]];<ref name="cp00897txt" /> however, these may still be treated as control characters depending on the context,<ref>{{cite web | url=http://www-01.ibm.com/software/globalization/cp/cp00897.html | title=Code page identifiers - CP 00897 | publisher=IBM | work=IBM Globalization | url-status=dead | archive-url=https://web.archive.org/web/20160317053427/http://www-01.ibm.com/software/globalization/cp/cp00897.html | archive-date=2016-03-17}}</ref> and are mapped to control characters in ICU.<ref name="icuibm943" />
 
==Layout==
 
==See also==
*[[LMBCS-16]]
*[[Code page 942]]
 
==References==
==External links==
 
=== Microsoft related ===
*[https://web.archive.org/web/20180405210602/http://msdn.microsoft.com/en-us/library/cc194887.aspx Microsoft's Reference for Windows Code Page 932]
*[https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit932.txt Code page file for MS932]
*[https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT Mapping of Microsoft's Code Page 932 to Unicode]
*[http://demo.icu-project.org/icu-bin/convexp?conv=windows-31j ICU Code Page 943C (ibm-943_P15A-2003 alias windows-31j) demonstration]
 
=== IBM related ===
*[https://web.archive.org/web/20160315110642/http://www-01.ibm.com/software/globalization/ccsid/ccsid943.html IBM's documentation of Code Page 943]
*[http://demo.icu-project.org/icu-bin/convexp?conv=ibm-943 ICU Code Page 943 (ibm-943_P130-1999) demonstration]
*[https://raw.githubusercontent.com/unicode-org/icu/master/icu4c/source/data/mappings/ibm-943_P130-1999.ucm ICU mapping for ibm-943_P130-1999 to Unicode]
{{character encoding}}
 
[[Category:Character sets|932]]
[[Category:Windows code pages|932]]
[[Category:Encodings of Japanese]]