Code page 932 (Microsoft Windows): Difference between revisions

m IBM related: Updated URL.
Copied (comment) from Shift JIS, i.e. the most popular Japanese encoding on the web! There's a catch, but it may also apply outside of the web (I just wouldn't know). Also the most popular multi-byte encoding after Chinese "GB2312" (which isn't what it seems to be either).
Line 12:
| prev =
| next =
| classification = [[Extended ASCII]],{{efn|Not in the strictest sense of the term, as ASCII bytes can appear as trail bytes.}} [[variable-width encoding]], [[CJK|CJK encoding]]
| extra = <div style="text-align: left;">{{notelist}}</div>
}}
Line 20:
IBM offer the same extended double-byte codes in their '''[[code page]] 943''' ('''IBM-943''' or '''CP943'''),<ref name="ibm932v943">{{cite web | url=https://www.ibm.com/support/knowledgecenter/en/ssw_aix_71/com.ibm.aix.nlsgdrf/ibm-943_ibm-932.htm | title=IBM-943 and IBM-932 | publisher=IBM | work=IBM Knowledge Center}}</ref> which is a combination of the single-byte [[Code page 897]] and the double-byte '''Code page 941'''.<ref name="ibm943">{{cite web | url=http://www-01.ibm.com/software/globalization/ccsid/ccsid943.html | title=Coded character set identifiers - CCSID 943 | publisher=IBM | work=IBM Globalization | archive-url=https://web.archive.org/web/20160315110642/http://www-01.ibm.com/software/globalization/ccsid/ccsid943.html | archive-date=2016-03-15}}</ref>
 
Windows-31J is the most used non-[[UTF-8]]/Unicode Japanese encoding on the web. {{nowrap|[[Shift JIS]]}} is actually declared far more often, but the W3C/WHATWG HTML standards treat the two labels as the same encoding: the standards use the Shift JIS name, yet define it to decode as Windows-31J. See the {{nowrap|[[Shift JIS]]}} page for statistics.<!-- Per W3C / WHATWG standards, the labels Shift_JIS and Windows-31J are treated the same; the W3C/WHATWG spec uses the Shift JIS name, but its definition actually matches Windows-31J (not JIS X 0208 Appendix 1). -->
== Terminology ==
 
Microsoft's Shift JIS variant is known simply as "Code page 932" on Microsoft Windows. This name is ambiguous, however: [[IBM-932|IBM's code page 932]] is also a Shift JIS variant, but it lacks the NEC and NEC-selected double-byte vendor extensions that are present in Microsoft's variant (both include the IBM extensions), and it preserves the 1978 ordering of JIS X 0208.<ref name="ibm932v943" />
 
Line 32 ⟶ 33:
In Japanese editions of Windows, this code page is [[Windows code page#ANSI code page|referred to as "ANSI"]], since it is the operating system's default 8-bit encoding, even though [[ANSI]] was not involved in its definition.
 
== Differences from standard Shift JIS ==
 
Windows-31J is often mistaken for standard Shift JIS (as defined in [[JIS X 0208]]:1997 Appendix 1): while similar, the distinction is significant for computer programmers wishing to avoid [[mojibake]].
 
=== Double-byte character differences ===
[[File:Euler diag for jp charsets.svg|thumb|[[Euler diagram]] comparing repertoires of [[JIS X 0208]], [[JIS X 0212]], [[JIS X 0213]], Windows-31J, the Microsoft standard repertoire and [[Unicode]]. ]]
In addition to the standard [[JIS X 0201]]:1997 and [[JIS X 0208]]:1997 characters, Windows-31J includes several JIS X 0208 extensions, namely "[[JIS X 0208#0x2D|NEC special characters]] (Row 13), NEC selection of IBM extensions (Rows 89 to 92), and IBM extensions (Rows 115 to 119)",<ref name="iana31j" /> in addition to setting some encoding space aside for [[Private Use Areas#Private-use characters in other character sets|end user definition]].<ref>{{cite web | url=http://archives.miloush.net/michkap/archive/2007/05/26/2901371.html | title=The PUA outside of Unicode | author=Kaplan, Michael S | work=Sorting it all out | date=2007-05-26}}</ref> This also differs from [[Code page 932 (IBM)|IBM-932]], which does not include the NEC extensions or NEC selection.<ref name="ibm932v943"/>
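
The row numbers quoted above can be related to byte values with a short sketch. The helper <code>kuten_to_sjis</code> below is hypothetical (it is not part of any library): it applies the usual Shift JIS row/cell arithmetic and assumes that the same arithmetic simply continues past row 94 for the extension rows. Python's built-in <code>cp932</code> codec is used here as a stand-in for Microsoft's mapping table.

```python
def kuten_to_sjis(ku: int, ten: int) -> bytes:
    """Convert a row/cell (ku/ten) pair to its two-byte Shift JIS form.

    Hypothetical helper for illustration; rows above 94 are assumed to
    continue the standard arithmetic, as the extension rows do.
    """
    if not (1 <= ku <= 120 and 1 <= ten <= 94):
        raise ValueError("ku must be 1..120 and ten 1..94")
    # Two rows share one lead byte; lead bytes skip 0xA0..0xDF,
    # which are single-byte half-width katakana.
    lead = (ku - 1) // 2 + (0x81 if ku <= 62 else 0xC1)
    if ku % 2 == 1:
        # Odd rows use trail bytes 0x40..0x7E and 0x80..0x9E (0x7F is skipped).
        trail = ten + (0x3F if ten <= 63 else 0x40)
    else:
        # Even rows use trail bytes 0x9F..0xFC.
        trail = ten + 0x9E
    return bytes([lead, trail])


# First cell of the first row of each extension block named in the text.
for label, ku in [
    ("NEC special characters", 13),
    ("NEC selection of IBM extensions", 89),
    ("IBM extensions", 115),
]:
    raw = kuten_to_sjis(ku, 1)
    print(f"{label}: row {ku} cell 1 = 0x{raw.hex().upper()} -> {raw.decode('cp932')}")
```

Under these assumptions, row 13 starts at lead byte 0x87, row 89 at 0xED and row 115 at 0xFA, so a decoder that only knows JIS X 0208 rows 1 to 84 has no mapping for any of these byte pairs.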
 
Line 44:
In addition to the above, Microsoft uses different (but visually similar) Unicode mappings for several double-byte punctuation characters compared to standard Shift JIS. For example, the [[wave dash]] is [[Tilde#Unicode and Shift JIS encoding of wave dash|mapped to U+FF5E]] rather than U+301C,<ref name="w3cjpprof">{{cite web | url = https://www.w3.org/TR/japanese-xml/#ambiguity_of_yen | title = Ambiguities in conversion from Shift-JIS to Unicode (Non-Normative) | work = XML Japanese Profile | publisher=W3C}}</ref> a choice that is followed by ibm-943_P15A-2003<ref>{{cite web | url=http://demo.icu-project.org/icu-bin/convexp?conv=ibm-943_P15A-2003&b=81&s=ALL#layout | title=Converter Explorer: ibm-943_P15A-2003: start byte 0x81 | publisher=International Components for Unicode | work=ICU Demonstration}}</ref> but not by ibm-943_P130-1999.<ref>{{cite web | url=http://demo.icu-project.org/icu-bin/convexp?conv=ibm-943_P130-1999&b=81&s=ALL#layout | title=Converter Explorer: ibm-943_P130-1999: start byte 0x81 | publisher=International Components for Unicode | work=ICU Demonstration}}</ref> Microsoft also uses a different mapping for the double-byte backslash.<ref name="w3cjpprof" />
 
=== Single-byte character differences ===
 
Windows-31J includes standard 7-bit [[ASCII]] mappings for single-byte sequences with the high bit set to 0. Hence, codes 0x5C and 0x7E are mapped to Unicode as U+005C REVERSE SOLIDUS (<code>\</code>, the [[backslash]]) and U+007E [[tilde|TILDE]] (<code>~</code>) respectively,<ref name="msmapping">{{cite web | url=https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT | title=CP932.TXT | publisher=Unicode Consortium}}</ref><ref name="msrefrender">{{cite web | url=https://msdn.microsoft.com/en-us/library/cc194889.aspx | title=Lead byte NULL — Code page 932 | publisher=Microsoft}}</ref><ref name="w3cjpprof"/> as they are in ASCII ([[ISO 646|ISO-646]]-US). This is likewise done by the W3C/WHATWG encoding standard.<ref>{{cite web | url=https://encoding.spec.whatwg.org/#shift_jis-decoder | title=12.3.1. Shift_JIS decoder | publisher=WHATWG | work=Encoding Standard | quotation=If byte is an ASCII byte or 0x80, return a code point whose value is byte. |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref> By contrast, 0x5C is mapped to U+00A5 [[Yen sign|YEN SIGN]] (<code>¥</code>) in [[Code page 895|ISO-646-JP]] and consequently [[JIS X 0201]], of which standard [[Shift JIS]] is an extension. Correspondingly, Windows-31J avoids duplicate encoding of the backslash by mapping the double-byte code 0x815F to U+FF3C FULLWIDTH REVERSE SOLIDUS, whereas standard Shift JIS maps it to U+005C.<ref name="w3cjpprof" />
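
The Microsoft side of these mappings can be checked with a small sketch, assuming that Python's built-in <code>cp932</code> codec tracks the CP932.TXT table cited above. The helper <code>describe</code> is a hypothetical name used only for this illustration, and the standard Shift JIS side is deliberately not shown.

```python
import unicodedata


def describe(data: bytes, encoding: str = "cp932") -> list[str]:
    """Decode data and return 'U+XXXX NAME' for each resulting character."""
    return [
        f"U+{ord(ch):04X} {unicodedata.name(ch)}"
        for ch in data.decode(encoding)
    ]


# Single bytes with the high bit clear decode as plain ASCII.
print(describe(b"\x5c"))
print(describe(b"\x7e"))

# The double-byte backslash gets its own fullwidth code point,
# so the two byte sequences round-trip without colliding.
print(describe(b"\x81\x5f"))
print("\\".encode("cp932"), "\uff3c".encode("cp932"))
```

Because 0x5C and 0x815F decode to distinct code points, a round trip through Unicode preserves which of the two forms the original bytes used.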
 
Line 56 ⟶ 55:
 
==See also==
* [[LMBCS-16]]
 
==References==
Line 63 ⟶ 62:
==External links==
 
=== Microsoft related ===
*[https://web.archive.org/web/20180405210602/http://msdn.microsoft.com/en-us/library/cc194887.aspx Microsoft's Reference for Windows Code Page 932]
*[https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit932.txt Code page file for MS932]
Line 69 ⟶ 68:
*[http://demo.icu-project.org/icu-bin/convexp?conv=windows-31j ICU Code Page 943C (ibm-943_P15A-2003 alias windows-31j) demonstration]
 
=== IBM related ===
*[https://web.archive.org/web/20160315110642/http://www-01.ibm.com/software/globalization/ccsid/ccsid943.html IBM's documentation of Code Page 943]
*[http://demo.icu-project.org/icu-bin/convexp?conv=ibm-943 ICU Code Page 943 (ibm-943_P130-1999) demonstration]