Code page 932 (Microsoft Windows): Difference between revisions

Revision as of 16:13, 8 March 2018 by HarJIT (no edit summary)
Latest revision as of 13:38, 14 August 2025 by 2a0e:1d47:9098:3800:2d3f:2be2:c623:63a5 (edit summary: →Double-byte character differences: quote 'because' and 'not' where they're literals)
(42 intermediate revisions by 13 users not shown)
{{Short description\|Windows character set for Japanese}}
{{About\|Microsoft's Code Page 932 and IBM's Code Page 943\|IBM's Code Page 932\|Code page 932 (IBM)}}
{{Redirect\|Windows-31J\|the operating system version\|Windows 3.1J}}
{{Infobox character encoding \| name = Windows Code page 932 \| mime = Windows-31J \| alias = CP943C \| standard = [[WHATWG Encoding Standard]] (as "Shift_JIS")<ref name="encoding_rs">{{cite web \|url=https://docs.rs/encoding_rs/latest/encoding_rs/#notable-differences-from-iana-naming \|title=Notable Differences from IANA Naming \|work=Crate encoding_rs \|publisher=docs.rs \|author=Mozilla Foundation \|author-link=Mozilla Foundation}}</ref> \| lang = [[Japanese language\|Japanese]] \| status = \| prev = \| next = \| classification = [[Extended ASCII]],{{efn\|Not in the strictest sense of the term, as ASCII bytes can appear as trail bytes.}} [[variable-width encoding]], [[CJK characters\|CJK encoding]] \| extra = <div style="text-align: left;">{{notelist}}</div> }}
'''Microsoft Windows code page 932''' (abbreviated '''MS932''',<ref>{{cite web \| url=https://www.w3.org/Bugs/Public/show_bug.cgi?id=27851 \| title=Bug 27851 - Add MS932 as a label of Shift_JIS \| work=w3.org Bug Tracker \| author=Sivonen, Henri}}</ref><ref name="icuwindows31j" /> '''Windows-932'''<ref name="icuwindows31j">{{cite web \| url=http://demo.icu-project.org/icu-bin/convexp?conv=ibm-943_P15A-2003&s=UTR22&s=IBM&s=WINDOWS&s=JAVA&s=IANA&s=MIME&s=- \| title=Converter Explorer: ibm-943_P15A-2003 (alias windows-31j) \| work=International Components for Unicode: ICU Demonstration}}</ref> or ambiguously '''CP932'''<ref>{{cite web\|url=https://www.debian.org/doc/manuals/debian-reference/ch11.en.html\|title=Chapter 11. 
Data conversion\|work=Debian Reference\|last=Aoki\|first=Osamu\|publisher=Debian}}</ref>), also called '''Windows-31J''' amongst other names (see [[#Terminology\|§ Terminology]] below), is the [[Microsoft Windows]] [[code page]] for the [[Japanese language]], which is an extended variant of the [[Shift JIS]] Japanese [[character encoding]]. It contains standard 7-bit [[ASCII]] codes, and Japanese characters are indicated by the high bit of the first byte being set to 1. Some code points in this page require a second byte, so characters use either 8 or 16 bits for encoding. IBM offer the same extended double-byte codes in their '''[[code page]] 943''' ('''IBM-943''' or '''CP943'''),<ref name="ibm932v943">{{cite web \| url=https://www.ibm.com/support/knowledgecenter/en/ssw_aix_71/com.ibm.aix.nlsgdrf/ibm-943_ibm-932.htm \| title=IBM-943 and IBM-932 \| publisher=IBM \| work=IBM Knowledge Center}}</ref> which is a combination of the single-byte [[Code page 897]] and the double-byte '''Code page 941'''.<ref name="ibm943">{{cite web \| url=http://www-01.ibm.com/software/globalization/ccsid/ccsid943.html \| title=Coded character set identifiers - CCSID 943 \| publisher=IBM \| work=IBM Globalization \| archive-url=https://web.archive.org/web/20160315110642/http://www-01.ibm.com/software/globalization/ccsid/ccsid943.html \| archive-date=2016-03-15}}</ref> Windows-31J is the most used non-[[UTF-8]]/Unicode Japanese encoding on the web. However, many people and software packages, including Microsoft libraries,<ref name="msdnlabels"/> declare the {{nowrap\|[[Shift JIS]]}} encoding for Windows-31J data, although it includes some additional characters, and some of the existing characters are mapped to [[Unicode]] differently. 
This has led the WHATWG HTML standard to treat the encoding labels {{code\|shift_jis}} and {{code\|windows-31j}} interchangeably, and use the Windows variant for its "Shift_JIS" encoder and decoder.<ref name="encoding_rs"/><!-- Per W3C / WHATWG standards, the labels Shift_JIS and Windows-31J are treated the same; the W3C/WHATWG spec uses the Shift JIS name, but its definition actually matches Windows-31J (not JIS X 0208 Appendix 1). -->

== Terminology ==
Microsoft's Shift JIS variant is known simply as "Code page 932" on Microsoft Windows; however, this is ambiguous, as [[IBM-932\|IBM's code page 932]], while also a Shift JIS variant, lacks the NEC and NEC-selected double-byte vendor extensions which are present in Microsoft's variant (although both include the IBM extensions) and preserves the 1978 ordering of JIS X 0208.<ref name="ibm932v943" /> IBM's code page 943 (or "IBM-943") includes the same double-byte codes as Windows code page 932.<ref name="ibm932v943" /> Microsoft's version corresponds closely to the encoding referred to as '''ibm-943_P15A-2003''' (with aliases including '''CP943C''' and '''Windows-932''')<ref name="icuwindows31j" /> in [[International Components for Unicode]] (ICU). There is also a second ICU encoding named '''ibm-943_P130-1999''',<ref name="icuibm943" /> which uses different single-byte mappings that more closely match IBM's code page definitions. (See [[#Single-byte character differences\|§ Single-byte character differences]] below for details.) 
Windows code page 932 is registered with the [[Internet Assigned Numbers Authority\|IANA]] as '''Windows-31J'''.<ref name="iana31j">{{cite web \| url=https://www.iana.org/assignments/character-sets/character-sets.xhtml \| publisher=IANA \| title=Character Sets}}</ref> The "Windows-31J" label is IANA's and not recognized by Microsoft, which has historically used "shift_jis" instead.<ref name="msdnlabels">{{cite web\|url=https://msdn.microsoft.com/en-us/library/system.text.encoding.windowscodepage(v=vs.110).aspx \|title=Encoding.WindowsCodePage Property - .NET Framework (current version) \|work=MSDN \|publisher=Microsoft}}</ref> The [[W3C]]/[[WHATWG]] encoding standard used by [[HTML5]] treats the label "'''shift_jis'''" interchangeably with "windows-31j" with the intent of being "compatible with deployed content"<ref>{{cite web \| url=https://encoding.spec.whatwg.org/#names-and-labels \| title=4.2. Names and labels \| publisher=WHATWG \| work=Encoding Standard \|last=van Kesteren \|first=Anne \|author-link=Anne van Kesteren}}</ref> and matches Windows code page 932<ref name="encoding_rs"/> (including the "formerly proprietary extensions from IBM and NEC").<ref>{{cite web \| url=https://encoding.spec.whatwg.org/#index-jis0208 \| title=5. Indexes (§ Index jis0208) \| publisher=WHATWG \| work=Encoding Standard \|last=van Kesteren \|first=Anne \|author-link=Anne van Kesteren}}</ref> Windows code page 932 is also called '''MS_Kanji''',<ref name="icuwindows31j" /><ref name="python">{{cite web \| url=https://docs.python.org/3.6/library/codecs.html#standard-encodings \| title=7.2.3. 
Standard Encodings \| publisher=Python Software Foundation \| work=Python 3.6 Documentation \| access-date=19 September 2017}}</ref> although IANA treat MS_Kanji as an alias for standard Shift JIS.<ref name="iana31j"/> [[Python (programming language)\|Python]], for example, uses the label <code>MS-Kanji</code> (or <code>cp932</code>) for Windows-932 and the label <code>Shift_JIS</code> (or <code>sjis</code>) for JIS X 0208-defined Shift JIS, without recognising the <code>Windows-31J</code> label.<ref name="python" /> In Japanese editions of Windows, this code page is [[Windows code page#ANSI code page\|referred to as "ANSI"]], since it is the operating system's default 8-bit encoding, even though [[ANSI]] was not involved in its definition.

== Differences from standard Shift JIS ==
Windows-31J is often mistaken for standard Shift JIS (as defined in [[JIS X 0208]]:1997 Appendix 1): while similar, the distinction is significant for computer programmers wishing to avoid [[mojibake]]. 
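The distinction can be observed with the Python codec labels mentioned under Terminology. The following is a minimal sketch, assuming a standard CPython build with the bundled Japanese codecs; the sample byte pair 0x8740 (a circled digit one from NEC row 13) is chosen here purely for illustration.

```python
import codecs

# Labels named in the text resolve to two distinct Python codecs.
windows_codec = codecs.lookup("ms-kanji").name   # alias of cp932
standard_codec = codecs.lookup("sjis").name      # alias of shift_jis

# 0x8740 lies in NEC row 13, an extension present in Windows-31J
# but absent from JIS X 0208-defined Shift JIS.
nec_bytes = b"\x87\x40"
decoded_windows = nec_bytes.decode("cp932")

try:
    nec_bytes.decode("shift_jis")
    strict_decode_failed = False
except UnicodeDecodeError:
    # The stricter codec has no mapping for this row.
    strict_decode_failed = True

print(windows_codec, standard_codec, decoded_windows, strict_decode_failed)
```

Data declared as "Shift_JIS" but actually produced on Windows may therefore need to be decoded with the <code>cp932</code> label in Python.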
=== Double-byte character differences ===
[[File:Euler diag for jp charsets.svg\|thumb\|[[Euler diagram]] comparing repertoires of [[JIS X 0208]], [[JIS X 0212]], [[JIS X 0213]], Windows-31J, the Microsoft standard repertoire and [[Unicode]] ]]
In addition to the standard [[JIS X 0201]]:1997 and [[JIS X 0208]]:1997 characters, Windows-31J includes several JIS X 0208 extensions, namely "[[JIS X 0208#0x2D\|NEC special characters]] (Row 13), NEC selection of IBM extensions (Rows 89 to 92), and IBM extensions (Rows 115 to 119)",<ref name="iana31j" /> in addition to setting some encoding space aside for [[Private Use Areas#Private-use characters in other character sets\|end user definition]].<ref>{{cite web \| url=http://archives.miloush.net/michkap/archive/2007/05/26/2901371.html \| title=The PUA outside of Unicode \| author=Kaplan, Michael S \| work=Sorting it all out \| date=2007-05-26}}</ref> This also differs from [[Code page 932 (IBM)\|IBM-932]], which does not include the NEC extensions or NEC selection.<ref name="ibm932v943"/>
The IBM extensions were designed to encode characters from the [[Japanese language in EBCDIC#Double-byte codes\|IBM Japanese DBCS-Host]] repertoire which were initially absent in JIS X 0208; the [[because sign\|'because' sign]] ∵ and [[not sign\|'not' sign]] ￢ were later added to JIS X 0208 itself in 1983, and Microsoft includes them at extension locations as well as their 1983 locations.<ref name="lundeE">{{citation\|mode=cs1 \|title=Appendix E: Vendor Character Set Standards \|work=CJKV Information Processing: Chinese, Japanese, Korean & Vietnamese Computing \|last=Lunde \|first=Ken \|author-link=Ken Lunde \|year=2009 \|edition=2nd \|publisher=[[O'Reilly Media\|O'Reilly]] \|___location=[[Sebastopol, CA]] \|isbn=978-0-596-51447-1 \|url=https://resources.oreilly.com/examples/9780596514471/blob/master/cjkvip2e-appE.pdf}}</ref> The NEC extensions also encode the entirety of the IBM repertoire, but in a separate extension within the 94×94 
JIS X 0208 grid (in rows 89–92, besides the characters already included in [[JIS X 0208#0x2D\|NEC row 13]]), rather than using Shift JIS codes beyond the JIS X 0208 range; Windows code page 932 includes these 388 characters in both locations.<ref name="lundeE"/> As a result, the 'because' and 'not' signs are encoded three times.
Some of these representations were subsequently used for different characters by [[JIS X 0213]] and [[Shift JIS-2004]]. For example, compare row 89 in JIS X 0213 (beginning 硃, 硎, 硏…)<ref>{{cite iso-ir \|number=233 \| title=Japanese Graphic Character Set for Information Interchange, Plane 1 \|sponsor=Japanese Industrial Standards Committee \|sponsor-link=Japanese Industrial Standards Committee \|date=2004-04-13}}</ref> to row 89 as used by JIS X 0208 with IBM/NEC extensions (beginning 纊, 褜, 鍈…).<ref>{{cite web \| url=https://encoding.spec.whatwg.org/jis0208.html \| title=Index jis0208 visualization \| publisher=WHATWG \| work=Encoding Standard \|last=van Kesteren \|first=Anne \|author-link=Anne van Kesteren}}</ref> Consequently, Shift JIS-2004 is not compatible with Windows-31J. 
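The triple encoding of the 'because' and 'not' signs can be checked with Python's <code>cp932</code> codec. This is only a sketch: the byte values below are taken from Microsoft's published CP932 mapping table and should be treated as assumptions if a different table is in use, and which of the three codes an encoder emits is left unasserted here.

```python
# Three Windows-31J byte sequences for each sign: the JIS X 0208 position,
# an NEC position (row 13 or the NEC selection), and the IBM extension.
BECAUSE_SIGN = "\u2235"    # ∵
FULLWIDTH_NOT = "\uffe2"   # ￢

because_codes = [b"\x81\xe6", b"\x87\x9a", b"\xfa\x5b"]
not_codes = [b"\x81\xca", b"\xee\xf9", b"\xfa\x54"]

# Decoding is many-to-one: all three sequences yield the same character.
because_decoded = [code.decode("cp932") for code in because_codes]
not_decoded = [code.decode("cp932") for code in not_codes]

# Encoding is one-to-one: the codec has to pick a single sequence.
because_encoded = BECAUSE_SIGN.encode("cp932")

print(because_decoded, not_decoded, because_encoded)
```

Because decoding is many-to-one, a decode followed by an encode does not necessarily reproduce the original bytes for these characters.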
In addition to the above, Microsoft uses different (but visually similar) Unicode mappings for several double-byte punctuation characters compared to standard Shift JIS, such as the [[wave dash]] being [[Tilde#Unicode and Shift JIS encoding of wave dash\|mapped to U+FF5E]] rather than U+301C,<ref name="w3cjpprof">{{cite web \| url = https://www.w3.org/TR/japanese-xml/#ambiguity_of_yen \| title = Ambiguities in conversion from Shift-JIS to Unicode (Non-Normative) \| work = XML Japanese Profile \| publisher=W3C}}</ref> which is followed by ibm-943_P15A-2003<ref>{{cite web \| url=http://demo.icu-project.org/icu-bin/convexp?conv=ibm-943_P15A-2003&b=81&s=ALL#layout \| title=Converter Explorer: ibm-943_P15A-2003: start byte 0x81 \| publisher=International Components for Unicode \| work=ICU Demonstration}}</ref> but not ibm-943_P130-1999,<ref>{{cite web \| url=http://demo.icu-project.org/icu-bin/convexp?conv=ibm-943_P130-1999&b=81&s=ALL#layout \| title=Converter Explorer: ibm-943_P130-1999: start byte 0x81 \| publisher=International Components for Unicode \| work=ICU Demonstration}}</ref> and uses a different mapping for the double-byte backslash.<ref name="w3cjpprof" />

=== Single-byte character differences ===
Windows-932 includes standard 7-bit [[ASCII]] mappings for single-byte sequences with the high bit set to 0. Hence, codes 0x5C and 0x7E are mapped to Unicode as U+005C REVERSE SOLIDUS (<code>\</code>, the [[backslash]]) and U+007E [[tilde\|TILDE]] (<code>~</code>) respectively,<ref name="msmapping">{{cite web \| url=https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT \| title=CP932.TXT \| publisher=Unicode Consortium}}</ref><ref name="msrefrender">{{cite web \| url=https://msdn.microsoft.com/en-us/library/cc194889.aspx \| title=Lead byte NULL — Code page 932 \| publisher=Microsoft}}</ref><ref name="w3cjpprof"/> as they are in ASCII ([[ISO 646\|ISO-646]]-US). 
This is likewise done by the W3C/WHATWG encoding standard.<ref>{{cite web \| url=https://encoding.spec.whatwg.org/#shift_jis-decoder \| title=12.3.1. Shift_JIS decoder \| publisher=WHATWG \| work=Encoding Standard \| quotation=If byte is an ASCII byte or 0x80, return a code point whose value is byte. \|last=van Kesteren \|first=Anne \|author-link=Anne van Kesteren}}</ref> By contrast, 0x5C is mapped to U+00A5 [[Yen sign\|YEN SIGN]] (<code>¥</code>) in [[Code page 895\|ISO-646-JP]] and consequently [[JIS X 0201]], of which standard [[Shift JIS]] is an extension. Correspondingly, Windows-31J avoids duplicate encoding of the backslash by mapping the double byte 0x815F to U+FF3C FULLWIDTH REVERSE SOLIDUS, whereas standard Shift JIS maps it to U+005C.<ref name="w3cjpprof" />
However, 0x5C in Windows-932 is nonetheless considered a Yen sign in certain contexts.<ref name="kaplan">{{cite web \| title=When is a backslash not a backslash? \| date=2005-09-17 \| author=Kaplan, Michael S. \| url=http://archives.miloush.net/michkap/archive/2005/09/17/469941.html \| work=Sorting it all out}}</ref> For this reason, in many Japanese fonts, U+005C is displayed as a Yen symbol, which would normally be represented as U+00A5, rather than as a backslash per Unicode's suggested rendering. U+00A5 is one-way best-fit mapped onto 0x5C in Windows-932. However, code 0x5C in Windows-932 behaves as a reverse solidus (backslash) in all respects (e.g. in [[filename\|file paths]] on Windows systems) other than how it is displayed by some fonts,<ref name="kaplan" /> and Microsoft's documentation for Windows-932 displays 0x5C as a backslash.<ref name="msrefrender" /> This mapping<ref name="msmapping" /> corresponds to the encoding named "ibm-943_P15A-2003" in [[International Components for Unicode]] (ICU),<ref name="icuwindows31j" /> except for minor reordering of a few [[C0 control characters]]. 
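The Windows-932 side of these mappings can be sketched in Python as follows. The single-byte checks use only the <code>cp932</code> codec, since how Python's own <code>shift_jis</code> codec treats byte 0x5C is not covered by the sources above; the wave dash comparison uses both codecs. The helper <code>to_cp932_bytes</code> is a hypothetical name introduced for this illustration and is not part of any library.

```python
# Single bytes below 0x80 decode as ASCII under cp932.
backslash = b"\x5c".decode("cp932")   # U+005C REVERSE SOLIDUS
tilde = b"\x7e".decode("cp932")       # U+007E TILDE

# The double-byte code 0x815F maps to the fullwidth form,
# so it stays distinct from single-byte 0x5C.
fullwidth_backslash = b"\x81\x5f".decode("cp932")

# Wave dash: 0x8160 decodes to U+FF5E under cp932 but U+301C under shift_jis.
wave_windows = b"\x81\x60".decode("cp932")
wave_standard = b"\x81\x60".decode("shift_jis")


def to_cp932_bytes(text: str) -> bytes:
    """Hypothetical helper: swap U+301C for U+FF5E before encoding,
    so text that passed through a standard Shift JIS decoder still encodes."""
    return text.replace("\u301c", "\uff5e").encode("cp932")


print(repr(backslash), repr(tilde), hex(ord(fullwidth_backslash)))
print(hex(ord(wave_windows)), hex(ord(wave_standard)))
print(to_cp932_bytes("\u301c"))
```

Whether such a substitution is appropriate depends on the application; it is shown only to illustrate why the two wave dash mappings do not round-trip with each other.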
[[Code page 437\|IBM-943]], like [[Code page 932 (IBM)\|IBM-932]],<ref name="ibm932v943"/> is a superset of the single-byte [[Code page 897]],<ref name="ibm943"/> which maps 0x5C to the Yen symbol (<code>¥</code>) and 0x7E to the overline (<code>‾</code>);<ref name="cp00897txt">{{cite web \| url=https://public.dhe.ibm.com/software/globalization/gcoc/attachments/CP00897.txt \| title=CP00897.txt \| publisher=IBM}}</ref> this is followed by the encoding named "ibm-943_P130-1999" in ICU.<ref name="icuibm943">{{cite web \| url=http://demo.icu-project.org/icu-bin/convexp?conv=ibm-943 \| work=International Components for Unicode: ICU Demonstration \| title=Converter Explorer: ibm-943_P130-1999}}</ref> Code page 897 (and therefore also IBM-943 and IBM-932) also adds single-byte box-drawing characters replacing certain [[C0 control characters]];<ref name="cp00897txt" /> however, these may still be treated as control characters depending on the context,<ref>{{cite web \| url=http://www-01.ibm.com/software/globalization/cp/cp00897.html \| title=Code page identifiers - CP 00897 \| publisher=IBM \| work=IBM Globalization \| url-status=dead \| archive-url=https://web.archive.org/web/20160317053427/http://www-01.ibm.com/software/globalization/cp/cp00897.html \| archive-date=2016-03-17}}</ref> and are mapped to control characters in ICU.<ref name="icuibm943" />

==Layout==

==See also==
* [[LMBCS-16]]
* [[Code page 942]]

==References==

==External links==
=== Microsoft related ===
* [https://web.archive.org/web/20180405210602/http://msdn.microsoft.com/en-us/library/cc194887.aspx Microsoft's Reference for Windows Code Page 932]
* [https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit932.txt Code page file for MS932]
* [https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT Mapping of Microsoft's Code Page 932 to Unicode]
* [http://demo.icu-project.org/icu-bin/convexp?conv=windows-31j ICU Code Page 943C (ibm-943_P15A-2003 alias windows-31j) demonstration]
=== IBM related ===
* [https://web.archive.org/web/20160315110642/http://www-01.ibm.com/software/globalization/ccsid/ccsid943.html IBM's documentation of Code Page 943]
* [http://demo.icu-project.org/icu-bin/convexp?conv=ibm-943 ICU Code Page 943 (ibm-943_P130-1999) demonstration]
* [https://raw.githubusercontent.com/unicode-org/icu/master/icu4c/source/data/mappings/ibm-943_P130-1999.ucm ICU mapping for ibm-943_P130-1999 to Unicode]

{{character encoding}}
[[Category:Windows code pages\|932]]
[[Category:Encodings of Japanese]]