Unicode in Microsoft Windows

{{Short description|Overview on Unicode implementation in Microsoft Windows}}
{{more citations needed|date=June 2011}}
[[Microsoft]] was one of the first companies to implement [[Unicode]] in their products. [[Windows NT]] was the first operating system that used "wide characters" in [[system call]]s. Using the (now obsolete) [[UCS-2]] encoding scheme at first, it was upgraded to the [[variable-width encoding]] [[UTF-16]] starting with [[Windows 2000]], allowing a representation of additional planes with surrogate pairs. However, Microsoft did not support [[UTF-8]] in its API until May 2019.
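
The surrogate-pair mechanism mentioned above can be illustrated with a short sketch. It is written in Python purely for illustration and is not Windows code; the function names <code>to_surrogate_pair</code> and <code>utf16_code_units</code> are hypothetical. The arithmetic follows the UTF-16 definition: subtract 0x10000 from the code point, then split the remaining 20 bits into two 10-bit halves.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary-plane code point into UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF need a surrogate pair")
    offset = code_point - 0x10000      # 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits go into the low surrogate
    return high, low


def utf16_code_units(text):
    """Return the 16-bit code units of text, using surrogate pairs where needed."""
    units = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x10000:
            units.append(cp)           # one code unit, identical to UCS-2
        else:
            units.extend(to_surrogate_pair(cp))
    return units
```

A character in the Basic Multilingual Plane still occupies one 16-bit unit, exactly as in UCS-2; only characters from the additional planes take two units.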
 
Before 2019, Microsoft emphasized UTF-16 (i.e. the -W API), but it has since recommended using [[UTF-8]] (at least in some cases)<ref name="Microsoft-UTF-8" /> on Windows and [[Xbox]] (and in some of its other products), even stating that "UTF-8 is the universal code page for internationalization [and] UTF-16 [... is] a unique burden that Windows places on code that targets multiple platforms. [..] Windows [is] moving forward to support UTF-8 to remove this unique burden [resulting] in fewer internationalization issues in apps and games".<ref name="Microsoft GDK" />
 
A large amount of Microsoft documentation uses the word "Unicode" to refer explicitly to the UTF-16 encoding. Anything else, including UTF-8, is not "Unicode" in Microsoft's outdated language (while UTF-8 and UTF-16 are both Unicode according to [[Unicode|the Unicode Standard]], or encodings/"transformation formats" thereof).
 
== In various Windows families ==
 
=== Windows NT based systems ===
Current Windows versions, all versions back to [[Windows XP]], and the earlier [[Windows NT]] (3.x, 4.0) are shipped with [[Windows API|system libraries]] that support string [[character encoding|encoding]] of two types: 16-bit "Unicode" ([[UTF-16]] since [[Windows 2000]]) and a (sometimes multibyte) encoding called the "[[Windows code page|code page]]" (or incorrectly referred to as ''[[American National Standards Institute|ANSI]] code page''). 16-bit functions have names suffixed with 'W' (from [[wide character|"wide"]]) such as <code>SetWindowTextW</code>. Code page oriented functions use the suffix 'A' for "ANSI" such as <code>SetWindowTextA</code> (some other conventions were used for APIs that were copied from other systems, such as <code>_wfopen/fopen</code> or <code>wcslen/strlen</code>). This split was necessary because many languages, including [[C (programming language)|C]], did not provide a clean way to pass both 8-bit and 16-bit strings to the same function. Most such 'A' functions are implemented as a [[Wrapper function|wrapper]] that translates the code page to UTF-16 and calls the 'W' function.
 
[[Microsoft]] attempted to support Unicode "portably" by providing a "UNICODE" switch to the compiler, which switches unsuffixed "generic" calls from the 'A' to the 'W' interface and converts all string constants to "wide" UTF-16 versions.<ref>{{cite web|title=Unicode in the Windows API|url=https://msdn.microsoft.com/en-us/library/windows/desktop/dd374089%28v=vs.85%29.aspx|access-date=7 May 2018}}</ref><ref>{{cite web|title=Conventions for Function Prototypes (Windows)|url=https://msdn.microsoft.com/en-us/library/windows/desktop/dd317766(v=vs.85).aspx|website=MSDN|access-date=7 May 2018|language=en}}</ref> This does not actually work because it does not translate UTF-8 outside of string constants, resulting in code that attempts to open files just not compiling.{{citation needed|date=October 2019}}
 
Earlier, and independent of the "UNICODE" switch, Windows also provided the Multibyte Character Sets (MBCS) API switch.<ref>{{cite web|title=Support for Multibyte Character Sets (MBCSs)|url=https://docs.microsoft.com/en-us/cpp/text/support-for-multibyte-character-sets-mbcss?view=vs-2019|access-date=2020-06-15|language=en}}</ref> This changes some functions that don't work in MBCS, such as <code>strrev</code>, to an MBCS-aware one such as <code>_mbsrev</code>.<ref>{{cite web|title=Double-byte Character Sets|url=https://docs.microsoft.com/en-us/windows/win32/intl/double-byte-character-sets|website=MSDN|access-date=2020-06-15|date=2018-05-31|quote=our applications use DBCS Windows code pages with the "A" versions of Windows functions.}}</ref><ref>[https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/strrev-wcsrev-mbsrev-mbsrev-l _strrev, _wcsrev, _mbsrev, _mbsrev_l] Microsoft Docs</ref>
 
=== Windows CE ===
In (the now discontinued) [[Windows CE]], UTF-16 was used almost exclusively, with the 'A' API mostly missing.<ref>{{cite web|title=Differences Between the Windows CE and Windows NT Implementations of TAPI|url=https://msdn.microsoft.com/en-us/library/aa454022.aspx|website=MSDN|date=28 August 2006 |access-date=7 May 2018|quote=Windows CE is Unicode-based. You might have to recompile source code that was written for a Windows NT-based application.}}</ref> A limited set of ANSI APIs is available in Windows CE 5.0, for use on a reduced set of locales that may be selectively built onto the runtime image.<ref>{{cite web|title=Code Pages (Windows CE 5.0)|url=https://docs.microsoft.com/en-us/previous-versions/windows/embedded/ms903783(v=msdn.10)|website=Microsoft Docs|date=14 September 2012 |access-date=7 May 2018|language=en-us}}</ref>
{{expand section|date=June 2011}}
 
=== Windows 9x ===
{{Main article|Microsoft Layer for Unicode}}
In 2001, Microsoft released a special supplement to Microsoft's old [[Windows 9x]] systems. It includes a dynamic link library, 'unicows.dll' (only 240&nbsp;KB), containing the 16-bit flavor (the ones with the letter W on the end) of all the basic functions of the Windows API. It is merely a translation layer: <code>SetWindowTextW</code> will simply convert its input using the current codepage and call <code>SetWindowTextA</code>.
 
== UTF-8 ==
Microsoft Windows ([[Windows XP]] and later) has a code page designated for [[UTF-8]], code page 65001<ref>{{cite web|title=Code Page Identifiers (Windows)|url=https://msdn.microsoft.com/en-us/library/windows/desktop/dd317756(v=vs.85).aspx|website=msdn.microsoft.com| date=7 January 2021 |language=en}}</ref> or <code>CP_UTF8</code>. For a long time, it was impossible to set the locale code page to 65001, leaving this code page only available for a) explicit conversion functions such as MultiByteToWideChar and/or b) the [[Win32 console]] command <code>chcp 65001</code> to translate stdin/out between UTF-8 and UTF-16. This meant that "narrow" functions, in particular <code>[[C file input/output#fopen|fopen]]</code> (which opens files), couldn't be called with UTF-8 strings, and in fact there was no way to open all possible files using <code>fopen</code> no matter what the locale was set to and/or what bytes were put in the string, as none of the available locales could produce all possible UTF-16 characters. This problem also applied to all other APIs that take or return 8-bit strings, including Windows ones such as <code>SetWindowText</code>.
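
The limitation of legacy code pages can be illustrated with a small sketch. It uses Python's bundled codecs as stand-ins for three Windows code pages and does not call any Windows API; the helper names <code>can_encode</code> and <code>usable_code_pages</code> and the sample file names are hypothetical.

```python
def can_encode(name, code_page):
    """Return True if every character of name exists in the given code page."""
    try:
        name.encode(code_page)
        return True
    except UnicodeEncodeError:
        return False


def usable_code_pages(name, code_pages=("cp1252", "cp1251", "cp932")):
    """List the candidate code pages that can represent name without loss."""
    return [cp for cp in code_pages if can_encode(name, cp)]


# A name mixing Japanese and Arabic fits none of the three legacy code pages,
# but it is representable in UTF-8 (and in UTF-16).
mixed_name = "日本語-عربي.txt"
```

With a name like <code>mixed_name</code>, no choice among these locales lets a "narrow" function receive the file name intact, which is the situation the paragraph above describes.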
 
Programs that wanted to use UTF-8, in particular code intended to be portable to other operating systems, needed a workaround for this deficiency. The usual workaround was to add new functions to open files that convert UTF-8 to UTF-16 using [[MultiByteToWideChar]] and call the "wide" function instead of <code>fopen</code>.<ref>{{cite web|url=https://stackoverflow.com/questions/166503/utf-8-in-windows|title=UTF-8 in Windows|publisher=[[Stack Overflow]]|access-date=July 1, 2011}}</ref> Dozens of multi-platform libraries added wrapper functions to do this conversion on Windows (and pass UTF-8 through unchanged on others); an example is a proposed addition to [[Boost (C++ libraries)|Boost]], {{tt|Boost.Nowide}}.<ref>{{cite web|url=https://github.com/boostorg/nowide|title=Boost.Nowide|website=[[GitHub]]}}</ref> Another popular workaround was to convert the name to the [[8.3 filename]] equivalent; this is necessary if the <code>fopen</code> is inside a library. None of these workarounds are considered good, as they require changes to code that already works on non-Windows systems.
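
The shape of such a wrapper can be sketched as follows. This is a portable model in Python, not the Boost.Nowide implementation and not a call to MultiByteToWideChar: the hypothetical <code>native_filename</code> only shows which representation of the name would be handed to the operating system on each platform.

```python
import struct
import sys


def utf8_to_utf16_units(utf8_bytes):
    """Convert UTF-8 bytes to a list of UTF-16 code units.

    Decoding is strict, so malformed UTF-8 raises UnicodeDecodeError.
    """
    text = utf8_bytes.decode("utf-8")
    raw = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))


def native_filename(utf8_name, platform=sys.platform):
    """Return the file name in the form the platform's file API expects.

    On Windows ("win32") the name is converted to UTF-16 code units for a
    "wide" function; elsewhere the UTF-8 bytes pass through unchanged.
    """
    if platform == "win32":
        return utf8_to_utf16_units(utf8_name)
    return utf8_name
```

Callers keep every file name as UTF-8 and convert only at the last moment, so the rest of the program is identical on all platforms.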
 
In April 2018 (or possibly November 2017<ref>{{cite web|title=Windows10 Insider Preview Build 17035 Supports UTF-8 as ANSI|url=https://news.ycombinator.com/item?id=15710685|website=Hacker News|access-date=7 May 2018}}</ref>), with insider build 17035 (nominal build 17134) for Windows 10, a "Beta: Use Unicode UTF-8 for worldwide language support" checkbox appeared for setting the locale code page to UTF-8.{{efn|1=Found under control panel, "Region" entry, "Administrative" tab, "Change system locale" button.}} This allows for calling "narrow" functions, including <code>fopen</code> and <code>SetWindowTextA</code>, with UTF-8 strings. However, this is a system-wide setting and a program cannot assume it is set.
 
In May 2019, Microsoft added the ability for a program to set the code page to UTF-8 itself,<ref name="Microsoft-UTF-8">{{cite web|title=Use UTF-8 code pages in Windows apps|url=https://learn.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page |access-date=2020-06-06 |quote=As of Windows version&nbsp;1903 (May&nbsp;2019 update), you can use the ActiveCodePage property in the appxmanifest for packaged apps, or the fusion manifest for unpackaged apps, to force a process to use UTF-8 as the process code page. [...] <code>CP_ACP</code> equates to <code>CP_UTF8</code> only if running on Windows version&nbsp;1903 (May&nbsp;2019 update) or above and the ActiveCodePage property described above is set to UTF-8. Otherwise, it honors the legacy system code page. We recommend using <code>CP_UTF8</code> explicitly. |website=learn.microsoft.com |language=en-us}}</ref><ref>{{cite web|url=https://skanthak.homepage.t-online.de/quirks.html#quirk31|title=Windows 10 1903 and later versions finally support UTF-8 with the A forms of the Win32 functions}}</ref> allowing programs written to use UTF-8 to be run by non-expert users.
 
{{As of|2019}}, Microsoft recommends programmers use UTF-8 (e.g. instead of any other 8-bit encoding),<ref name="Microsoft-UTF-8">{{cite web|title=Use UTF-8 code pages in Windows apps|url=https://learn.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page |access-date=2020-06-06 |quote=As of Windows version&nbsp;1903 (May&nbsp;2019 update), you can use the ActiveCodePage property in the appxmanifest for packaged apps, or the fusion manifest for unpackaged apps, to force a process to use UTF-8 as the process code page. [...] <code>CP_ACP</code> equates to <code>CP_UTF8</code> only if running on Windows version&nbsp;1903 (May&nbsp;2019 update) or above and the ActiveCodePage property described above is set to UTF-8. Otherwise, it honors the legacy system code page. We recommend using <code>CP_UTF8</code> explicitly. |website=learn.microsoft.com |language=en-us}}</ref> on Windows and [[Xbox]], and may be recommending its use instead of UTF-16, even stating "UTF-8 is the universal code page for internationalization [and] UTF-16 [..] is a unique burden that Windows places on code that targets multiple platforms."<ref name="Microsoft GDK">{{Cite web |title=UTF-8 support in the Microsoft Game Development Kit (GDK) - Microsoft Game Development Kit |url=https://learn.microsoft.com/en-us/gaming/gdk/_content/gc/system/overviews/utf-8 |access-date=2023-03-05 |website=learn.microsoft.com |date=19 August 2022 |language=en-us |quote=By operating in UTF-8, you can ensure maximum compatibility [..] Windows operates natively in UTF-16 (or WCHAR), which requires code page conversions by using MultiByteToWideChar and WideCharToMultiByte. This is a unique burden that Windows places on code that targets multiple platforms. [..] The Microsoft Game Development Kit (GDK) and Windows in general are moving forward to support UTF-8 to remove this unique burden of Windows on code targeting or interchanging with multiple platforms and the web. 
Also, this results in fewer internationalization issues in apps and games and reduces the test matrix that's required to get it right.}}</ref> Microsoft does appear to be transitioning to UTF-8, stating it previously emphasized its alternative, and in [[Windows 11]] some system files are required to use UTF-8 and do not require a Byte Order Mark.<ref>{{Cite web|title=Customize the Windows 11 Start menu|url=https://docs.microsoft.com/en-us/windows-hardware/customize/desktop/customize-the-windows-11-start-menu|access-date=2021-06-29|website=docs.microsoft.com|language=en-us|quote=Make sure your LayoutModification.json uses UTF-8 encoding.}}</ref> Notepad can now recognize UTF-8 without the Byte Order Mark, and can be told to write UTF-8 without a Byte Order Mark.{{cn|date=November 2022}} Some other Microsoft products are using UTF-8 internally, including Visual Studio<ref>{{cite web|title=New Options for Managing Character Sets in the Microsoft C/C++ Compiler|url=https://devblogs.microsoft.com/cppblog/new-options-for-managing-character-sets-in-the-microsoft-cc-compiler/#how-the-microsoft-c/c++-compiler-reads-text-from-a-file|website=devblogs.microsoft.com| date=22 February 2016 |language=en |quote=At some point in the past, the Microsoft compiler was changed to use UTF-8 internally. So, as files are read from disk, they are converted into UTF-8 on the fly.}}</ref><ref>{{ cite web | title=validate-charset (validate for compatible characters) | website=docs.microsoft.com |language=en-us | url=https://docs.microsoft.com/en-us/cpp/build/reference/validate-charset-validate-for-compatible-characters | access-date=2021-07-19 | quote=Visual Studio uses UTF-8 as the internal character encoding during conversion between the source character set and the execution character set. 
}}</ref> and their [[SQL Server 2019]], with Microsoft claiming 35% speed increase from use of UTF-8, and "nearly 50% reduction in storage requirements."<ref>{{Cite web|date=2019-07-02|title=Introducing UTF-8 support for SQL Server|url=https://techcommunity.microsoft.com/t5/sql-server/introducing-utf-8-support-for-sql-server/ba-p/734928|quote=For example, changing an existing column data type from NCHAR(10) to CHAR(10) using an UTF-8 enabled collation, translates into nearly 50% reduction in storage requirements. [..] In the ASCII range, when doing intensive read/write I/O on UTF-8<!-- " " in quote, but ok to strip-->, we measured an average 35% performance improvement over UTF-16 using clustered tables with a non-clustered index on the string column, and an average 11% performance improvement over UTF-16 using a heap. |access-date=2021-08-24|website=techcommunity.microsoft.com|language=en}}</ref>
 
=== String constants in Visual Studio ===
Before 2019 Microsoft's compilers could not produce UTF-8 string constants from UTF-8 source files. This was because they converted all strings to the locale code page (which could not be UTF-8). At one time the only method to work around this was to turn ''off'' {{tt|UNICODE}}, and ''not'' mark the input file as being UTF-8 (i.e. do not use a [[UTF-8#Byte order mark|BOM]]).<ref>[http://utf8everywhere.org/#faq.literal UTF-8 Everywhere FAQ: How do I write UTF-8 string literal in my C++ code?] (note that the {{tt|u8"text"}} proposed solution does not work, string is still mangled)</ref> This would make the compiler think both the input and outputs were in the same single-byte locale, and leave strings unmolested.
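
The effect described above can be modelled in a few lines. This is a simplified sketch in Python and not a description of how the Microsoft compiler is implemented: the hypothetical <code>compile_literal</code> decodes the literal's bytes with the encoding the compiler assumes for the source file, then re-encodes them in the execution (locale) code page. Code page 1252 is an assumed example locale.

```python
def compile_literal(source_bytes, assumed_source_encoding, execution_encoding):
    """Model a compiler transcoding a string literal from source to execution charset."""
    text = source_bytes.decode(assumed_source_encoding)
    return text.encode(execution_encoding)


utf8_source = "é".encode("utf-8")  # the bytes C3 A9 as stored in the source file

# Source recognised as UTF-8: the literal is converted to the locale code page,
# so the program no longer contains UTF-8 bytes.
converted = compile_literal(utf8_source, "utf-8", "cp1252")

# Source taken to be in the locale code page already: decoding and re-encoding
# cancel out, and the original UTF-8 bytes survive in the compiled constant.
preserved = compile_literal(utf8_source, "cp1252", "cp1252")
```

In this model <code>converted</code> is the single byte E9 while <code>preserved</code> is still C3 A9. The sample literal was chosen so that both of its bytes are defined in Python's <code>cp1252</code> codec; a few byte values are undefined there and would not round-trip in this sketch.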
 
== See also ==
* [[Bush hid the facts]], a text encoding [[mojibake]]
 
== Notes ==
{{notelist}}
 
 
== External links ==
* {{cite web |url=http://msdn.microsoft.com/en-us/library/dd374081(v=vs.85).aspx |title=Unicode |work=[[MSDN]] |publisher=[[Microsoft]] |access-date=November 10, 2016}}
 
[[Category:Windows technology|Unicode]]