Unicode in Microsoft Windows: Difference between revisions

Content deleted Content added
Windows NT based systems: IsTextUnicode is a tiny tiny detail. It is filenames that are the problem
UTF-8: No, chcp did not work. Conversely, remove scare tactics; functions that think characters have only two bytes do not "fail" when handed the prefix of a UTF-8 character
Line 22:
 
== UTF-8 ==
Although Microsoft was one of the earliest proponents of Unicode, it can be claimed that Windows does not support Unicode in portable files. This is because, due to a number of odd decisions, the file system API used by the standard interfaces in C and C++ libraries cannot be made to accept [[UTF-8]], which is the standard method of providing them with Unicode. Microsoft Windows has a code page designated for UTF-8, [[code page 65001]]. Until recently it was impossible to set the locale code page to 65001 (the code page was only available to explicit conversion functions such as MultiByteToWideChar). If this were possible, it would be possible to write code that opens a file using a UTF-8 string. There are (were?) also serious problems with getting Microsoft compilers to produce UTF-8 string constants. The most reliable method is to turn ''off'' UNICODE, ''not'' mark the input file as being UTF-8, and arrange for the string constants to contain the UTF-8 bytes (perhaps using an editor that can edit UTF-8 but does not put a byte order mark at the start of the saved file).
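A minimal sketch of a string constant that carries UTF-8 bytes regardless of how the compiler interprets the source file. It swaps the editor-based approach described above for hex escapes, which keep the source pure ASCII; this is an illustrative alternative, not a method taken from the article's sources. The names <code>kCafe</code> and <code>utf8_byte_length</code> are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// "é" is U+00E9, which UTF-8 encodes as the two bytes 0xC3 0xA9.
// Spelling the bytes as hex escapes keeps the source file pure ASCII, so the
// compiler has no source characters to translate. The literal is split after
// the escapes as a habit: a hex digit directly after an escape would otherwise
// be absorbed into it.
static const char kCafe[] = "caf\xC3\xA9" ".txt";

// Hypothetical helper: number of bytes (not characters) in a narrow string.
inline std::size_t utf8_byte_length(const char* s) {
    assert(s != nullptr);
    return std::strlen(s);
}
```

The constant holds nine bytes for eight characters, because the accented letter occupies two bytes.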
Microsoft Windows has a code page designated for [[UTF-8]], [[code page 65001]]. Prior to Windows 10 insider build 17035 (November 2017)<ref>{{cite web|title=Windows10 Insider Preview Build 17035 Supports UTF-8 as ANSI|url=https://news.ycombinator.com/item?id=15710685|website=Hacker News|accessdate=7 May 2018}}</ref>, it was impossible to set the locale code page to 65001, leaving this code page only available for:
 
There are (were?) proposals to add a new API to portable libraries such as [[Boost (C++ libraries)|Boost]] to do the necessary conversion, with new functions for opening and renaming files that take UTF-8. These functions would pass filenames through unchanged on Unix, but translate them to UTF-16 on Windows.<ref>{{cite web|url=http://cppcms.com/files/nowide/html/|title=Boost.Nowide}}</ref>
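The pass-through-or-translate idea can be sketched as follows. This is not the Boost.Nowide implementation; <code>fopen_utf8</code> and <code>widen</code> are hypothetical names, and the Windows branch assumes the Win32 <code>MultiByteToWideChar</code> function and the Microsoft C runtime's <code>_wfopen</code>.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

#ifdef _WIN32
#include <windows.h>

// Convert a NUL-terminated UTF-8 string to UTF-16 with the Win32 converter.
// Returns an empty string if the input is empty or is not valid UTF-8.
static std::wstring widen(const char* utf8) {
    // With a length of -1 the returned count includes the terminating NUL.
    int n = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, utf8, -1, nullptr, 0);
    if (n <= 0) return std::wstring();
    std::wstring out(static_cast<std::size_t>(n), L'\0');
    MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, utf8, -1, &out[0], n);
    out.resize(static_cast<std::size_t>(n) - 1);  // drop the terminating NUL
    return out;
}
#endif

// Hypothetical wrapper: open a file whose name is given as UTF-8.
std::FILE* fopen_utf8(const char* path, const char* mode) {
    assert(path != nullptr && mode != nullptr);
#ifdef _WIN32
    // Windows: translate to UTF-16 and call the wide-character runtime function.
    std::wstring wpath = widen(path);
    std::wstring wmode = widen(mode);
    if (wpath.empty() || wmode.empty()) return nullptr;
    return _wfopen(wpath.c_str(), wmode.c_str());
#else
    // Unix: the bytes are handed to the system unchanged.
    return std::fopen(path, mode);
#endif
}
```

Callers use the wrapper exactly like <code>fopen</code>, so the same source compiles on both platforms without any wide-character types appearing in application code.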
* Explicit conversion functions such as MultiByteToWideChar
* A manual "chcp" command that only changes the code page for the current program's context. This is used for [[conhost.exe]] windows running [[Windows Subsystem for Linux]].
 
Many applications inevitably have to support UTF-8 because it is the most-used Unicode encoding scheme in various [[network protocol]]s, including the [[Internet Protocol Suite]]. An application which has to pass UTF-8 to or from a 'W' [[Windows API]] should call the functions [[MultiByteToWideChar]] and WideCharToMultiByte.<ref>{{cite web|url=https://stackoverflow.com/questions/166503/utf-8-in-windows|title=UTF-8 in Windows|publisher=[[Stack Overflow]]|accessdate=July 1, 2011}}</ref> To get predictable handling of errors and surrogate halves, and to ensure UTF-8 is used, it is more common for software to implement its own versions of these functions.
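A minimal sketch of such a self-implemented conversion, assuming one particular error policy: when a sequence is invalid, its first byte is replaced by U+FFFD and decoding resumes at the next byte. The function name <code>utf8_to_utf16</code> and the policy are illustrative assumptions, not the behaviour of any specific library.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Decode UTF-8 into UTF-16 with a fixed, platform-independent error policy.
std::u16string utf8_to_utf16(const std::string& in) {
    const char16_t kReplacement = 0xFFFD;
    std::u16string out;
    const std::size_t n = in.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;    // code point being assembled
        char32_t min = 0;   // smallest code point allowed for this length
        std::size_t len = 0;  // stays 0 if b0 cannot start a sequence
        if (b0 < 0x80)                { cp = b0;        len = 1; min = 0; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; len = 2; min = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; len = 3; min = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; len = 4; min = 0x10000; }
        bool ok = len != 0 && i + len <= n;
        for (std::size_t k = 1; ok && k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80) ok = false;      // not a continuation byte
            else cp = (cp << 6) | (b & 0x3F);
        }
        // Reject overlong forms, encoded surrogate halves, and values past U+10FFFF.
        if (ok && (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))) {
            ok = false;
        }
        if (!ok) {
            out.push_back(kReplacement);
            ++i;  // resume at the next byte
            continue;
        }
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Split into a high and a low surrogate.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
        i += len;
    }
    // Each input byte produces at most one UTF-16 code unit.
    assert(out.size() <= in.size());
    return out;
}
```

Because the policy is written out in the program itself, the output for malformed input is the same on every Windows version and on other platforms.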
Since insider build 17035 and the April 2018 update (nominal build 17134) for Windows 10, a "Beta: Use Unicode UTF-8 for worldwide language support" checkbox is available for setting the locale code page to UTF-8.{{efn|1=Found under Control Panel, "Region" entry, "Administrative" tab, "Change system locale" button.}} However, this option can break legacy applications as they internally call old "[[DBCS]]" APIs which only support a maximum of 2 bytes in a character, such as IsDBCSLeadByte.
 
Since insider build 17035 and the April 2018 update (nominal build 17134) for Windows 10,<ref>{{cite web|title=Windows10 Insider Preview Build 17035 Supports UTF-8 as ANSI|url=https://news.ycombinator.com/item?id=15710685|website=Hacker News|accessdate=7 May 2018}}</ref> a "Beta: Use Unicode UTF-8 for worldwide language support" checkbox is available for setting the locale code page to UTF-8.{{efn|1=Found under Control Panel, "Region" entry, "Administrative" tab, "Change system locale" button.}} Assuming a process can just force this state on itself at startup, and that the compilers have been improved to not translate bytes from the source when making string constants, it can be claimed that Windows has solved this problem and now fully supports Unicode.
There are proposals to add an API to portable libraries such as [[Boost (C++ libraries)|Boost]] to do the necessary conversion, by adding new functions for opening and renaming files. These functions would pass filenames through unchanged on Unix, but translate them to UTF-16 on Windows.<ref>{{cite web|url=http://cppcms.com/files/nowide/html/|title=Boost.Nowide}}</ref>
 
Many applications inevitably have to support UTF-8 because it is the most-used Unicode encoding scheme in various [[network protocol]]s, including the [[Internet Protocol Suite]]. An application which has to pass UTF-8 to or from a 'W' [[Windows API]] should call the functions [[MultiByteToWideChar]] and WideCharToMultiByte.<ref>{{cite web|url=https://stackoverflow.com/questions/166503/utf-8-in-windows|title=UTF-8 in Windows|publisher=[[Stack Overflow]]|accessdate=July 1, 2011}}</ref> To get predictable handling of errors and surrogate halves it is more common for software to implement its own versions of these functions.
 
==Notes==