Unicode in Microsoft Windows: Difference between revisions

UTF-8: No, that DOES NOT WORK for UTF-8!!!!! Please read the previous sentence.
The 'M’ API set is not a thing, so rewrite according to MBCS docs (It's not Unicode, but it still explains the old UTF-8 rejection). Rewrote first paragraph of UTF-8, since Windows 10 1803 apparently has that option now. Also, chcp does work for UTF-8 even before Nov 2017; it was there when WSL came out. Just get a copy of Windows 10 ffs.
 
=== Windows NT based systems ===
Modern Windows versions like [[Windows XP]] and [[Windows Server 2003]], and prior to them [[Windows NT]] (3.x, 4.0) and Windows 2000 are shipped with [[Windows API|system libraries]] which support string [[character encoding|encoding]] of two types: UTF-16 (often called "Unicode" in Windows documentation) and an 8-bit local (sometimes multibyte) encoding called the "[[Windows code page|code page]]" (or incorrectly referred to as ''ANSI code page''). 16-bit functions have names suffixed with -W (from [[wide character|"wide"]]), for example, lstrlenW(). Code page oriented functions use the suffix -A, e.g., lstrlenA(), for "ANSI". This split was necessary because many languages, including C, did not provide a clean way to pass both 8-bit and 16-bit strings to the same function. For the C/C++ languages, however, Windows provides [[C preprocessor]] macros that define an unsuffixed "generic" version which switches between the 'A' and 'W' API depending on a <code>UNICODE</code> macro.<ref>{{cite web|title=Unicode in the Windows API|url=https://msdn.microsoft.com/en-us/library/windows/desktop/dd374089%28v=vs.85%29.aspx|accessdate=7 May 2018}}</ref><ref>{{cite web|title=Conventions for Function Prototypes (Windows)|url=https://msdn.microsoft.com/en-us/library/windows/desktop/dd317766(v=vs.85).aspx|website=MSDN|accessdate=7 May 2018|language=en}}</ref> Most such 'A' functions are implemented as a [[Wrapper function|wrapper]] that translates the text from the code page to UTF-16 and calls the 'W' function.
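
The macro-based switch can be illustrated with a small portable sketch. It does not use the real Windows headers: <code>demo_strlenA</code>, <code>demo_strlenW</code>, <code>demo_strlen</code> and <code>demo_tchar</code> are hypothetical names that only mimic the shape of an 'A'/'W' pair and its unsuffixed generic name, and <code>uint16_t</code> stands in for a UTF-16 code unit.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an 'A' function: counts 8-bit code units. */
static size_t demo_strlenA(const char *s) {
    size_t n = 0;
    while (s[n] != '\0') n++;
    return n;
}

/* Hypothetical stand-in for a 'W' function: counts 16-bit code units. */
static size_t demo_strlenW(const uint16_t *s) {
    size_t n = 0;
    while (s[n] != 0) n++;
    return n;
}

/* The unsuffixed "generic" name is only a macro that expands to one of
   the two functions, depending on whether UNICODE is defined. */
#ifdef UNICODE
typedef uint16_t demo_tchar;
#define demo_strlen demo_strlenW
#else
typedef char demo_tchar;
#define demo_strlen demo_strlenA
#endif
```

Compiled without <code>UNICODE</code>, a call to <code>demo_strlen("abc")</code> expands to <code>demo_strlenA("abc")</code>. Compiled with <code>-DUNICODE</code>, the same source line would expect a 16-bit string instead, which is why code written against the generic names has to use a matching generic character type.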
 
Independent of the "UNICODE" switch, Windows also provides the "MBCS" API switch.<ref>{{cite web|title=Support for Multibyte Character Sets (MBCSs)|url=https://msdn.microsoft.com/en-us/library/5z097dxa.aspx|language=en}}</ref> This switch turns on some C functions prefixed with <code>_mbs</code>, and selects the 'A' functions for the current locale.<ref>{{cite web|title=Double-byte Character Sets|url=https://msdn.microsoft.com/en-us/library/windows/desktop/dd317794(v=vs.85).aspx|website=MSDN|accessdate=7 May 2018|quote=our applications use DBCS Windows code pages with the "A" versions of Windows functions.}}</ref>
 
The <code>IsTextUnicode</code> function uses a [[heuristic algorithm]] on a [[byte string]] passed to it to detect whether this string represents UTF-16 text. For very short texts, this function, used by some applications like [[Microsoft Notepad|Notepad]], often gives incorrect results. This gave rise to legends about the existence of [[Easter egg (computing)|"Easter eggs"]] like [[Bush hid the facts]].<ref>{{cite web|url=http://blogs.msdn.com/b/oldnewthing/archive/2004/03/24/95235.aspx|title=Some files come up strange in Notepad - The Old New Thing|date=March 24, 2004|first=Raymond|last=Chen|website=blogs.msdn.com}}</ref>
 
=== Windows CE ===
 
== UTF-8 ==
Microsoft Windows has a code page designated for [[UTF-8]], [[code page 65001]]. Prior to Windows 10 insider build 17035 (November 2017),<ref>{{cite web|title=Windows10 Insider Preview Build 17035 Supports UTF-8 as ANSI|url=https://news.ycombinator.com/item?id=15710685|website=Hacker News|accessdate=7 May 2018}}</ref> it was impossible to set the locale code page to 65001. As many libraries, including the standard C and C++ libraries, only allow access to files through the 'A' API, software written against such a portable API could not open all Unicode-named files. This left the code page only available for:
 
* Explicit conversion functions such as MultiByteToWideChar
* A manual "chcp" command that only changes the code page for the current program's context. This is used for [[conhost.exe]] windows running [[Windows Subsystem for Linux]].
 
Since insider build 17035 and the April 2018 update (nominal build 17134) for Windows 10, a "Beta: Use Unicode UTF-8 for worldwide language support" checkbox is available for setting the locale code page to UTF-8.{{efn|1=Found under Control Panel, "Region" entry, "Administrative" tab, "Change system locale" button.}} However, this option can break legacy applications as they internally call old "[[DBCS]]" APIs which only support a maximum of 2 bytes in a character, such as IsDBCSLeadByte.
 
There are proposals to add an API to portable libraries such as [[Boost (C++ libraries)|Boost]] to do the necessary conversion, by adding new functions for opening and renaming files. These functions would pass filenames through unchanged on Unix, but translate them to UTF-16 on Windows.<ref>{{cite web|url=http://cppcms.com/files/nowide/html/|title=Boost.Nowide}}</ref>
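
A wrapper of this kind could look roughly like the following sketch. It is not taken from Boost.Nowide: <code>fopen_utf8</code> and <code>utf8_to_wide</code> are hypothetical names, and the Windows branch assumes the Win32 <code>MultiByteToWideChar</code> function and the Microsoft C runtime's <code>_wfopen</code>. On other systems the file name is passed through unchanged.

```c
#include <stdio.h>

#ifdef _WIN32
#include <windows.h>
#include <stdlib.h>
#include <wchar.h>

/* Convert a NUL-terminated UTF-8 string to a newly allocated UTF-16
   string. Returns NULL on invalid input or allocation failure. */
static wchar_t *utf8_to_wide(const char *s) {
    int n = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, s, -1, NULL, 0);
    if (n <= 0) return NULL;
    wchar_t *w = malloc((size_t)n * sizeof *w);
    if (w != NULL &&
        MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, s, -1, w, n) <= 0) {
        free(w);
        w = NULL;
    }
    return w;
}
#endif

/* Hypothetical portable entry point: the caller always supplies UTF-8. */
FILE *fopen_utf8(const char *path, const char *mode) {
#ifdef _WIN32
    /* Windows: translate to UTF-16 and call the wide-character runtime. */
    FILE *f = NULL;
    wchar_t *wpath = utf8_to_wide(path);
    wchar_t *wmode = utf8_to_wide(mode);
    if (wpath != NULL && wmode != NULL) f = _wfopen(wpath, wmode);
    free(wpath);
    free(wmode);
    return f;
#else
    /* Unix-like systems: file names are byte strings, pass them through. */
    return fopen(path, mode);
#endif
}
```

The point of the design is that calling code never sees the 'A'/'W' distinction: it keeps one UTF-8 string type everywhere, and the conversion happens only at the boundary to the operating system.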
 
Many applications have to support UTF-8 because it is the most-used Unicode encoding scheme in various [[network protocol]]s, including the [[Internet Protocol Suite]]. An application which has to pass UTF-8 to or from a 'W' [[Windows API]] should call the functions [[MultiByteToWideChar]] and WideCharToMultiByte.<ref>{{cite web|url=https://stackoverflow.com/questions/166503/utf-8-in-windows|title=UTF-8 in Windows|publisher=[[Stack Overflow]]|accessdate=July 1, 2011}}</ref> To get predictable handling of errors and surrogate halves, it is common for software to implement its own versions of these functions.
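
As an illustration of what such a hand-written replacement might involve, the following is a minimal sketch of a strict UTF-8 to UTF-16 converter in portable C. The name <code>utf8_to_utf16</code> and the <code>UTF8_INVALID</code> return value are assumptions made for this example, not part of any Windows API. The sketch rejects truncated sequences, overlong forms, encoded surrogate halves and values above U+10FFFF rather than substituting a replacement character, and for brevity it reports an undersized output buffer with the same error value.

```c
#include <stddef.h>
#include <stdint.h>

#define UTF8_INVALID ((size_t)-1)   /* hypothetical error return value */

/* Convert len bytes of UTF-8 into at most cap UTF-16 code units.
   Returns the number of code units written, or UTF8_INVALID on
   malformed input or if dst is too small. */
size_t utf8_to_utf16(const unsigned char *src, size_t len,
                     uint16_t *dst, size_t cap)
{
    size_t i = 0, out = 0;
    while (i < len) {
        uint32_t cp;
        size_t need;                      /* continuation bytes expected */
        unsigned char b = src[i++];

        if (b < 0x80)                    { cp = b;        need = 0; }
        else if (b >= 0xC2 && b <= 0xDF) { cp = b & 0x1F; need = 1; }
        else if (b >= 0xE0 && b <= 0xEF) { cp = b & 0x0F; need = 2; }
        else if (b >= 0xF0 && b <= 0xF4) { cp = b & 0x07; need = 3; }
        else return UTF8_INVALID;  /* stray continuation, 0xC0/0xC1, > 0xF4 */

        if (need > len - i) return UTF8_INVALID;          /* truncated */
        for (size_t k = 0; k < need; k++) {
            unsigned char c = src[i++];
            if ((c & 0xC0) != 0x80) return UTF8_INVALID;
            cp = (cp << 6) | (c & 0x3F);
        }
        /* Reject overlong forms, surrogate halves and values > U+10FFFF. */
        if ((need == 2 && cp < 0x800) ||
            (need == 3 && (cp < 0x10000 || cp > 0x10FFFF)) ||
            (cp >= 0xD800 && cp <= 0xDFFF))
            return UTF8_INVALID;

        if (cp < 0x10000) {
            if (out + 1 > cap) return UTF8_INVALID;
            dst[out++] = (uint16_t)cp;
        } else {
            /* Code points above the BMP become a surrogate pair. */
            if (out + 2 > cap) return UTF8_INVALID;
            cp -= 0x10000;
            dst[out++] = (uint16_t)(0xD800 | (cp >> 10));
            dst[out++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        }
    }
    return out;
}
```

Failing on the first malformed byte is only one possible policy. Software that needs to round-trip arbitrary file names or tolerate damaged input may prefer a different one, which is part of why such functions are often rewritten.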
 
==Notes==
{{notefoot}}
 
== References ==