Revision as of 02:50, 13 April 2018 by Spitzak (→UTF-8: No, that DOES NOT WORK for UTF-8!!!!! Please read the previous sentence.)
Revision as of 14:45, 7 May 2018 by Artoria2e5 (The 'M' API set is not a thing, so rewrite according to MBCS docs (It's not Unicode, but it still explains the old UTF-8 rejection). Rewrote first paragraph of UTF-8, since Windows 10 1803 apparently has that option now. Also, chcp does work for UTF-8 even before Nov 2017; it was there when WSL came out. Just get a copy of Windows 10 ffs.)
Line 5:
=== Windows NT based systems ===
Modern Windows versions like [[Windows XP]] and [[Windows Server 2003]], and prior to them [[Windows NT]] (3.x, 4.0) and Windows 2000, are shipped with [[Windows API|system libraries]] which support string [[character encoding|encoding]] of two types: UTF-16 (often called "Unicode" in Windows documentation) and a local (sometimes multibyte) encoding called the "[[Windows code page|code page]]" (or incorrectly referred to as the ''ANSI code page''). 16-bit functions have names suffixed with -W (from [[wide character|"wide"]]), for example, lstrlenW(). Code page oriented functions use the suffix -A, e.g., lstrlenA(), for "ANSI". This split was necessary because many languages, including C, did not provide a clean way to pass both 8-bit and 16-bit strings to the same function. For the C/C++ languages, however, Windows uses [[C preprocessor]] macros to define an unsuffixed "generic" version that switches between the 'A' and 'W' versions depending on a <code>UNICODE</code> macro.<ref>{{cite web|title=Unicode in the Windows API|url=https://msdn.microsoft.com/en-us/library/windows/desktop/dd374089%28v=vs.85%29.aspx|accessdate=7 May 2018}}</ref><ref>{{cite web|title=Conventions for Function Prototypes (Windows)|url=https://msdn.microsoft.com/en-us/library/windows/desktop/dd317766(v=vs.85).aspx|website=MSDN|accessdate=7 May 2018|language=en}}</ref> Most such 'A' functions are implemented as a [[Wrapper function|wrapper]] that translates the code page to UTF-16 and calls the 'W' function.
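The wrapper pattern described above can be illustrated with a minimal Python sketch. All names here (<code>set_title_w</code>, <code>set_title_a</code>, <code>set_title</code>, and the module-level <code>UNICODE</code> flag) are hypothetical stand-ins and not real Windows functions; the code page names <code>cp1252</code> and <code>cp932</code> are Python codec names chosen for illustration. In C the choice between the two versions is made at compile time by the preprocessor, whereas this sketch makes it at run time.

```python
# Hypothetical 'W'/'A' function pair, loosely modelled on the suffix
# convention. These are illustrative stand-ins, not real Windows APIs.
UNICODE = True  # plays the role of the C preprocessor macro of the same name


def set_title_w(utf16le: bytes) -> bytes:
    """'W' flavour: accepts UTF-16 (little-endian) data directly."""
    return utf16le


def set_title_a(data: bytes, code_page: str = "cp1252") -> bytes:
    """'A' flavour: a thin wrapper that translates code page bytes
    to UTF-16 and then forwards to the 'W' function."""
    return set_title_w(data.decode(code_page).encode("utf-16-le"))


def set_title(text: str) -> bytes:
    """Unsuffixed "generic" name: selects 'W' or 'A' based on UNICODE."""
    if UNICODE:
        return set_title_w(text.encode("utf-16-le"))
    return set_title_a(text.encode("cp1252"))
```

Under these assumptions, both entry points hand the same UTF-16 data to the 'W' implementation, whether the caller starts from a single-byte code page (<code>cp1252</code>) or a multibyte one (<code>cp932</code>).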
Independent of the "UNICODE" switch, Windows also provides the "MBCS" API switch.<ref>{{cite web|title=Support for Multibyte Character Sets (MBCSs)|url=https://msdn.microsoft.com/en-us/library/5z097dxa.aspx|language=en}}</ref> This switch turns on some C functions prefixed with <code>_mbs</code>, and selects the 'A' functions for the current locale.<ref>{{cite web|title=Double-byte Character Sets|url=https://msdn.microsoft.com/en-us/library/windows/desktop/dd317794(v=vs.85).aspx|website=MSDN|accessdate=7 May 2018|quote=Your applications use DBCS Windows code pages with the "A" versions of Windows functions.}}</ref>

The <code>IsTextUnicode</code> function uses a [[heuristic algorithm]] on a [[byte string]] passed to it to detect whether this string represents UTF-16 text. For very short texts, this function, used by some applications like [[Microsoft Notepad|Notepad]], often gives incorrect results. This gave rise to legends about the existence of [[Easter egg (computing)|"Easter eggs"]] like [[Bush hid the facts]].<ref>{{cite web|url=http://blogs.msdn.com/b/oldnewthing/archive/2004/03/24/95235.aspx|title=Some files come up strange in Notepad - The Old New Thing|date=March 24, 2004|first=Raymond|last=Chen|website=blogs.msdn.com}}</ref>

=== Windows CE ===
Line 18 ⟶ 20:
== UTF-8 ==
Microsoft Windows has a code page designated for [[UTF-8]], [[code page 65001]].
Prior to Windows 10 insider build 17035 (November 2017),<ref>{{cite web|title=Windows10 Insider Preview Build 17035 Supports UTF-8 as ANSI|url=https://news.ycombinator.com/item?id=15710685|website=Hacker News|accessdate=7 May 2018}}</ref> it was impossible to set the locale code page to 65001, leaving this code page available only for:
* Explicit conversion functions such as MultiByteToWideChar
* A manual "chcp" command that only changes the code page for the current program's context. This is used for [[conhost.exe]] windows running [[Windows Subsystem for Linux]].
Since insider build 17035 and the April 2018 update (nominal build 17134) for Windows 10, a "Beta: Use Unicode UTF-8 for worldwide language support" checkbox is available for setting the locale code page to UTF-8.{{efn|1=Found under Control Panel, "Region" entry, "Administrative" tab, "Change system locale" button.}} However, this option can break legacy applications, as they internally call old "[[DBCS]]" APIs such as IsDBCSLeadByte, which only support a maximum of 2 bytes per character.

Because many libraries, including the standard C and C++ libraries, only allow access to files through the code page oriented API, it was not possible to open all Unicode-named files with them while the locale code page could not be set to UTF-8. There are proposals to add an API to portable libraries such as [[Boost (C++ libraries)|Boost]] to do the necessary conversion, by adding new functions for opening and renaming files.
These functions would pass filenames through unchanged on Unix, but translate them to UTF-16 on Windows.<ref>{{cite web|url=http://cppcms.com/files/nowide/html/|title=Boost.Nowide}}</ref>

Many applications inherently have to support UTF-8 because it is the most-used Unicode encoding scheme in various [[network protocol]]s, including the [[Internet Protocol Suite]]. An application which has to pass UTF-8 to or from a 'W' [[Windows API]] should call the functions [[MultiByteToWideChar]] and WideCharToMultiByte.<ref>{{cite web|url=https://stackoverflow.com/questions/166503/utf-8-in-windows|title=UTF-8 in Windows|publisher=[[Stack Overflow]]|accessdate=July 1, 2011}}</ref> To get predictable handling of errors and surrogate halves, it is more common for software to implement its own versions of these functions.

== Notes ==
{{notefoot}}

== References ==

Unicode in Microsoft Windows: Difference between revisions