{{Short description|Overview of Unicode implementation in Microsoft Windows}}
{{more citations needed|date=June 2011}}
[[Microsoft]] was one of the first companies to implement [[Unicode]] in its products. [[Windows NT]] was the first operating system that used "wide characters" in [[system call]]s. It initially used the (now obsolete) [[UCS-2]] encoding scheme and was upgraded to the [[variable-width encoding]] [[UTF-16]] starting with [[Windows 2000]], allowing the representation of additional planes with surrogate pairs. However, Microsoft did not support [[UTF-8]] in its API until May 2019.
Before 2019, Microsoft emphasized UTF-16 (i.e. the -W API), but it has since recommended using [[UTF-8]] (at least in some cases)<ref name="Microsoft-UTF-8" /> on Windows and [[Xbox]] (and in others of its products). It even states that "UTF-8 is the universal code page for internationalization [and] UTF-16 [... is] a unique burden that Windows places on code that targets multiple platforms. [...] Windows [is] moving forward to support UTF-8 to remove this unique burden [resulting] in fewer internationalization issues in apps and games".<ref name="Microsoft GDK" />
A large amount of Microsoft documentation uses the word "Unicode" to refer explicitly to the UTF-16 encoding. Anything else, including UTF-8, is not "Unicode" in Microsoft's outdated terminology, although UTF-8 and UTF-16 are both encodings ("transformation formats") of Unicode according to [[Unicode|the Unicode Standard]].
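As a rough illustration of the surrogate-pair mechanism and of UTF-8 and UTF-16 both being encodings of the same code points, the following Python sketch encodes one character inside the Basic Multilingual Plane and one outside it. The helper names <code>utf16_code_units</code> and <code>to_surrogate_pair</code> are hypothetical and chosen only for this example; they are not part of any Windows API.

```python
from __future__ import annotations


def utf16_code_units(text: str) -> list[int]:
    """Return the UTF-16 code units of `text` as integers."""
    data = text.encode("utf-16-le")  # little-endian, no byte order mark
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a supplementary-plane code point into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


bmp_char = "\u00e9"         # U+00E9, fits in a single 16-bit unit (also valid UCS-2)
astral_char = "\U0001F600"  # U+1F600, outside the BMP, needs a surrogate pair

print([hex(u) for u in utf16_code_units(bmp_char)])     # ['0xe9']
print([hex(u) for u in utf16_code_units(astral_char)])  # ['0xd83d', '0xde00']
print([hex(u) for u in to_surrogate_pair(0x1F600)])     # ['0xd83d', '0xde00']

# The same code point in UTF-8 is a four-byte sequence.
print(astral_char.encode("utf-8").hex(" "))             # f0 9f 98 80
```

A UCS-2 consumer would see the two surrogate code units as two separate, unassigned 16-bit values, which is why supplementary planes only became representable with the move to UTF-16.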
== In various Windows families ==
=== Windows NT based systems ===
All current Windows versions, back through [[Windows XP]] and the earlier [[Windows NT]] releases (3.x, 4.0), ship with [[Windows API|system libraries]] that support two types of string [[character encoding|encoding]]: 16-bit "Unicode" ([[UTF-16]] since [[Windows 2000]]) and a (sometimes multibyte) encoding called the "[[Windows code page|code page]]" (sometimes incorrectly referred to as the ''[[American National Standards Institute|ANSI]] code page''). 16-bit functions have names suffixed with 'W' (from [[wide character|"wide"]]), such as <code>SetWindowTextW</code>. Code-page-oriented functions use the suffix 'A' for "ANSI", such as <code>SetWindowTextA</code> (some other conventions were used for APIs that were copied from other systems, such as <code>_wfopen/fopen</code> or <code>wcslen/strlen</code>). This split was necessary because many languages, including [[C (programming language)|C]], did not provide a clean way to pass both 8-bit and 16-bit strings to the same function.
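The practical difference between the two string types can be sketched in Python, whose standard codecs include several Windows code pages. This is only an illustration of the encodings themselves, not of the Windows API: the choice of code pages 1252, 1251 and 932 is an assumption made for the example, and no 'A' or 'W' function is actually called.

```python
# One 8-bit value, as a code-page-oriented ('A') function would receive it.
raw = bytes([0xE9])

# The character it stands for depends entirely on the active code page.
as_western = raw.decode("cp1252")   # Windows-1252 (Western European)
as_cyrillic = raw.decode("cp1251")  # Windows-1251 (Cyrillic)
print(as_western, as_cyrillic)

# Some code pages are multibyte: in code page 932 (Japanese) an ASCII
# letter takes one byte and a kanji takes two.
mixed = "A日"
narrow = mixed.encode("cp932")
print(len(narrow))  # 3

# The 16-bit form does not depend on any code page setting.
wide = mixed.encode("utf-16-le")
print(len(wide) // 2)  # 2 code units
```

The same byte decoding to two different characters is the reason text passed through the code-page interface is only meaningful together with knowledge of which code page was in effect.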
[[Microsoft]] attempted to support Unicode "portably" by providing a "UNICODE" switch to the compiler, which switches unsuffixed "generic" calls from the 'A' to the 'W' interface and converts all string constants to "wide" UTF-16 versions.<ref>{{cite web|title=Unicode in the Windows API|url=https://msdn.microsoft.com/en-us/library/dd374089.aspx|access-date=7 May 2018}}</ref><ref>{{cite web|title=Conventions for Function Prototypes (Windows)|url=https://msdn.microsoft.com/en-us/library/dd317766.aspx|website=MSDN|access-date=7 May 2018|language=en}}</ref> This does not actually work because it does not translate UTF-8 outside of string constants, so code that attempts to open files simply fails to compile.{{citation needed|date=October 2019}}
=== Windows CE ===
In the now-discontinued [[Windows CE]], UTF-16 was used almost exclusively, with the 'A' API mostly missing.<ref>{{cite web|title=Differences Between the Windows CE and Windows NT Implementations of TAPI|url=https://msdn.microsoft.com/en-us/library/aa454022.aspx|website=MSDN|date=28 August 2006 |access-date=7 May 2018|quote=Windows CE is Unicode-based. You might have to recompile source code that was written for a Windows NT-based application.}}</ref> A limited set of ANSI APIs is available in Windows CE 5.0, for use on a reduced set of locales that may be selectively built onto the runtime image.<ref>{{cite web|title=Code Pages (Windows CE 5.0)|url=https://docs.microsoft.com/en-us/previous-versions/windows/embedded/ms903783(v=msdn.10)|website=Microsoft Docs| date=14 September 2012 |access-date=7 May 2018|language=en-us}}</ref>
=== Windows 9x ===
{{Main article|Microsoft Layer for Unicode}}
In 2001, Microsoft released a special supplement for its older [[Windows 9x]] systems. It includes a dynamic link library, 'unicows.dll', (only 240
== UTF-8 ==
Microsoft Windows ([[Windows XP]] and later) has a code page designated for [[UTF-8]], code page 65001<ref>{{cite web|title=Code Page Identifiers (Windows)|url=https://msdn.microsoft.com/en-us/library/windows/desktop/dd317756(v=vs.85).aspx|website=msdn.microsoft.com| date=7 January 2021 |language=en}}</ref> or <code>CP_UTF8</code>. For a long time, it was impossible to set the locale code page to 65001, leaving this code page only available for
In April 2018 (or possibly November 2017<ref>{{cite web|title=Windows10 Insider Preview Build 17035 Supports UTF-8 as ANSI|url=https://news.ycombinator.com/item?id=15710685|website=Hacker News|access-date=7 May 2018}}</ref>), with insider build 17035 (nominal build 17134) for Windows 10, a "Beta: Use Unicode UTF-8 for worldwide language support" checkbox appeared for setting the locale code page to UTF-8.{{efn|1=Found under Control Panel, "Region" entry, "Administrative" tab, "Change system locale" button.}} This allows calling "narrow" functions, including <code>fopen</code> and <code>SetWindowTextA</code>, with UTF-8 strings. However, this is a system-wide setting, and a program cannot assume it is set.
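Why a program cannot simply assume the setting is enabled can be illustrated with a small Python sketch. It only simulates the decoding step and does not call any Windows function. Code page 1252 stands in for a legacy locale code page, and Python's <code>utf-8</code> codec stands in for code page 65001; both choices are assumptions made for the example.

```python
text = "café"
utf8_bytes = text.encode("utf-8")  # what a program would pass to a narrow function

# If the locale code page is a legacy one (1252 assumed here), the two
# bytes of "é" are read as two separate characters.
misread = utf8_bytes.decode("cp1252")
print(misread)  # cafÃ©

# If the locale code page is UTF-8, the bytes round-trip unchanged.
correct = utf8_bytes.decode("utf-8")
print(correct)  # café
```

A file name or window title passed this way would therefore come out garbled on any system where the checkbox is not ticked.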
In May 2019, Microsoft added the ability for a program to set the code page to UTF-8 itself,<ref name="Microsoft-UTF-8">{{
=== String constants in Visual Studio ===
Before 2019 Microsoft's compilers
== See also ==