Unicode in Microsoft Windows: Difference between revisions

m added comma after date
Removed references to a boost library that is off topic, and corrected the reading of "explicitly" in MS announcement as "exclusively"
Tag: references removed
Line 1:
{{more citations needed|date=June 2011}}
[[Microsoft]] was one of the first companies to implement [[Unicode]] in its products. [[Windows NT]] was the first operating system that used "wide characters" in [[system call]]s. These initially used the [[UCS-2]] encoding scheme, which was upgraded to [[UTF-16]] starting with [[Windows 2000]], allowing representation of the additional planes with surrogate pairs. Nevertheless, Microsoft did not support [[UTF-8]] until 2017. In May 2019, Microsoft reversed course and began recommending the use of UTF-8.<ref name="Microsoft-UTF-8" /> In [[Windows 11]], some system files are required to use UTF-8.<ref>{{Cite web|last=themar-msft|title=Customize the Windows 11 Start menu|url=https://docs.microsoft.com/en-us/windows-hardware/customize/desktop/customize-the-windows-11-start-menu|access-date=2021-06-29|website=docs.microsoft.com|language=en-us|quote=Make sure your LayoutModification.json uses UTF-8 encoding.}}</ref>
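
As an illustration of what surrogate pairs add over UCS-2, the following minimal C sketch encodes a single code point as UTF-16. The function name <code>utf16_encode</code> is hypothetical and is not part of any Windows API; the arithmetic follows the standard UTF-16 definition.

```c
#include <stdint.h>

/* Hypothetical helper (not a Windows API): encode one Unicode code point
 * as UTF-16. Returns the number of 16-bit units written (1 or 2),
 * or 0 if the value is not an encodable code point. */
int utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10FFFF)
        return 0;                       /* beyond the Unicode range */
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                       /* reserved for surrogates */
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;          /* Basic Multilingual Plane: one unit, same as UCS-2 */
        return 1;
    }
    cp -= 0x10000;                      /* 20 bits remain for the additional planes */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));    /* high surrogate: top 10 bits */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* low surrogate: bottom 10 bits */
    return 2;
}
```

A code point in the Basic Multilingual Plane, such as U+20AC, yields one unit, while a code point in an additional plane, such as U+1F600, yields the pair 0xD83D, 0xDE00, which a UCS-2-only system cannot represent as a single character.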
 
== In various Windows families ==
Line 12:
 
Earlier, and independent of the "UNICODE" switch, Windows also provided the Multibyte Character Sets (MBCS) API switch.<ref>{{cite web|title=Support for Multibyte Character Sets (MBCSs)|url=https://docs.microsoft.com/en-us/cpp/text/support-for-multibyte-character-sets-mbcss?view=vs-2019|access-date=2020-06-15|language=en}}</ref> This switch replaces some functions that do not work with MBCS, such as <code>strrev</code>, with MBCS-aware ones, such as <code>_mbsrev</code>.<ref>{{cite web|title=Double-byte Character Sets|url=https://docs.microsoft.com/en-us/windows/win32/intl/double-byte-character-sets|website=MSDN|access-date=2020-06-15|date=2018-05-31|quote=our applications use DBCS Windows code pages with the "A" versions of Windows functions.}}</ref><ref>[https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/strrev-wcsrev-mbsrev-mbsrev-l _strrev, _wcsrev, _mbsrev, _mbsrev_l] Microsoft Docs</ref>
 
Microsoft documentation uses the term "Unicode" to mean "not 8-bit encoding".{{citation needed|date=June 2020}}
 
=== Windows CE ===
Line 28 ⟶ 26:
Microsoft said that a UTF-8 locale might break ''some'' functions as they were written to assume multibyte encodings used no more than 2 bytes per character, thus code pages with more bytes such as UTF-8 (and also [[GB 18030]], cp54936) could not be set as the locale.<ref>[https://social.msdn.microsoft.com/Forums/vstudio/en-US/99f4b004-90d5-4519-b2c4-90aa6e7f128d/setlocale-problem-with-code-page-65001?forum=vclanguage MSDN forums]</ref>
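
To make the two-byte assumption concrete, the sketch below computes the length of a UTF-8 sequence from its first byte. The function name <code>utf8_seq_len</code> is hypothetical and chosen only for illustration; the byte ranges follow the standard UTF-8 definition.

```c
/* Hypothetical helper: length in bytes of the UTF-8 sequence that starts
 * with the given lead byte, or 0 if the byte cannot start a sequence. */
int utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80)
        return 1;                       /* ASCII */
    if (lead >= 0xC2 && lead <= 0xDF)
        return 2;                       /* U+0080 to U+07FF */
    if (lead >= 0xE0 && lead <= 0xEF)
        return 3;                       /* U+0800 to U+FFFF */
    if (lead >= 0xF0 && lead <= 0xF4)
        return 4;                       /* U+10000 to U+10FFFF */
    return 0;                           /* continuation byte or invalid lead byte */
}
```

Code written for double-byte code pages typically only distinguishes a lead byte from a trail byte, so the three-byte and four-byte cases above are exactly the ones such code would mishandle.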
 
On all modern non-Windows platforms, the file-name string passed to <code>fopen</code> is effectively UTF-8. This produces an incompatibility between other platforms and Windows. The normal work-around is to add Windows-specific code to convert UTF-8 to UTF-16 using [[MultiByteToWideChar]] and call the "wide" function instead of <code>fopen</code>.<ref>{{cite web|url=https://stackoverflow.com/questions/166503/utf-8-in-windows|title=UTF-8 in Windows|publisher=[[Stack Overflow]]|access-date=July 1, 2011}}</ref> Another popular work-around is to convert the name to the [[8.3 filename]] equivalent; this is necessary if the <code>fopen</code> call is inside a library function that takes a string filename, so that calling another function is not possible. There were also proposals to add new APIs to portable libraries such as [[Boost (C++ libraries)|Boost]] to do the necessary conversion, by adding new functions for opening and renaming files. These functions would pass filenames through unchanged on Unix, but translate them to UTF-16 on Windows. Such a library, Boost.Nowide,<ref>{{cite web|url=https://github.com/boostorg/nowide|title=Boost.Nowide}}</ref> was accepted into Boost<ref>{{cite web|url=https://lists.boost.org/boost-announce/2017/06/0516.php|title=Boost mailing list}}</ref> and became part of the 1.73 release. This allows code to be "portable", but requires just as many code changes as calling the wide functions.
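
The conversion work-around described above can be sketched roughly as follows. The names <code>fopen_utf8</code> and <code>utf8_to_wide</code> are hypothetical helpers introduced here for illustration; <code>MultiByteToWideChar</code> and <code>_wfopen</code> are the existing Windows and C runtime functions. On other platforms the sketch assumes the narrow string can be passed to <code>fopen</code> unchanged.

```c
#include <stdio.h>

#ifdef _WIN32
#include <windows.h>
#include <stdlib.h>

/* Hypothetical helper: convert a NUL-terminated UTF-8 string to a newly
 * allocated UTF-16 string. Returns NULL on failure; the caller frees it. */
static wchar_t *utf8_to_wide(const char *s)
{
    /* First call asks for the required length, including the terminator. */
    int n = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, s, -1, NULL, 0);
    if (n <= 0)
        return NULL;
    wchar_t *w = malloc((size_t)n * sizeof *w);
    if (w != NULL &&
        MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, s, -1, w, n) <= 0) {
        free(w);
        w = NULL;
    }
    return w;
}
#endif

/* Hypothetical wrapper: open a file whose name is given in UTF-8. */
FILE *fopen_utf8(const char *path, const char *mode)
{
#ifdef _WIN32
    wchar_t *wpath = utf8_to_wide(path);
    wchar_t *wmode = utf8_to_wide(mode);
    FILE *f = NULL;
    if (wpath != NULL && wmode != NULL)
        f = _wfopen(wpath, wmode);      /* "wide" counterpart of fopen */
    free(wpath);
    free(wmode);
    return f;
#else
    /* Assumption: the platform treats narrow file names as UTF-8. */
    return fopen(path, mode);
#endif
}
```

Callers then use <code>fopen_utf8</code> wherever they would have called <code>fopen</code>, which shows why this approach requires touching every call site.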
 
In April 2018, with insider build 17035 (nominal build 17134) for Windows 10, a "Beta: Use Unicode UTF-8 for worldwide language support" checkbox appeared for setting the locale code page to UTF-8.{{efn|1=Found under control panel, "Region" entry, "Administrative" tab, "Change system locale" button.}} This allows for calling "narrow" functions, including <code>fopen</code> and <code>SetWindowTextA</code>, with UTF-8 strings. In May 2019, Microsoft added the ability for a program to set the code page to UTF-8 itself, and started recommending that software do this.<ref name="Microsoft-UTF-8">{{Cite web|title=Use the Windows UTF-8 code page - UWP applications|url=https://docs.microsoft.com/en-us/windows/uwp/design/globalizing/use-utf8-code-page|access-date=2020-06-06|quote=As of Windows Version 1903 (May 2019 Update), you can use the ActiveCodePage property in the appxmanifest for packaged apps, or the fusion manifest for unpackaged apps, to force a process to use UTF-8 as the process code page. [..] <code>CP_ACP</code> equates to <code>CP_UTF8</code> only if running on Windows Version 1903 (May 2019 Update) or above and the ActiveCodePage property described above is set to UTF-8. Otherwise, it honors the legacy system code page. We recommend using <code>CP_UTF8</code> explicitly.|website=docs.microsoft.com|language=en-us}}</ref>
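
A program can check at run time whether either mechanism is in effect by comparing the active code page with <code>CP_UTF8</code>. The following is a minimal sketch; the function name <code>narrow_strings_are_utf8</code> is hypothetical, while <code>GetACP</code> and <code>CP_UTF8</code> are the existing Windows API names.

```c
#ifdef _WIN32
#include <windows.h>
#endif

/* Hypothetical helper: returns 1 if strings passed to "narrow" ("A")
 * Windows functions are expected to be interpreted as UTF-8, else 0. */
int narrow_strings_are_utf8(void)
{
#ifdef _WIN32
    /* True when the system locale checkbox or the ActiveCodePage
     * manifest property has set the process code page to UTF-8. */
    return GetACP() == CP_UTF8;
#else
    /* Assumption: other platforms already treat narrow strings as UTF-8. */
    return 1;
#endif
}
```

Code that must also run on older Windows versions could use such a check to decide between calling the narrow functions directly and converting to UTF-16 first.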
 
=== Programming platforms ===