Unicode in Microsoft Windows: Difference between revisions

In May 2019, Microsoft added the ability for a program to set the code page to UTF-8 itself,<ref name="Microsoft-UTF-8">{{cite web|title=Use UTF-8 code pages in Windows apps|url=https://learn.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page |access-date=2020-06-06 |quote=As of Windows version&nbsp;1903 (May&nbsp;2019 update), you can use the ActiveCodePage property in the appxmanifest for packaged apps, or the fusion manifest for unpackaged apps, to force a process to use UTF-8 as the process code page. [...] <code>CP_ACP</code> equates to <code>CP_UTF8</code> only if running on Windows version&nbsp;1903 (May&nbsp;2019 update) or above and the ActiveCodePage property described above is set to UTF-8. Otherwise, it honors the legacy system code page. We recommend using <code>CP_UTF8</code> explicitly. |website=learn.microsoft.com |language=en-us}}</ref><ref>{{cite web|url=https://skanthak.homepage.t-online.de/quirks.html#quirk31|title=Windows 10 1903 and later versions finally support UTF-8 with the A forms of the Win32 functions}}</ref> allowing programs written to use UTF-8 to be run by non-expert users.
 
{{As of|2019}}, Microsoft recommends that programmers use UTF-8 (e.g. instead of any other 8-bit encoding),<ref name="Microsoft-UTF-8" /> on Windows and [[Xbox]], and may be recommending its use instead of UTF-16 as well, stating that "UTF-8 is the universal code page for internationalization [and] UTF-16 [..] is a unique burden that Windows places on code that targets multiple platforms."<ref name="Microsoft GDK">{{Cite web |title=UTF-8 support in the Microsoft Game Development Kit (GDK) - Microsoft Game Development Kit |url=https://learn.microsoft.com/en-us/gaming/gdk/_content/gc/system/overviews/utf-8 |access-date=2023-03-05 |website=learn.microsoft.com |date=19 August 2022 |language=en-us |quote=By operating in UTF-8, you can ensure maximum compatibility [..] Windows operates natively in UTF-16 (or WCHAR), which requires code page conversions by using MultiByteToWideChar and WideCharToMultiByte. This is a unique burden that Windows places on code that targets multiple platforms. [..] The Microsoft Game Development Kit (GDK) and Windows in general are moving forward to support UTF-8 to remove this unique burden of Windows on code targeting or interchanging with multiple platforms and the web. Also, this results in fewer internationalization issues in apps and games and reduces the test matrix that's required to get it right.}}</ref> Microsoft appears to be transitioning to UTF-8, stating that it previously emphasized the alternative, UTF-16, and in [[Windows 11]] some system files are required to use UTF-8 and do not require a byte order mark.<ref>{{Cite web|title=Customize the Windows 11 Start menu|url=https://docs.microsoft.com/en-us/windows-hardware/customize/desktop/customize-the-windows-11-start-menu|access-date=2021-06-29|website=docs.microsoft.com|language=en-us|quote=Make sure your LayoutModification.json uses UTF-8 encoding.}}</ref> Notepad can now recognize UTF-8 without the byte order mark, and can be told to write UTF-8 without a byte order mark.{{cn|date=November 2022}} Some other Microsoft products use UTF-8 internally, including Visual Studio<ref>{{cite web|title=New Options for Managing Character Sets in the Microsoft C/C++ Compiler|url=https://devblogs.microsoft.com/cppblog/new-options-for-managing-character-sets-in-the-microsoft-cc-compiler/#how-the-microsoft-c/c++-compiler-reads-text-from-a-file|website=devblogs.microsoft.com| date=22 February 2016 |language=en |quote=At some point in the past, the Microsoft compiler was changed to use UTF-8 internally. So, as files are read from disk, they are converted into UTF-8 on the fly.}}</ref><ref>{{ cite web | title=validate-charset (validate for compatible characters) | website=docs.microsoft.com |language=en-us | url=https://docs.microsoft.com/en-us/cpp/build/reference/validate-charset-validate-for-compatible-characters | access-date=2021-07-19 | quote=Visual Studio uses UTF-8 as the internal character encoding during conversion between the source character set and the execution character set.}}</ref> and [[SQL Server 2019]], with Microsoft claiming an average 35% performance improvement over UTF-16 for intensive read/write I/O on ASCII-range data, and a "nearly 50% reduction in storage requirements."<ref>{{Cite web|date=2019-07-02|title=Introducing UTF-8 support for SQL Server|url=https://techcommunity.microsoft.com/t5/sql-server/introducing-utf-8-support-for-sql-server/ba-p/734928|quote=For example, changing an existing column data type from NCHAR(10) to CHAR(10) using an UTF-8 enabled collation, translates into nearly 50% reduction in storage requirements. [..] In the ASCII range, when doing intensive read/write I/O on UTF-8<!-- " " in quote, but ok to strip-->, we measured an average 35% performance improvement over UTF-16 using clustered tables with a non-clustered index on the string column, and an average 11% performance improvement over UTF-16 using a heap. |access-date=2021-08-24|website=techcommunity.microsoft.com|language=en}}</ref>
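The storage figure quoted for SQL Server follows from simple arithmetic for ASCII-range text: each such character takes one byte in UTF-8 and one 2-byte code unit in UTF-16. The following is a minimal C++ sketch of that arithmetic only; the helper names <code>utf8_bytes_ascii</code> and <code>utf16_bytes_ascii</code> are hypothetical, and it does not model how SQL Server actually stores columns.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: bytes needed to store ASCII-only text as UTF-8.
// Every ASCII character is exactly one byte in UTF-8.
std::size_t utf8_bytes_ascii(const std::string& ascii) {
    return ascii.size();
}

// Hypothetical helper: bytes needed to store the same ASCII-only text as UTF-16.
// Every ASCII character is exactly one 16-bit code unit in UTF-16.
std::size_t utf16_bytes_ascii(const std::string& ascii) {
    return ascii.size() * sizeof(char16_t);
}
```

For a ten-character ASCII value the two helpers differ by a factor of two, which is where a figure close to 50% comes from. The ratio does not hold for all text: many non-Latin characters take three bytes in UTF-8 but only two in UTF-16.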
 
=== String constants in Visual Studio ===
Around 2012, Microsoft's compilers could not produce UTF-8 string constants from UTF-8 source files, because they converted all strings to the locale code page (which could not be UTF-8). At one time the only method to work around this was to turn ''off'' {{tt|UNICODE}}, and ''not'' mark the input file as being UTF-8 (i.e. do not use a [[UTF-8#Byte order mark|BOM]]).<ref>[http://utf8everywhere.org/#faq.literal UTF-8 Everywhere FAQ: How do I write UTF-8 string literal in my C++ code?]{{Obsolete source|reason=This source seems more than 10 years old. (Citing Unicode 6.2). Can its contents be verified?|date=July 2024}}</ref> This would make the compiler think both the input and the output were in the same single-byte locale, and leave the strings unmodified. On modern systems setting the code page to UTF-8 helps, but there are still problems using {{code|\x}} to get individual bytes into the UTF-8.{{Citation needed|date=July 2024|reason=Not sure if this is true}}
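As an illustration of what a correct UTF-8 string constant contains, the following minimal C++ sketch spells the bytes out with {{code|\x}} escapes in an ordinary narrow string literal, so that the result should not depend on how the compiler decodes the source file. The helper names <code>e_acute_utf8</code> and <code>byte_at</code> are hypothetical, and the sketch makes no claim about how any particular compiler version treats {{code|\x}} escapes inside {{code|u8}} literals.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns U+00E9 (e with acute accent) as UTF-8.
// The two bytes 0xC3 0xA9 are written as \x escapes, so the source file
// itself stays pure ASCII and its encoding does not matter.
std::string e_acute_utf8() {
    return "\xC3\xA9";
}

// Hypothetical helper: reads one byte of a string as an unsigned value,
// avoiding surprises on platforms where plain char is signed.
unsigned int byte_at(const std::string& s, std::size_t i) {
    return static_cast<unsigned char>(s[i]);
}
```

Checking that a literal written directly in the source (for example a string containing a non-ASCII letter) has the same bytes as such an escaped constant is one way to find out whether a given compiler configuration has altered it.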
 
== See also ==