Unicode in Microsoft Windows: Difference between revisions

Windows NT based systems: This seems really questionable, removed non-existent "lstrlenW" (the l indicates wide!). Not sure of name but "fopenW" would be a better example
 
=== Windows NT based systems ===
{{issues}}
Modern Windows versions like [[Windows XP]] and [[Windows Server 2003]], and prior to them [[Windows NT]] (3.x, 4.0) and Windows 2000, are shipped with [[Windows API|system libraries]] that support two types of string [[character encoding|encoding]]: UTF-16 (often called "Unicode" in Windows documentation) and a local (sometimes multibyte) encoding called the "[[Windows code page|code page]]" (or, incorrectly, the ''ANSI code page''). 16-bit functions have names suffixed with -W (from [[wide character|"wide"]]), for example, lstrlenW(). Code page oriented functions use the suffix -A, e.g., lstrlenA(), for "ANSI". This split was necessary because many languages, including C, did not provide a clean way to pass both 8-bit and 16-bit strings to the same function. For the C/C++ languages, however, Windows uses [[C preprocessor]] macros to define an unsuffixed "generic" version that switches between 'A' and 'W' depending on a <code>UNICODE</code> macro.<ref>{{cite web|title=Unicode in the Windows API|url=https://msdn.microsoft.com/en-us/library/windows/desktop/dd374089%28v=vs.85%29.aspx|accessdate=7 May 2018}}</ref><ref>{{cite web|title=Conventions for Function Prototypes (Windows)|url=https://msdn.microsoft.com/en-us/library/windows/desktop/dd317766(v=vs.85).aspx|website=MSDN|accessdate=7 May 2018|language=en}}</ref> Most such 'A' functions are implemented as a [[Wrapper function|wrapper]] that translates the text from the code page to UTF-16 and calls the 'W' function.
 
 
Microsoft attempted to support Unicode "portably" by providing a "UNICODE" switch to the compiler, which switches unsuffixed "generic" calls from the 'A' to the 'W' interface and converts string constants wrapped in the <code>TEXT</code> macro to "wide" UTF-16 versions.<ref>{{cite web|title=Unicode in the Windows API|url=https://msdn.microsoft.com/en-us/library/windows/desktop/dd374089%28v=vs.85%29.aspx|accessdate=7 May 2018}}</ref><ref>{{cite web|title=Conventions for Function Prototypes (Windows)|url=https://msdn.microsoft.com/en-us/library/windows/desktop/dd317766(v=vs.85).aspx|website=MSDN|accessdate=7 May 2018|language=en}}</ref>
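
The suffix convention and the compile-time switch can be illustrated with a minimal, portable C sketch. It does not use the Windows headers: the names <code>DemoLengthA</code>, <code>DemoLengthW</code>, <code>DemoLength</code>, <code>DEMO_TEXT</code> and <code>demo_wchar</code> are hypothetical stand-ins, and the 'A' wrapper assumes a Latin-1-like code page in which every byte maps to the code point of the same value. Real code pages need a proper conversion table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical 16-bit character type. In C11, u"..." literals are arrays of
   char16_t, which is the same type as uint_least16_t. */
typedef uint_least16_t demo_wchar;

/* 'W' variant: counts UTF-16 code units before the terminating zero. */
size_t DemoLengthW(const demo_wchar *s)
{
    assert(s != NULL);
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}

/* 'A' variant: a wrapper that widens the code page text to 16-bit units and
   then calls the 'W' variant. ASSUMPTION: a Latin-1-like code page, so each
   byte widens to the code unit of the same value. */
size_t DemoLengthA(const char *s)
{
    assert(s != NULL);
    size_t bytes = 0;
    while (s[bytes] != '\0')
        bytes++;

    demo_wchar *wide = malloc((bytes + 1) * sizeof *wide);
    if (wide == NULL)
        return 0; /* allocation failure is reported as length 0 in this sketch */

    for (size_t i = 0; i <= bytes; i++) /* <= also copies the terminator */
        wide[i] = (unsigned char)s[i];

    size_t n = DemoLengthW(wide);
    free(wide);
    return n;
}

/* The unsuffixed "generic" name and the literal macro are chosen at compile
   time, depending on whether UNICODE is defined. */
#ifdef UNICODE
#define DemoLength DemoLengthW
#define DEMO_TEXT(s) u##s
#else
#define DemoLength DemoLengthA
#define DEMO_TEXT(s) s
#endif
```

A call such as <code>DemoLength(DEMO_TEXT("generic"))</code> compiles against either variant without source changes, which is the portability the switch was meant to provide.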
 
Earlier, and independent of the "UNICODE" switch, Windows also provided the "MBCS" API switch.<ref>{{cite web|title=Support for Multibyte Character Sets (MBCSs)|url=https://msdn.microsoft.com/en-us/library/5z097dxa.aspx|language=en}}</ref> This switch turns on some C functions prefixed with <code>_mbs</code>, and selects the 'A' functions for the current locale.<ref>{{cite web|title=Double-byte Character Sets|url=https://msdn.microsoft.com/en-us/library/windows/desktop/dd317794(v=vs.85).aspx|website=MSDN|accessdate=7 May 2018|quote=our applications use DBCS Windows code pages with the "A" versions of Windows functions.}}</ref>
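
The reason multibyte text needs its own functions is that the byte count and the character count differ. The following sketch counts characters in a double-byte string, assuming the Shift JIS lead-byte ranges 0x81–0x9F and 0xE0–0xFC. The names <code>demo_is_lead_byte</code> and <code>demo_mbslen</code> are hypothetical and only approximate what a function such as <code>_mbslen</code> does; the real functions consult the active code page instead of hard-coded ranges.

```c
#include <assert.h>
#include <stddef.h>

/* ASSUMPTION: Shift JIS lead-byte ranges are hard-coded for illustration. */
static int demo_is_lead_byte(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Counts characters, not bytes: a lead byte and the byte after it form one
   character; any other byte is a character on its own. */
size_t demo_mbslen(const char *s)
{
    assert(s != NULL);
    const unsigned char *p = (const unsigned char *)s;
    size_t count = 0;

    while (*p != 0) {
        if (demo_is_lead_byte(*p) && p[1] != 0)
            p += 2; /* double-byte character */
        else
            p += 1; /* single-byte character, or a dangling lead byte */
        count++;
    }
    return count;
}
```

For the four bytes 93 FA 96 7B (日本 in Shift JIS), a byte-oriented length function reports 4, while <code>demo_mbslen</code> reports 2.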
 
The <code>IsTextUnicode</code> function uses a [[heuristic algorithm]] on a [[byte string]] passed to it to detect whether this string represents UTF-16 text. For very short texts, this function, used by some applications like [[Microsoft Notepad|Notepad]], often gives incorrect results. This gave rise to legends about the existence of [[Easter egg (computing)|"Easter eggs"]] like [[Bush hid the facts]].<ref>{{cite web|url=http://blogs.msdn.com/b/oldnewthing/archive/2004/03/24/95235.aspx|title=Some files come up strange in Notepad - The Old New Thing|date=March 24, 2004|first=Raymond|last=Chen|website=blogs.msdn.com}}</ref>
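
The ambiguity behind such misdetections can be shown with a short sketch. This is not the algorithm used by <code>IsTextUnicode</code>. It only reinterprets a byte string as little-endian UTF-16 and counts how many of the resulting code units fall in the CJK Unified Ideographs block (U+4E00–U+9FFF). The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reads the i-th little-endian 16-bit code unit from a byte buffer. */
uint16_t demo_utf16le_unit(const unsigned char *bytes, size_t i)
{
    assert(bytes != NULL);
    return (uint16_t)(bytes[2 * i] | (bytes[2 * i + 1] << 8));
}

/* Counts the code units in the CJK Unified Ideographs block when the buffer
   is read as UTF-16LE. A trailing odd byte is ignored. */
size_t demo_count_cjk_units(const unsigned char *bytes, size_t len)
{
    assert(bytes != NULL);
    size_t count = 0;
    for (size_t i = 0; i < len / 2; i++) {
        uint16_t unit = demo_utf16le_unit(bytes, i);
        if (unit >= 0x4E00 && unit <= 0x9FFF)
            count++;
    }
    return count;
}
```

The 18 ASCII bytes of "Bush hid the facts" decode to nine code units, all of them in the ideograph block (the first is U+7542, from the bytes 0x42 0x75). The same bytes are therefore plausible both as ASCII and as UTF-16 text, and a heuristic has little evidence on which to choose.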