Unicode in Microsoft Windows

This is an old revision of this page, as edited by Spitzak (talk | contribs) at 10:30, 29 November 2013 (Windows NT based systems). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

Microsoft began implementing Unicode consistently in its products quite early.[clarification needed] Windows NT was the first operating system to use Unicode in its system calls. It initially used the UCS-2 encoding scheme and was upgraded to UTF-16 starting with Windows 2000, which allows characters in the supplementary planes to be represented with surrogate pairs.
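The difference between UCS-2 and UTF-16 can be illustrated with a short sketch. The helper names `to_surrogate_pair` and `utf16_code_units` are hypothetical and exist only for this example. The arithmetic is the standard UTF-16 surrogate calculation, and Python's built-in `utf-16-le` codec is used only as a cross-check.

```python
import struct


def to_surrogate_pair(code_point):
    """Split a supplementary-plane code point into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


def utf16_code_units(text):
    """Return the 16-bit code units of text as encoded in UTF-16LE."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


# A Basic Multilingual Plane character fits in one code unit (as in UCS-2).
print([hex(u) for u in utf16_code_units("A")])
# A character outside the BMP needs two code units, which UCS-2 cannot express.
print([hex(u) for u in utf16_code_units("\U0001F600")])
print([hex(u) for u in to_surrogate_pair(0x1F600)])
```
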

In various Windows families

Windows NT based systems

Windows XP and Windows Server 2003, and before them Windows NT 4 and Windows 2000, ship with system libraries that support two kinds of string encoding: UTF-16 (often called "Unicode" in Windows documentation) and an 8-bit encoding called the "code page" (often referred to, incorrectly, as the "ANSI code page"). Functions that take 16-bit strings have names suffixed with -W (for "wide"), for example lstrlenW(). Code-page-oriented functions use the suffix -A (for "ANSI"), for example lstrlenA(). This allows the Windows NT family to run both programs that use Unicode through the UTF-16 API and older programs that use an 8-bit encoding. Most "A" functions are implemented as wrappers that translate the code-page text to UTF-16 and call the corresponding "W" function.
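The wrapper pattern described above can be sketched in portable Python. This is only a model of the idea, not the Windows implementation: `set_title_w` and `set_title_a` are hypothetical names invented for this example, and Python's `cp1252` and `cp932` codecs are assumed to be reasonable stand-ins for the corresponding Windows code pages.

```python
def set_title_w(title_utf16le):
    """Hypothetical "W" function: accepts UTF-16LE bytes and does the real work.

    Here the "work" is simply returning the decoded text.
    """
    return title_utf16le.decode("utf-16-le")


def set_title_a(title_bytes, code_page="cp1252"):
    """Hypothetical "A" function: a thin wrapper around the "W" version.

    It translates code-page bytes to UTF-16 and forwards the call.
    """
    wide = title_bytes.decode(code_page).encode("utf-16-le")
    return set_title_w(wide)


# The same text reaches the "W" function regardless of the 8-bit encoding used.
western = "café".encode("cp1252")    # single-byte code page
japanese = "日本".encode("cp932")     # multi-byte code page
print(set_title_a(western, "cp1252"))
print(set_title_a(japanese, "cp932"))
```
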

Although the locale can be set so that the "A" functions handle some multi-byte encodings, it cannot be set so that they accept UTF-8. Because many libraries, including the standard C and C++ libraries, access files only through the "A" API, they cannot open every file whose name contains Unicode characters. These libraries could be fixed by having them convert UTF-8 to UTF-16, or the "A" API could be extended to accept UTF-8, but Microsoft has so far done neither.
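The limitation can be demonstrated without touching the file system. In the sketch below, the helper `to_code_page` is hypothetical, Python's `cp1252` codec is assumed to approximate Windows code page 1252, and `errors="replace"` is assumed to approximate the substitution of a default character for anything the code page cannot represent.

```python
def to_code_page(name, code_page="cp1252"):
    """Convert a file name to an 8-bit code page, replacing what does not fit."""
    return name.encode(code_page, errors="replace")


original = "日本語.txt"

# Through an 8-bit code page the name is damaged and cannot be recovered.
narrow = to_code_page(original)
round_tripped = narrow.decode("cp1252")
print(round_tripped == original)   # the original name is lost

# Converting UTF-8 to UTF-16 instead (the first fix suggested above) is lossless.
wide = original.encode("utf-8").decode("utf-8").encode("utf-16-le")
print(wide.decode("utf-16-le") == original)
```
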

The IsTextUnicode function applies a heuristic algorithm to a byte string to detect whether the string represents UTF-16 text. For very short texts, this function, which is used by some applications such as Notepad, often gives incorrect results. This gave rise to legends about the existence of "Easter eggs" such as "Bush hid the facts".
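The following sketch does not call IsTextUnicode or reproduce its heuristics. It only shows why the 18 ASCII bytes of "Bush hid the facts" are a plausible candidate for misdetection: read as UTF-16LE, they form nine valid code units that all fall in the CJK Unified Ideographs block. The variable names are illustrative.

```python
ascii_bytes = b"Bush hid the facts"

# An even number of bytes, so the data splits cleanly into 16-bit code units.
print(len(ascii_bytes))

# Reinterpreting the same bytes as UTF-16LE yields well-formed text.
as_utf16 = ascii_bytes.decode("utf-16-le")
print(as_utf16)

# Every resulting character lies in the CJK Unified Ideographs block.
all_cjk = all(0x4E00 <= ord(ch) <= 0x9FFF for ch in as_utf16)
print(all_cjk)
```
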

Windows CE

In Windows CE, UTF-16 was used almost exclusively.

Windows 9x

In 2001, Microsoft released a special supplement for its older Windows 9x systems. It includes a dynamic-link library, unicows.dll (only 240 KB), containing the Unicode versions (those with the letter W at the end of their names) of all the basic Windows API functions.

Various encoding schemes

Although Windows uses the UTF-16LE encoding scheme internally, in the NTFS file system, in executables and sometimes in text files, Unicode's byte-oriented encodings UTF-8 and even UTF-7 are supported as well. An application that needs to support UTF-8 or UTF-7 through the Windows API should call MultiByteToWideChar and WideCharToMultiByte, the same functions used to support "legacy" (i.e. pre-Unicode) code pages.[1] Many applications inevitably have to support UTF-8 because it is the most widely used Unicode encoding scheme in network protocols, including the Internet protocol suite.
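The conversions that MultiByteToWideChar and WideCharToMultiByte perform can be modelled in portable Python. The functions `multi_byte_to_wide` and `wide_to_multi_byte` below are hypothetical analogues written for this example. They do not call the Windows API, and they assume Python's `utf-8`, `utf-7` and `utf-16-le` codecs as stand-ins for the corresponding Windows conversions.

```python
def multi_byte_to_wide(data, encoding="utf-8"):
    """Analogue of MultiByteToWideChar: byte-oriented text to UTF-16LE bytes."""
    return data.decode(encoding).encode("utf-16-le")


def wide_to_multi_byte(wide, encoding="utf-8"):
    """Analogue of WideCharToMultiByte: UTF-16LE bytes to byte-oriented text."""
    return wide.decode("utf-16-le").encode(encoding)


text = "é"
utf8 = text.encode("utf-8")            # two bytes: C3 A9
wide = multi_byte_to_wide(utf8)        # one 16-bit code unit: E9 00
print(utf8.hex(), wide.hex())

# The same pair of functions handles UTF-7 by changing the encoding argument.
utf7 = wide_to_multi_byte(wide, "utf-7")
print(multi_byte_to_wide(utf7, "utf-7") == wide)
```
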

  1. "UTF-8 in Windows". Stack Overflow. Retrieved July 1, 2011.