Unicode in Microsoft Windows: Difference between revisions

Windows 9x: Removed incorrect use of the word "Unicode" to mean UTF-16
Whoops, multibyte is a *third* api, described that, and moved the UTF-8 problems to where UTF-8 is discussed
{{merge from|Bush hid the facts|date=November 2013}}
{{refimprove|date=June 2011}}
Microsoft started to consistently implement [[Unicode]] in their products quite early.{{clarify|date=July 2012}} [[Windows NT]] was the first operating system that used "wide characters" in [[system call]]s. Initially based on the [[UCS-2]] encoding scheme, it was upgraded to [[UTF-16]] starting with [[Windows 2000]], allowing characters in the additional planes to be represented with surrogate pairs.
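The surrogate-pair mechanism mentioned above can be sketched in a few lines. This is a minimal illustration of the UTF-16 arithmetic only, not Windows code; the function name <code>to_surrogate_pair</code> is hypothetical.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary-plane code point into two 16-bit code units.

    Hypothetical helper for illustration; it follows the UTF-16 rule of
    subtracting 0x10000 and splitting the remaining 20 bits in half.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits -> high (lead) surrogate
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits -> low (trail) surrogate
    return high, low


# U+1F600 lies outside the Basic Multilingual Plane, so UCS-2 cannot
# represent it, while UTF-16 uses two code units.
high, low = to_surrogate_pair(0x1F600)
print(hex(high), hex(low))

# Cross-check against Python's built-in UTF-16 encoder (big-endian here,
# so the bytes read in the same order as the code units).
encoded = "\U0001F600".encode("utf-16-be")
print(encoded.hex())
```

A code point in the Basic Multilingual Plane needs no such split, which is why text limited to that plane is byte-for-byte identical in UCS-2 and UTF-16.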
 
== In various Windows families ==
=== Windows NT based systems ===
[[Windows XP]] and [[Windows Server 2003]], and before them [[Windows NT 4]] and Windows 2000, ship with [[Windows API|system libraries]] that support two kinds of string [[character encoding|encoding]]: UTF-16 (often called "Unicode" in Windows documentation) and an 8-bit encoding called the "[[Windows code page|code page]]" (or, incorrectly, the ''ANSI code page''). 16-bit functions have names suffixed with -W (from [[wide character|"wide"]]), for example lstrlenW(). Code page oriented functions use the suffix -A (for "ANSI"), for example lstrlenA(). This split was necessary because many languages, including C, do not provide a clean way to pass both 8-bit and 16-bit strings to the same API, or to put them in the same structure. Windows also provides the "M" API, which in some locales provided multi-byte encodings but in most locales is the same as "A". Most of the "A" and "M" functions are implemented as a [[Wrapper function|wrapper]] that translates the code page text to UTF-16 and calls the "W" function.
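The wrapper arrangement described above can be modelled outside Windows. The sketch below is an assumption-laden toy: <code>set_title_w</code>, <code>set_title_a</code> and the <code>state</code> dictionary are hypothetical names, not Windows functions, and Python's <code>cp1252</code> codec stands in for the active code page.

```python
# Toy model of an "A" function implemented as a wrapper around its "W"
# counterpart. All names here are hypothetical; this is not the Windows API.

state = {"title_utf16le": b""}


def set_title_w(title_utf16le):
    """'Wide' entry point: stores the UTF-16LE bytes unchanged."""
    state["title_utf16le"] = title_utf16le


def set_title_a(title_bytes, code_page="cp1252"):
    """'A' entry point: translate code page text to UTF-16, then call 'W'."""
    text = title_bytes.decode(code_page)        # interpret bytes per code page
    set_title_w(text.encode("utf-16-le"))       # hand off to the wide function


# The byte 0xE9 means 'é' in code page 1252; the stored form is UTF-16LE.
set_title_a(b"caf\xe9")
print(state["title_utf16le"].decode("utf-16-le"))
```

In this model the 8-bit entry point adds only a conversion step, so the stored result depends on which code page is assumed for the input bytes.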
 
The <code>IsTextUnicode</code> function uses a [[heuristic algorithm]] on a [[byte string]] passed to it to detect whether the string represents UTF-16 text. For very short texts this function, which is used by some applications such as [[Notepad (software)|Notepad]], often gives incorrect results. This gave rise to legends about the existence of [[Easter egg (computing)|"Easter eggs"]] like [[Bush hid the facts]].
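The ambiguity behind such misdetections can be reproduced without Windows. The sketch below does not reimplement <code>IsTextUnicode</code>; it only shows, using Python's standard codecs, that the same 18 ASCII bytes are also a well-formed UTF-16LE string, which is what makes a heuristic guess possible at all.

```python
# The same bytes, read two ways. This illustrates the ambiguity only;
# it is not the heuristic that IsTextUnicode applies.
raw = b"Bush hid the facts"

as_ascii = raw.decode("ascii")        # 18 one-byte characters
as_utf16 = raw.decode("utf-16-le")    # 9 two-byte code units

print(len(as_ascii), len(as_utf16))

# Every pair of ASCII bytes forms a code unit far above the ASCII range,
# so the UTF-16 reading contains no Latin letters at all.
print(all(ord(ch) > 0x7F for ch in as_utf16))

# The UTF-16 reading round-trips to exactly the original bytes.
print(as_utf16.encode("utf-16-le") == raw)
```

Because the byte count is even and no pair of bytes falls in the surrogate range, nothing in the data itself rules out the UTF-16 interpretation.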
 
=== Windows CE ===
In [[Windows CE]] UTF-16 was used almost exclusively, with the "A" API mostly missing.
{{expand section|date=June 2011}}
 
== Various encoding schemes ==
Although Windows uses the UTF-16LE encoding scheme internally, in the [[NTFS]] file system, in [[Portable Executable|executables]] and often in [[text files]], Unicode's [[byte oriented]] encodings [[UTF-8]] and even [[UTF-7]] are supported as well. An application that has to pass UTF-8 or UTF-7 to or from a "W" [[Windows API]] should call the same functions, [[MultiByteToWideChar]] and WideCharToMultiByte, that are used to support "legacy" (i.e. pre-Unicode) code pages.<ref>{{cite web |url=http://stackoverflow.com/questions/166503/utf-8-in-windows |title=UTF-8 in Windows |publisher=[[Stack Overflow]] |accessdate=July 1, 2011}}</ref> Many applications inevitably have to support UTF-8 because it is the most widely used Unicode encoding scheme in [[network protocol]]s, including the [[Internet Protocol Suite]].
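The conversion step described above can be sketched portably. The code below is an assumption-based stand-in that uses Python's built-in codecs rather than the Windows functions; <code>utf8_to_wide</code> and <code>wide_to_utf8</code> are hypothetical names that only play the roles MultiByteToWideChar and WideCharToMultiByte would play for UTF-8 input.

```python
def utf8_to_wide(data):
    """Convert UTF-8 bytes to UTF-16LE bytes (the form a 'W' call expects).

    Hypothetical helper; malformed UTF-8 raises UnicodeDecodeError.
    """
    return data.decode("utf-8").encode("utf-16-le")


def wide_to_utf8(wide):
    """Convert UTF-16LE bytes returned by a 'W' call back to UTF-8 bytes."""
    return wide.decode("utf-16-le").encode("utf-8")


# Text as it might arrive from a network protocol, encoded as UTF-8.
utf8_text = "Grüße".encode("utf-8")
wide_text = utf8_to_wide(utf8_text)

# Byte counts differ between the encodings, so buffers must be sized
# for the target encoding rather than the source.
print(len(utf8_text), len(wide_text))

# Converting back yields the original bytes.
print(wide_to_utf8(wide_text) == utf8_text)
```

Keeping text as UTF-8 inside the application and converting only at the API boundary is one possible design; the article itself does not prescribe where the conversion should happen.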
 
Although the locale can be set so that the "M" encodings handle some multi-byte encodings, it is not possible to set them to support [[UTF-8]] (attempts to use the locale ID passed to MultiByteToWideChar for UTF-8 are ignored). Because many libraries, including the standard C and C++ libraries, only allow access to files through the "M" API, it is not possible to open all Unicode-named files with them. These libraries could be fixed by making them convert UTF-8 to UTF-16, or the "A" API could be improved to accept UTF-8, but Microsoft has so far done neither.
 
<references />