Revision as of 10:31, 29 November 2013, by Spitzak (→Windows 9x: Removed incorrect use of the word "Unicode" to mean UTF-16)
Revision as of 10:41, 29 November 2013, by Spitzak (Whoops, multibyte is a third api, described that, and moved the UTF-8 problems to where UTF-8 is discussed)
Line 1:
{{merge from|Bush hid the facts|date=November 2013}}
{{refimprove|date=June 2011}}

Microsoft started to consistently implement [[Unicode]] in their products quite early.{{clarify|date=July 2012}} [[Windows NT]] was the first operating system that used ~~Unicode~~ "wide characters" in [[system call]]s. Using at first the [[UCS-2]] encoding scheme, it was upgraded to [[UTF-16]] starting with [[Windows 2000]], allowing characters in the additional planes to be represented with surrogate pairs.

== In various Windows families ==

=== Windows NT based systems ===

[[Windows XP]] and [[Windows Server 2003]], and before them [[Windows NT 4]] and Windows 2000, ship with [[Windows API|system libraries]] that support two kinds of string [[character encoding|encoding]]: UTF-16 (often called "Unicode" in Windows documentation) and an 8-bit encoding called the "[[Windows code page|code page]]" (sometimes incorrectly referred to as the ''ANSI code page''). 16-bit functions have names suffixed with -W (from [[wide character|"wide"]]), for example, lstrlenW(). Code page oriented functions use the suffix -A (for "ANSI"), e.g., lstrlenA(). ~~This allows Windows NT OS family simultaneously run programs capable of using Unicode by using the UTF-16 api, and some older 8-bit encoding.~~ This split was necessary because many languages, including C, do not provide a clean way to pass both 8-bit and 16-bit strings to the same API, or to put them in the same structure. Windows also provides the "M" API, which in some locales provides multi-byte encodings, but in most locales is the same as "A". Most "A" and "M" functions are implemented as [[Wrapper function|wrappers]] that translate text from the code page to UTF-16 and call the corresponding "W" functions.

The <code>IsTextUnicode</code> function uses a [[heuristic algorithm]] on a [[byte string]] passed to it to detect whether the string represents UTF-16 text. For very short texts, this function, used by some applications like [[Notepad (software)|Notepad]], often gives incorrect results. This gave rise to legends about the existence of [[Easter egg (computing)|"Easter eggs"]] like [[Bush hid the facts]].

=== Windows CE ===

In [[Windows CE]] UTF-16 was used almost exclusively, with the "A" API mostly missing.
{{expand section|date=June 2011}}

Line 20 ⟶ 18:
== Various encoding schemes ==

Although Windows uses the UTF-16LE encoding scheme internally, in the [[NTFS]] file system, in [[Portable Executable|executables]] and ~~sometimes~~ often in [[text files]], Unicode's [[byte oriented]] encodings [[UTF-8]] and even [[UTF-7]] are supported as well. An application which has to ~~support~~ pass UTF-8 or UTF-7 ~~by the means of~~ to or from a "W" [[Windows API]] should call the ~~same~~ functions [[MultiByteToWideChar]] and WideCharToMultiByte ~~used to support "legacy" (i.e. pre-Unicode) code pages~~.<ref>{{cite web |url=http://stackoverflow.com/questions/166503/utf-8-in-windows |title=UTF-8 in Windows |publisher=[[Stack Overflow]] |accessdate=July 1, 2011}}</ref> Many applications inevitably have to support UTF-8 because it is the most used of the Unicode encoding schemes in various [[network protocol]]s, including the [[Internet Protocol Suite]].

Although the locale can be set so the "~~A~~M" encodings handle some multi-byte encodings, it is not possible to set them to support [[UTF-8]] (attempts to use the locale id passed to MultiByteToWideChar for UTF-8 are ignored). As many libraries, including the standard C and C++ libraries, only allow access to files using the "~~A~~M" API, it is not possible to open all Unicode-named files with them. These libraries could be fixed by making them convert UTF-8 to UTF-16, or the "A" API could be improved to accept UTF-8, but Microsoft has so far done neither.

<references />
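The conversion between UTF-8 and UTF-16LE discussed under "Various encoding schemes", and the surrogate pairs mentioned in the lead, can be illustrated with a small portable sketch. It uses Python's built-in codecs rather than the Windows functions MultiByteToWideChar and WideCharToMultiByte, so it only approximates what those calls do; the helper names <code>utf8_to_utf16le</code>, <code>utf16le_to_utf8</code> and <code>utf16_code_units</code> are hypothetical and chosen for this example.

```python
def utf8_to_utf16le(data: bytes) -> bytes:
    """Decode UTF-8 bytes and re-encode them as UTF-16LE (no byte order mark)."""
    return data.decode("utf-8").encode("utf-16-le")


def utf16le_to_utf8(data: bytes) -> bytes:
    """Decode UTF-16LE bytes and re-encode them as UTF-8."""
    return data.decode("utf-16-le").encode("utf-8")


def utf16_code_units(text: str) -> list:
    """Return the 16-bit code units that UTF-16 uses for the given text."""
    raw = text.encode("utf-16-le")
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]


# U+1F600 lies outside the Basic Multilingual Plane, so UTF-16 stores it as a
# surrogate pair (two code units), while "A" needs a single code unit.
sample = "A\U0001F600"
wide = utf8_to_utf16le(sample.encode("utf-8"))
units = utf16_code_units(sample)
round_trip = utf16le_to_utf8(wide)
```

Here <code>units</code> holds three code units for two characters, which is the case that UCS-2 could not represent and UTF-16 can.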

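The misdetection of short texts described for <code>IsTextUnicode</code> can be made concrete with a short sketch. This does not reproduce the heuristic itself, whose details are not given here; it only shows, under that stated limitation, why the bytes of a short ASCII sentence are also a plausible UTF-16LE string, which is the ambiguity any such heuristic has to resolve.

```python
text = "Bush hid the facts"

# 18 ASCII bytes, one per character.
raw = text.encode("ascii")

# The same bytes read as UTF-16LE: every two bytes form one 16-bit code unit.
misread = raw.decode("utf-16-le")

# Check whether every resulting character falls in the CJK Unified
# Ideographs block (U+4E00 to U+9FFF).
all_cjk = all(0x4E00 <= ord(ch) <= 0x9FFF for ch in misread)
```

Because every byte pair happens to form a valid ideograph, nothing in the data alone marks the text as 8-bit, and a detector has to fall back on statistical guesses.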
Unicode in Microsoft Windows: Difference between revisions