Talk:Unicode/Archive 4: Difference between revisions

new archive for talk page
 
Fix Linter errors.
 
Line 194:
 
The Java situation:
<cite>
Character handling in J2SE 5 is based on version 4.0 of the Unicode standard. This includes support for supplementary characters, which has been specified by the JSR 204 expert group and implemented throughout the JDK. See the article Supplementary Characters in the Java Platform, the Java Specification Request 204 or the Character class documentation for more information.
</cite>
:::http://java.sun.com/javase/technologies/core/basic/intl/faq.jsp
The Microsoft OS situation:
 
<cite>
Windows 2000 introduced support for basic input, output, and simple sorting of supplementary characters. However, not all system components are compatible with supplementary characters. Also, supplementary characters are not supported in Windows 95/98/Me.
</cite>
:::http://windowssdk.msdn.microsoft.com/en-us/library/ms776414.aspx
The MS SQL Server situation:
*<cite>
Since these characters’ surrogate pairs are considered two separate Unicode code points, the size of nvarchar(n) needs to be 2 to hold a single supplementary character (i.e. space for a surrogate pair)
</cite>
*<cite>
String operations are not supplementary character aware. Thus operations such as Substring(nvarchar(2),1,1) will result in only the high surrogate of the supplementary characters surrogate pair. Also the Len operation will return the count of two characters for every supplementary character encountered – one for the high surrogate and one for the low surrogate.
</cite>
*<cite>
In sorting and searching, all supplementary characters compare equal to all other supplementary characters
</cite>
:::http://www.microsoft.com/globaldev/DrIntl/columns/021/default.mspx#EHD
[[User:Pjacobi|Pjacobi]] 18:37, 6 November 2006 (UTC)
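The length and substring behaviour quoted above for SQL Server has a direct analogue in any API that counts 16-bit code units. As a rough illustration only (in Java, not SQL Server; the class and method names are hypothetical), the sketch below uses U+1D11E as a sample supplementary character and shows that the code-unit count is 2, the codepoint count is 1, and a one-unit substring yields a lone high surrogate.

```java
final class SurrogateLengthDemo {
    // U+1D11E MUSICAL SYMBOL G CLEF, written as its UTF-16 surrogate pair.
    static final String G_CLEF = "\uD834\uDD1E";

    // Number of 16-bit code units (what a surrogate-unaware "length" reports).
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode codepoints (surrogate pairs count once).
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // True if taking the first code unit alone leaves an unpaired high surrogate.
    static boolean firstUnitIsLoneHighSurrogate(String s) {
        String head = s.substring(0, 1);
        return Character.isHighSurrogate(head.charAt(0));
    }

    public static void main(String[] args) {
        System.out.println(codeUnits(G_CLEF));                    // 2
        System.out.println(codePoints(G_CLEF));                   // 1
        System.out.println(firstUnitIsLoneHighSurrogate(G_CLEF)); // true
    }
}
```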
Line 227 ⟶ 220:
::Good points, Plugwash. (Actually, sorting on 16-bit word values is almost equivalent to sorting by codepoint: the two orders differ only in that BMP codepoints U+E000 to U+FFFF sort after all non-BMP codepoints, because the surrogates occupy U+D800 to U+DFFF. The surrogate stuff is very well designed. The only bad thing about it is the name "surrogate", IMO.) OTOH, sorting by raw codepoint is very user-hostile, and locale-specific collated sorts written for UCS-2 will mess up on non-BMP codepoints. (Aside: the variety of rules different cultures use for sorting is quite striking.)
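::The relationship between 16-bit word order and codepoint order can be checked directly. The following is a minimal sketch with hypothetical class and method names: <code>String.compareTo</code> compares 16-bit <code>char</code> values, and the hand-written comparator compares whole codepoints. The two agree for most inputs, but disagree when a BMP character in U+E000..U+FFFF meets a non-BMP character, since the surrogates sit at U+D800..U+DFFF.

```java
final class Utf16OrderDemo {
    // Order by 16-bit code units; this is what String.compareTo does.
    static int compareByCodeUnit(String a, String b) {
        return a.compareTo(b);
    }

    // Order by Unicode codepoint, decoding surrogate pairs as we go.
    static int compareByCodePoint(String a, String b) {
        int i = 0;
        int j = 0;
        while (i < a.length() && j < b.length()) {
            int ca = a.codePointAt(i);
            int cb = b.codePointAt(j);
            if (ca != cb) {
                return Integer.compare(ca, cb);
            }
            i += Character.charCount(ca);
            j += Character.charCount(cb);
        }
        // One string is a prefix of the other: the shorter one sorts first.
        return Integer.compare(a.length() - i, b.length() - j);
    }

    public static void main(String[] args) {
        String bmpHigh = "\uFFFD";        // U+FFFD, a BMP character above the surrogate range
        String nonBmp = "\uD800\uDC00";   // U+10000, the first non-BMP codepoint
        System.out.println(compareByCodeUnit(bmpHigh, nonBmp) > 0);  // true: 0xFFFD > 0xD800
        System.out.println(compareByCodePoint(bmpHigh, nonBmp) < 0); // true: 0xFFFD < 0x10000
    }
}
```

::For any pair of strings that avoids U+E000..U+FFFF, both comparators should give the same sign.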
::Talking about simple concepts of strings, not only is the concept of string length dead, the concept of a character is on its deathbed as well. Good programmers should no longer write code that treats strings as sequences of characters; instead, strings should be treated as sequences of codepoints (the low-level view) or sequences of graphemes (the medium-level view) or sequences of higher-level units (words, lines, etc.).
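::To make the grapheme view concrete, here is a minimal sketch that counts character boundaries with the JDK's <code>java.text.BreakIterator</code>. That class is used here only as a convenient stand-in; its boundaries approximate user-perceived characters and can vary between JDK versions. The class and method names are hypothetical.

```java
import java.text.BreakIterator;
import java.util.Locale;

final class GraphemeCountDemo {
    // Counts the segments between successive character boundaries.
    static int countGraphemes(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance(Locale.ROOT);
        it.setText(s);
        int count = 0;
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String eAcute = "e\u0301"; // 'e' followed by U+0301 COMBINING ACUTE ACCENT
        System.out.println(eAcute.length());         // 2 code units (and 2 codepoints)
        System.out.println(countGraphemes(eAcute));  // 1 user-perceived character
    }
}
```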
::This is why JSR 204 can get away with retaining <code>char</code> as a 16-bit type and storing non-BMP codepoints as a surrogate pair. Code that processes strings character by character has to be rewritten to use <code>codePointAt</code> and similar methods which JSR 204 added to <code>java.lang.String</code> and <code>java.lang.StringBuffer</code>, but it's better to use a higher-level [[International Components for Unicode|ICU4J]] facility such as <code>BreakIterator</code>. (See also the brief rationale for JSR 204 in [http://java.sun.com/developer/technicalArticles/Intl/Supplementary/ ''Supplementary Characters in the Java Platform''].) The days when any competent programmer could write production-quality text-processing tools from scratch are over.
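::For the low-level rewrite, a minimal sketch of a codepoint-by-codepoint loop might look like the following. The class and method names are hypothetical; <code>String.codePointAt</code> and <code>Character.charCount</code> are the standard methods added for supplementary character support.

```java
import java.util.ArrayList;
import java.util.List;

final class CodePointWalkDemo {
    // Returns the codepoints of s in order, decoding surrogate pairs.
    static List<Integer> codePointsOf(String s) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            out.add(cp);
            // Advance by 1 for a BMP codepoint, by 2 for a surrogate pair.
            i += Character.charCount(cp);
        }
        return out;
    }

    public static void main(String[] args) {
        // 'a', U+1D11E (as a surrogate pair), 'b'
        for (int cp : codePointsOf("a\uD834\uDD1Eb")) {
            System.out.println("U+" + Integer.toHexString(cp).toUpperCase());
        }
    }
}
```

::A loop that steps with <code>i++</code> and <code>charAt(i)</code> would instead see four separate <code>char</code> values for the same string.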
::Thanks also to Pjacobi for those very useful links above.
::Going back to the original question, my answer is that UCS-2 ''is'' obsolete (or at least becoming obsolete), but many systems written to store and (to a lesser extent) process UCS-2 text are not. Of course, this is a fairly narrow distinction.