Null-terminated string: Difference between revisions

 
== Character encodings ==
Null-terminated strings require that the encoding never uses a zero byte (0x00) inside the string. It is therefore not possible to store every possible [[ASCII]] or [[UTF-8]] string, because both encode the NUL character as a single zero byte.<ref>{{cite web|title=UTF-8, a transformation format of ISO 10646|url=http://tools.ietf.org/html/rfc3629#section-3|accessdate=19 September 2013}}</ref><ref><!-- This is the encoding table provided as a resource by the Unicode consortium: http://www.unicode.org/resources/utf8.html -->{{cite web|title=Unicode/UTF-8-character table|url=http://www.utf8-chartable.de/|accessdate=13 September 2013}}</ref><ref>{{cite web|last=Kuhn|first=Markus|title=UTF-8 and Unicode FAQ|url=http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8|accessdate=13 September 2013}}</ref> However, it is common to store the subset of ASCII or UTF-8 that excludes the NUL character in null-terminated strings. Some systems use "[[modified UTF-8]]", which encodes the NUL character as two non-zero bytes (0xC0, 0x80) and thus allows all possible strings to be stored. This is not allowed by the UTF-8 standard, because it is an [[UTF-8#Overlong_encodings|overlong encoding]], and it is seen as a security risk: a 0xC0, 0x80 NUL might be treated as a string terminator during security validation but as a character when the string is used. Some other byte, such as 0xFE or 0xFF, may be used as the end-of-string marker instead; these bytes never appear in valid UTF-8 (and are therefore themselves invalid code units).
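The effect of an embedded NUL, and of the 0xC0, 0x80 substitution, can be illustrated with a minimal Python sketch. The function names <code>encode_nul_as_c0_80</code> and <code>c_string_view</code> are hypothetical helpers written for this illustration, not part of any library, and the sketch covers only the NUL substitution, not any other way in which a particular "modified UTF-8" may differ from standard UTF-8.

```python
def encode_nul_as_c0_80(text: str) -> bytes:
    """Encode text as UTF-8, but write U+0000 as the overlong pair C0 80.

    Hypothetical helper: in UTF-8 the byte 0x00 only ever encodes U+0000,
    so a plain byte replacement is sufficient for this illustration.
    """
    return text.encode("utf-8").replace(b"\x00", b"\xc0\x80")


def c_string_view(buffer: bytes) -> bytes:
    """Return what a null-terminated reader sees: bytes before the first 0x00."""
    end = buffer.find(b"\x00")
    return buffer if end == -1 else buffer[:end]


text = "ab\x00cd"  # a string with an embedded NUL character

# Standard UTF-8 plus a terminator: everything after the embedded NUL is lost.
standard = text.encode("utf-8") + b"\x00"
truncated = c_string_view(standard)

# NUL written as C0 80: the whole string survives up to the real terminator.
modified = encode_nul_as_c0_80(text) + b"\x00"
preserved = c_string_view(modified)

# A strict UTF-8 decoder rejects the overlong pair.
try:
    preserved.decode("utf-8")
    strict_decoder_accepts = True
except UnicodeDecodeError:
    strict_decoder_accepts = False
```

Here <code>truncated</code> holds only the bytes for "ab", while <code>preserved</code> holds all six bytes of the substituted form. Python's strict decoder refuses the latter, which reflects the standard's ban on overlong encodings.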
 
[[UTF-16]] uses 2-byte integers, and as either byte may be zero (and in fact ''every other'' byte is, when representing ASCII text), UTF-16 text cannot be stored in a null-terminated byte string. However, some languages implement a string of 16-bit UTF-16 code units terminated by a 16-bit NUL code unit. Again, the NUL character, which encodes as a single zero code unit, is the only character that cannot be stored; UTF-16 does not have any alternative encoding of zero.
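The difference between a byte-wise terminator and a 16-bit terminator can be sketched in Python as follows. Both helper functions are hypothetical names introduced for this illustration, and little-endian UTF-16 is assumed.

```python
def byte_string_view(buffer: bytes) -> bytes:
    """Bytes before the first zero byte, as a byte-oriented reader sees them."""
    end = buffer.find(b"\x00")
    return buffer if end == -1 else buffer[:end]


def wide_string_view(buffer: bytes) -> bytes:
    """Bytes before the first zero 16-bit code unit, scanning aligned pairs."""
    for i in range(0, len(buffer) - 1, 2):
        if buffer[i:i + 2] == b"\x00\x00":
            return buffer[:i]
    return buffer


# "Hi" in little-endian UTF-16 is 48 00 69 00, followed by a 16-bit terminator.
encoded = "Hi".encode("utf-16-le") + b"\x00\x00"

# A byte-oriented reader stops at the high byte of the first code unit.
seen_by_byte_reader = byte_string_view(encoded)

# A reader that compares whole code units recovers the full text.
seen_by_wide_reader = wide_string_view(encoded).decode("utf-16-le")
```

The byte-oriented reader sees only the single byte for "H", whereas the code-unit reader recovers "Hi". The scan steps two bytes at a time so that a zero low byte followed by a zero high byte of the next code unit is not mistaken for a terminator.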