Null-terminated string: Difference between revisions

Consistent use, "NUL" == "null character", use null or zero and never say "NUL character"
Monkbot (talk | contribs)
m Task 18 (cosmetic): eval 5 templates: hyphenate params (5×);
Line 11:
This had some influence on CPU [[instruction set]] design. Some CPUs in the 1970s and 1980s, such as the [[Zilog Z80]] and the [[Digital Equipment Corporation|DEC]] [[VAX]], had dedicated instructions for handling length-prefixed strings. However, as the null-terminated string gained traction, CPU designers began to take it into account, as seen for example in IBM's decision to add the "Logical String Assist" instructions to the [[IBM ES/9000 family|ES/9000]] 520 in 1992.
 
[[FreeBSD]] developer [[Poul-Henning Kamp]], writing in ''[[ACM Queue]]'', would later refer to the victory of null-terminated strings over a 2-byte (not one-byte) length as "the most expensive one-byte mistake" ever.<ref>{{citation |last=Kamp |first=Poul-Henning |date=25 July 2011 |title=The Most Expensive One-byte Mistake |journal=ACM Queue |volume=9 |number=7 |issn=1542-7730 |access-date=2 August 2011 |url=http://queue.acm.org/detail.cfm?id=2010365 }}</ref>
 
== Limitations ==
While simple to implement, this representation has been prone to errors and performance problems.
 
Null-termination has historically created [[computer insecurity|security problems]].<ref>{{cite journal|url= http://insecure.org/news/P55-07.txt |author=Rain Forest Puppy |title=Perl CGI problems |work=Phrack Magazine |publisher=artofhacking.com |date=9 September 1999 |volume=9 |issue=55 |page=7 |access-date=3 January 2016}}</ref> A NUL inserted into the middle of a string will truncate it unexpectedly.<ref>https://security.stackexchange.com/questions/48187/null-byte-injection-on-php</ref> A common bug was to not allocate the additional space for the NUL, so it was written over adjacent memory. Another was to not write the NUL at all, which was often not detected during testing because the block of memory already contained zeros. Due to the expense of finding the length, many programs did not bother before copying a string to a fixed-size [[Data buffer|buffer]], causing a [[buffer overflow]] if it was too long.
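
The allocation and overflow bugs described above can be avoided by measuring the string and reserving one extra byte for the terminator before copying. The following is a minimal sketch only; <code>copy_checked</code> is a hypothetical helper named here for illustration, not a standard library function.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (illustration only): copies src into dst only if
 * the text AND its terminating zero byte fit in dst_size bytes.
 * Returns 0 on success, -1 if the string would not fit. */
int copy_checked(char *dst, size_t dst_size, const char *src)
{
    size_t len = strlen(src);    /* O(n) scan to find the terminator */
    if (len >= dst_size)         /* need len bytes of text plus 1 for the zero byte */
        return -1;               /* refuse rather than overflow the buffer */
    memcpy(dst, src, len + 1);   /* copy the text together with its terminator */
    return 0;
}
```

With a 6-byte buffer, this sketch accepts a 5-character string and rejects a 6-character one, because the sixth byte is needed for the terminator.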
 
The inability to store a zero requires that text and binary data be kept distinct and handled by different functions (with the latter requiring the length of the data to also be supplied). This can lead to code redundancy and errors when the wrong function is used.
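
As an illustration of how using the wrong function goes wrong, the sketch below compares two buffers first as text and then as binary data. The wrapper names <code>text_equal</code> and <code>bytes_equal</code> are assumptions made for this example; they simply call the standard C functions <code>strcmp</code> and <code>memcmp</code>.

```c
#include <stddef.h>
#include <string.h>

/* Text comparison: stops at the first zero byte in either buffer. */
int text_equal(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}

/* Binary comparison: the caller supplies the length, and zero bytes
 * are treated as ordinary data. */
int bytes_equal(const void *a, const void *b, size_t len)
{
    return memcmp(a, b, len) == 0;
}
```

For the 5-byte buffers <code>{'a','b',0,'c','d'}</code> and <code>{'a','b',0,'x','y'}</code>, the text comparison reports equality because both appear to be the string "ab", while the binary comparison over all 5 bytes reports a difference.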
Line 23:
 
== Character encodings ==
Null-terminated strings require that the encoding does not use a zero byte (0x00) anywhere, so it is not possible to store every possible [[ASCII]] or [[UTF-8]] string.<ref>{{cite web|title=UTF-8, a transformation format of ISO 10646|url=http://tools.ietf.org/html/rfc3629#section-3|access-date=19 September 2013}}</ref><ref><!-- This is the encoding table provided as a resource by the Unicode consortium: http://www.unicode.org/resources/utf8.html -->{{cite web|title=Unicode/UTF-8-character table|url=http://www.utf8-chartable.de/|access-date=13 September 2013}}</ref><ref>{{cite web|last=Kuhn|first=Markus|title=UTF-8 and Unicode FAQ|url=http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8|access-date=13 September 2013}}</ref> However, it is common to store the subset of ASCII or UTF-8 – every character except NUL – in null-terminated strings. Some systems use "[[modified UTF-8]]", which encodes NUL as two non-zero bytes (0xC0, 0x80) and thus allows all possible strings to be stored. This is not allowed by the UTF-8 standard, because it is an [[UTF-8#Overlong_encodings|overlong encoding]], and it is seen as a security risk. Some other byte may be used as end of string instead, like 0xFE or 0xFF, which are not used in UTF-8.
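
The NUL substitution used by modified UTF-8 can be sketched as follows. This is an illustration under stated assumptions, not a complete modified UTF-8 encoder: it handles only the zero-byte replacement, and the function name <code>encode_nul_as_c080</code> and its return convention are invented for this example.

```c
#include <stddef.h>

/* Hypothetical helper (illustration only): copies len bytes from in to out,
 * replacing every zero byte with the overlong pair 0xC0 0x80, then appends
 * a terminating zero byte. Returns the number of bytes written, not counting
 * the terminator, or (size_t)-1 if out_size is too small. */
size_t encode_nul_as_c080(const unsigned char *in, size_t len,
                          unsigned char *out, size_t out_size)
{
    size_t o = 0;
    if (out_size == 0)
        return (size_t)-1;             /* no room even for the terminator */
    for (size_t i = 0; i < len; i++) {
        size_t need = (in[i] == 0x00) ? 2 : 1;
        if (out_size - o <= need)      /* must leave 1 byte for the terminator */
            return (size_t)-1;
        if (in[i] == 0x00) {
            out[o++] = 0xC0;           /* embedded NUL becomes two non-zero bytes */
            out[o++] = 0x80;
        } else {
            out[o++] = in[i];          /* all other bytes are copied unchanged */
        }
    }
    out[o] = 0x00;                     /* the only real zero byte: the terminator */
    return o;
}
```

Given the 3-byte input <code>'A', 0x00, 'B'</code>, this sketch produces the bytes <code>'A', 0xC0, 0x80, 'B'</code> followed by the terminator, so the result can be passed to functions that expect a null-terminated string.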
 
[[UTF-16]] uses 2-byte integers, and because either byte may be zero (and in fact ''every other'' byte is, when representing ASCII text), UTF-16 text cannot be stored in a null-terminated byte string. However, some languages implement a string of 16-bit [[UTF-16]] characters, terminated by a 16-bit NUL