Content deleted Content added
Consistent use, "NUL" == "null character", use null or zero and never say "NUL character" |
|||
Line 1:
{{see also|String (computer science)#Null-terminated}}
In [[computer programming]], a '''null-terminated string''' is a [[character string]] stored as an [[Array data structure|array]] containing the characters and terminated with a [[null character]] (a character with a value of zero, called NUL in this article). Alternative names are '''[[C string]]''', which refers to the [[C (programming language)|C programming language]] and '''ASCIIZ''' (although C can use encodings other than ASCII).
The length of a
== History ==
Null-terminated strings were produced by the <code>.ASCIZ</code> directive of the [[PDP-11]] [[assembly language]]s and the <code>ASCIZ</code> directive of the [[MACRO-10]] macro assembly language for the [[PDP-10]]. These predate the development of the C programming language, but other forms of strings were often used.
At the time C (and the languages that it was derived from) was developed, memory was extremely limited, so using only one byte of overhead to store the length of a string was attractive. The only popular alternative at that time, usually called a "Pascal string" (a more modern term is "[[String (computer science)#Length-prefixed|length-prefixed]]"), used a leading ''byte'' to store the length of the string. This allows the string to contain NUL and made finding the length need only one memory access (O(1) [[constant time|(constant) time]]), but limited string length to 255 characters (on a machine using 8-bit bytes). C designer [[Dennis Ritchie]] chose to follow the convention of
This had some influence on CPU [[instruction set]] design. Some CPUs in the 1970s and 1980s, such as the [[Zilog Z80]] and the [[Digital Equipment Corporation|DEC]] [[VAX]], had dedicated instructions for handling length-prefixed strings. However, as the
[[FreeBSD]] developer [[Poul-Henning Kamp]], writing in ''[[ACM Queue]]'', would later refer to the victory of null-terminated strings over a 2-byte (not one-byte) length as "the most expensive one-byte mistake" ever.<ref>{{citation |last=Kamp |first=Poul-Henning |date=25 July 2011 |title=The Most Expensive One-byte Mistake |journal=ACM Queue |volume=9 |number=7 |issn=1542-7730 |accessdate=2 August 2011 |url=http://queue.acm.org/detail.cfm?id=2010365 }}</ref>
Line 16:
While simple to implement, this representation has been prone to errors and performance problems.
The inability to store a
The speed problems with finding the length can usually be mitigated by combining it with another operation that is O(''n'') anyway, such as in <code>[[strlcpy]]</code>. However, this does not always result in an intuitive [[API]].
== Character encodings ==
Null-terminated strings require that the encoding does not use a zero byte (0x00) anywhere, therefore it is not possible to store every possible [[ASCII]] or [[UTF-8]] string.<ref>{{cite web|title=UTF-8, a transformation format of ISO 10646|url=http://tools.ietf.org/html/rfc3629#section-3|accessdate=19 September 2013}}</ref><ref><!-- This is the encoding table provided as a resource by the Unicode consortium: http://www.unicode.org/resources/utf8.html -->{{cite web|title=Unicode/UTF-8-character table|url=http://www.utf8-chartable.de/|accessdate=13 September 2013}}</ref><ref>{{cite web|last=Kuhn|first=Markus|title=UTF-8 and Unicode FAQ|url=http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8|accessdate=13 September 2013}}</ref> However, it is common to store the subset of ASCII or UTF-8 – every character except
[[UTF-16]] uses 2-byte integers and as either byte may be zero (and in fact ''every other'' byte is, when representing ASCII text), cannot be stored in a null-terminated byte string. However, some languages implement a string of 16-bit [[UTF-16]] characters, terminated by a 16-bit NUL
== Improvements ==
|