{{Short description|Data structure}}
{{Redirect|CString||C string (disambiguation)}}
{{see also|String (computer science)#Null-terminated}}
{{Use dmy dates|date=August 2021}}
In [[computer programming]], a '''null-terminated string''' is a [[character string]] stored as an [[Array data structure|array]] containing the characters and terminated with a ''[[null character]]'' (a character with a value of zero, called NUL in this article). Alternative names are '''[[C string]]''', which refers to the [[C (programming language)|C programming language]], and '''ASCIIZ'''{{citation needed|date=November 2021}} (although C can use encodings other than ASCII).
The length of a string is found by searching for the (first) NUL. This can be slow as it takes O(''n'') ([[linear time]]) with respect to the string length. It also means that a string cannot contain a NUL (there is a NUL in memory, but it is after the last character, not "in" the string).
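The linear scan can be sketched in C as follows. This is a minimal illustration, not the implementation used by any particular C library; the name <code>nts_length</code> is hypothetical, and the function is meant to behave like the standard <code>strlen</code>.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a standard function): counts the characters
   before the first NUL by visiting them one at a time, so the cost grows
   linearly with the length of the string. */
size_t nts_length(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    while (s[n] != '\0') {
        n++;
    }
    return n;
}
```

Because the scan stops at the first NUL, calling <code>nts_length("ab\0cd")</code> reports a length of 2: the bytes after the embedded NUL are present in memory but are not part of the string.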
== History ==
Null-terminated strings were produced by the <code>.ASCIZ</code> directive of the [[PDP-11]] [[assembly language]]s and the <code>ASCIZ</code> directive of the [[MACRO-10]] macro assembly language for the [[PDP-10]]. These predate the development of the C programming language, but other forms of strings were often used.
At the time C (and the languages that it was derived from) was developed, memory was extremely limited, so using only one byte of overhead to store the length of a string was attractive. The only popular alternative at that time, usually called a "Pascal string" (a more modern term is "[[String (computer science)#Length-prefixed|length-prefixed]]"), used a leading ''byte'' to store the length of the string. This allows the string to contain NUL and makes finding the length a constant-time operation, but limits the string length to 255 characters.
This had some influence on CPU [[instruction set]] design. Some CPUs in the 1970s and 1980s, such as the [[Zilog Z80]] and the [[Digital Equipment Corporation|DEC]] [[VAX]], had dedicated instructions for handling length-prefixed strings. However, as the null-terminated string gained traction, CPU designers began to take it into account, as seen for example in IBM's decision to add the "Logical String Assist" instructions to the [[IBM ES/9000 family|ES/9000]] 520 in 1992 and the vector string instructions to the [[IBM z13 (microprocessor)|IBM z13]] in 2015.<ref name=pop>[http://publibfp.dhe.ibm.com/epubs/pdf/a227832c.pdf IBM z/Architecture Principles of Operation]</ref>
[[FreeBSD]] developer [[Poul-Henning Kamp]], writing in ''[[ACM Queue]]'', referred to the victory of null-terminated strings over a 2-byte (not one-byte) length as "the most expensive one-byte mistake" ever.<ref>{{citation |last=Kamp |first=Poul-Henning |date=25 July 2011 |title=The Most Expensive One-byte Mistake |journal=ACM Queue |volume=9 |number=7 |pages=40–43 |doi=10.1145/2001562.2010365 |s2cid=30282393 |issn=1542-7730}}</ref>
== Limitations ==
== Character encodings ==
Null-terminated strings require that the encoding does not use a zero byte (0x00) anywhere; therefore it is not possible to store every possible [[ASCII]] or [[UTF-8]] string.<ref>{{cite web|title=UTF-8, a transformation format of ISO 10646|date=November 2003 |url=http://tools.ietf.org/html/rfc3629#section-3|access-date=19 September 2013 |last1=Yergeau |first1=François }}</ref><ref><!-- This is the encoding table provided as a resource by the Unicode consortium: http://www.unicode.org/resources/utf8.html -->{{cite web|title=Unicode/UTF-8-character table|url=http://www.utf8-chartable.de/|access-date=13 September 2013}}</ref><ref>{{cite web|last=Kuhn|first=Markus|title=UTF-8 and Unicode FAQ|url=http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8|access-date=13 September 2013}}</ref> However, it is common to store the subset of ASCII or UTF-8 – every character except NUL – in null-terminated strings. Some systems use "[[modified UTF-8]]", which encodes NUL as two non-zero bytes (0xC0, 0x80) and thus allows all possible strings to be stored. This is not allowed by the UTF-8 standard, because it is an [[UTF-8#Overlong encodings|overlong encoding]], and it is seen as a security risk. Alternatively, a byte that is never used in UTF-8, such as 0xFE or 0xFF, may serve as the end-of-string marker instead.
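The NUL substitution described above can be sketched in C. This is an illustration only: the function name <code>encode_embedded_nul</code> and its buffer-size contract are assumptions made for the example, and it shows only the 0xC0 0x80 replacement, not any other rule a particular "modified UTF-8" variant may apply.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: copies len bytes from src to dst, writing every
   0x00 byte as the two-byte sequence 0xC0 0x80, then appends a single
   terminating NUL. The caller must provide room for 2 * len + 1 bytes
   in dst. Returns the number of bytes written, excluding the terminator. */
size_t encode_embedded_nul(const unsigned char *src, size_t len,
                           unsigned char *dst)
{
    assert(src != NULL && dst != NULL);
    size_t out = 0;
    for (size_t i = 0; i < len; i++) {
        if (src[i] == 0x00) {
            dst[out++] = 0xC0;   /* overlong encoding of U+0000 ... */
            dst[out++] = 0x80;   /* ... contains no zero byte       */
        } else {
            dst[out++] = src[i];
        }
    }
    dst[out] = 0x00;             /* the only zero byte is the terminator */
    return out;
}
```

After this transformation the only zero byte in the output is the terminator, so the result can be handled by code that expects a null-terminated string. A strict UTF-8 decoder is expected to reject the 0xC0 0x80 sequence, so the receiving side has to be aware of the convention.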
[[UTF-16]] uses 2-byte integers, and since either byte may be zero (and in fact ''every other'' byte is, when representing ASCII text), UTF-16 text cannot be stored in a null-terminated byte string. However, some languages implement a string of 16-bit [[UTF-16]] characters, terminated by a 16-bit NUL (0x0000).
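A string of 16-bit units terminated by a 16-bit NUL can be measured the same way as a byte string, provided the scan compares whole 16-bit units rather than individual bytes. The sketch below assumes the units are held in a <code>uint16_t</code> array; the name <code>utf16_unit_length</code> is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: counts 16-bit code units before the first 0x0000
   unit. A zero byte inside a non-zero unit (such as the high byte of
   0x0048) does not end the string, because whole units are compared. */
size_t utf16_unit_length(const uint16_t *s)
{
    assert(s != NULL);
    size_t n = 0;
    while (s[n] != 0x0000) {
        n++;
    }
    return n;
}
```

The result is a count of code units, not characters: a character encoded as a surrogate pair contributes two units.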