Null-terminated string: Difference between revisions

{{Short description|Data structure}}
{{Redirect|CString||C string (disambiguation)}}
{{see also|String (computer science)#Null-terminated}}
{{Use dmy dates|date=August 2021}}
In [[computer programming]], a '''null-terminated string''' is a [[character string]] stored as an [[Array data structure|array]] containing the characters and terminated with a ''[[null character]]'' (a character with an internal value of zero, called "NUL" in this article, not the same as the [[glyph]] zero). Alternative names are '''[[C string handling|C string]]''', which refers to the [[C (programming language)|C programming language]], and '''ASCIIZ'''<ref>{{Cite web |title=Chapter 15 - MIPS Assembly Language |url=https://people.scs.carleton.ca/~sivarama/org_book/org_book_web/solution_manual/org_soln_one/arch_book_solution_ch15.pdf |access-date=2023-10-09 |website=Carleton University}}</ref> (although C can use encodings other than [[ASCII]]).
The length of a string is found by searching for the (first) NUL. This can be slow as it takes O(''n'') ([[linear time]]) with respect to the string length. It also means that a string cannot contain a NUL (there is a NUL in memory, but it is after the last character, not "{{em|in}}" the string).
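
The linear search described above can be sketched in C as follows. The function name <code>nts_length</code> is a hypothetical one chosen for illustration; it behaves like the standard library's <code>strlen</code>, but is not taken from any particular implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (illustration only): counts the characters that
   precede the first NUL. Every character is inspected once, so the
   cost grows linearly with the length of the string. */
size_t nts_length(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    while (s[n] != '\0') {
        n++;
    }
    return n;
}
```

Because the scan stops at the first NUL, any bytes stored after an embedded NUL are not counted as part of the string.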
 
== History ==
Null-terminated strings were produced by the <code>.ASCIZ</code> directive of the [[PDP-11]] [[assembly language]]s and the <code>ASCIZ</code> directive of the [[MACRO-10]] macro assembly language for the [[PDP-10]]. These predate the development of the C programming language, but other forms of strings were often used.
 
At the time C (and the languages that it was derived from) was developed, memory was extremely limited, so using only one byte of overhead to store the length of a string was attractive. The only popular alternative at that time, usually called a "Pascal string" (a more modern term is "[[String (computer science)#Length-prefixed|length-prefixed]]"), used a leading ''byte'' to store the length of the string. This allowed the string to contain NUL and meant that finding the length of an already stored string needed only one memory access (O(1) [[constant time|(constant) time]]), but it limited string length to 255 characters (on a machine using 8-bit bytes). C designer [[Dennis Ritchie]] chose to follow the convention of null-termination to avoid the limitation on the length of a string and because maintaining the count seemed, in his experience, less convenient than using a terminator.<ref>{{cite conference | first = Dennis M. | last = Ritchie | date = April 1993 | location = Cambridge, MA | title = The development of the C language | conference = Second History of Programming Languages conference | url = https://www.bell-labs.com/usr/dmr/www/chist.html }}</ref><ref>{{cite book | first = Dennis M. | last = Ritchie | article = The development of the C language | title = History of Programming Languages | edition = 2 | editor-first1= Thomas J. | editor-last1= Bergin, Jr. | editor-first2 = Richard G. | editor-last2 = Gibson, Jr. | publisher = ACM Press | location = New York | via = Addison-Wesley (Reading, Mass) | year = 1996 | isbn = 0-201-89502-1 }}</ref>
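
For comparison, a length-prefixed string with a single leading length byte might be laid out as in the following C sketch. The type <code>pstring</code> and both function names are hypothetical and are not drawn from any actual Pascal implementation; the sketch assumes 8-bit bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustrative layout only: one length byte followed by the characters.
   A single 8-bit length byte caps the string at 255 characters. */
typedef struct {
    unsigned char len;
    char data[255];
} pstring;

/* Stores n bytes (which may include NUL). Returns 1 on success,
   or 0 if n does not fit in the one-byte length field. */
int pstring_set(pstring *p, const char *bytes, size_t n)
{
    assert(p != NULL && (bytes != NULL || n == 0));
    if (n > 255) {
        return 0;
    }
    p->len = (unsigned char)n;
    if (n > 0) {
        memcpy(p->data, bytes, n);
    }
    return 1;
}

/* The length is read directly from the prefix: constant time. */
size_t pstring_length(const pstring *p)
{
    assert(p != NULL);
    return p->len;
}
```

Here a three-byte value such as "a", NUL, "b" is stored intact and reports a length of 3, whereas a 256-byte input is rejected.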
 
This had some influence on CPU [[instruction set]] design. Some CPUs in the 1970s and 1980s, such as the [[Zilog Z80]] and the [[Digital Equipment Corporation|DEC]] [[VAX]], had dedicated instructions for handling length-prefixed strings. However, as the null-terminated string gained traction, CPU designers began to take it into account, as seen for example in IBM's decision to add the "Logical String Assist" instructions to the [[IBM ES/9000 family|ES/9000]] 520 in 1992 and the vector string instructions to the [[IBM z13 (microprocessor)|IBM z13]] in 2015.<ref name=pop>[http://publibfp.dhe.ibm.com/epubs/pdf/a227832c.pdf IBM z/Architecture Principles of Operation]</ref>
 
[[FreeBSD]] developer [[Poul-Henning Kamp]], writing in ''[[ACM Queue]]'', referred to the victory of null-terminated strings over a 2-byte (not one-byte) length as "the most expensive one-byte mistake" ever.<ref>{{citation |last=Kamp |first=Poul-Henning |date=25 July 2011 |title=The Most Expensive One-byte Mistake |journal=ACM Queue |volume=9 |number=7 |pages=40–43 |doi=10.1145/2001562.2010365 |s2cid=30282393 |issn=1542-7730|doi-access=free }}</ref>
 
== Limitations ==
 
== Character encodings ==
Null-terminated strings require that the encoding does not use a zero byte (0x00) anywhere; therefore it is not possible to store every possible [[ASCII]] or [[UTF-8]] string.<ref>{{cite web|title=UTF-8, a transformation format of ISO 10646|date=November 2003 |url=http://tools.ietf.org/html/rfc3629#section-3|access-date=19 September 2013 |last1=Yergeau |first1=François }}</ref><ref><!-- This is the encoding table provided as a resource by the Unicode consortium: http://www.unicode.org/resources/utf8.html -->{{cite web|title=Unicode/UTF-8-character table|url=http://www.utf8-chartable.de/|access-date=13 September 2013}}</ref><ref>{{cite web|last=Kuhn|first=Markus|title=UTF-8 and Unicode FAQ|url=http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8|access-date=13 September 2013}}</ref> However, it is common to store the subset of ASCII or UTF-8 – every character except NUL – in null-terminated strings. Some systems use "[[modified UTF-8]]", which encodes NUL as two non-zero bytes (0xC0, 0x80) and thus allows all possible strings to be stored. This is not allowed by the UTF-8 standard, because it is an [[UTF-8#Overlong encodings|overlong encoding]], and it is seen as a security risk. Some other byte may be used as end of string instead, like 0xFE or 0xFF, which are not used in UTF-8.
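
The two-byte substitution for NUL can be sketched in C as follows. The function name <code>encode_nul_safe</code> is hypothetical, and the sketch covers only the NUL replacement described above; it does not validate its input as UTF-8 and does not implement any other rule that a particular "modified UTF-8" variant may define.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (illustration only): copies n bytes from src to
   dst, writing each 0x00 byte as the pair 0xC0 0x80, then appends a
   terminating NUL. Returns the number of bytes written before the
   terminator, or (size_t)-1 if dst (capacity cap) is too small. */
size_t encode_nul_safe(const unsigned char *src, size_t n,
                       unsigned char *dst, size_t cap)
{
    assert(dst != NULL && (src != NULL || n == 0));
    size_t out = 0;
    for (size_t i = 0; i < n; i++) {
        size_t need = (src[i] == 0x00) ? 2 : 1;
        if (out + need >= cap) {
            return (size_t)-1;   /* must leave room for the terminator */
        }
        if (src[i] == 0x00) {
            dst[out++] = 0xC0;
            dst[out++] = 0x80;
        } else {
            dst[out++] = src[i];
        }
    }
    if (out >= cap) {
        return (size_t)-1;
    }
    dst[out] = 0x00;
    return out;
}
```

After encoding, the only zero byte in the output is the terminator, so the result can be passed to routines that expect a null-terminated string. A strict UTF-8 decoder would reject the 0xC0 0x80 pair as overlong.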
 
[[UTF-16]] uses 2-byte integers, and since either byte may be zero (and in fact ''every other'' byte is, when representing ASCII text), it cannot be stored in a null-terminated byte string. However, some languages implement a string of 16-bit [[UTF-16]] characters, terminated by a 16-bit NUL (0x0000).
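
A string terminated by a 16-bit NUL can be measured in the same way as a byte string, comparing whole 16-bit units rather than individual bytes. The following C sketch uses a hypothetical function name, <code>utf16_length</code>, and counts code units rather than characters.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (illustration only): counts the 16-bit code
   units that precede the first 0x0000 unit. A zero byte inside a
   non-zero unit (as in 0x0048, "H") does not end the string. */
size_t utf16_length(const uint16_t *s)
{
    assert(s != NULL);
    size_t n = 0;
    while (s[n] != 0x0000) {
        n++;
    }
    return n;
}
```

For the units 0x0048, 0x0069, 0x0000 ("Hi") the result is 2, even though half of the underlying bytes are zero. A character encoded as a surrogate pair counts as two units.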
 
== Improvements ==