Null-terminated string: Difference between revisions

Revision as of 07:57, 17 June 2022 by 2a01:119f:253:7000:cd3b:9d79:d8e:6a0f (→Character encodings); latest revision as of 01:23, 25 March 2025 by Eniagrom (→History: removed dead link)
(15 intermediate revisions by 14 users not shown)
{{Short description\|Data structure}}
{{Redirect\|CString\|\|C string (disambiguation)}}
{{see also\|String (computer science)#Null-terminated}}
{{Use dmy dates\|date=August 2021}}
In [[computer programming]], a '''null-terminated string''' is a [[character string]] stored as an [[Array data structure\|array]] containing the characters and terminated with a ''[[null character]]'' (a character with an internal value of zero, called "NUL" in this article, not the same as the [[glyph]] zero). Alternative names are '''[[C string handling\|C string]]''', which refers to the [[C (programming language)\|C programming language]], and '''ASCIIZ'''<ref>{{Cite web \|title=Chapter 15 - MIPS Assembly Language \|url=https://people.scs.carleton.ca/~sivarama/org_book/org_book_web/solution_manual/org_soln_one/arch_book_solution_ch15.pdf \|access-date=2023-10-09 \|website=Carleton University}}</ref> (although C can use encodings other than [[ASCII]]).

The length of a string is found by searching for the (first) NUL. This can be slow as it takes O(''n'') ([[linear time]]) with respect to the string length.
It also means that a string cannot contain a NUL (there is a NUL in memory, but it is after the last character, not {{em\|in}} the string).

== History ==
Null-terminated strings were produced by the <code>.ASCIZ</code> directive of the [[PDP-11]] [[assembly language]]s and the <code>ASCIZ</code> directive of the [[MACRO-10]] macro assembly language for the [[PDP-10]]. These predate the development of the C programming language, but other forms of strings were often used.

At the time C (and the languages that it was derived from) was developed, memory was extremely limited, so using only one byte of overhead to store the length of a string was attractive. The only popular alternative at that time, usually called a "Pascal string" (a more modern term is "[[String (computer science)#Length-prefixed\|length-prefixed]]"), used a leading ''byte'' to store the length of the string. This allowed the string to contain NUL and meant that finding the length needed only one memory access (O(1) [[constant time\|(constant) time]]), but limited string length to 255 characters.

C designer [[Dennis Ritchie]] chose to follow the convention of null-termination to avoid the limitation on the length of a string and because maintaining the count seemed, in his experience, less convenient than using a terminator.<ref>{{cite conference \| first = Dennis M. \| last = Ritchie \| date = April 1993 \| location = Cambridge, MA \| title = The development of the C language \| conference = Second History of Programming Languages conference \| url = https://www.bell-labs.com/usr/dmr/www/chist.html }}</ref><ref>{{cite book \| first = Dennis M. \| last = Ritchie \| article = The development of the C language \| title = History of Programming Languages \| edition = 2 \| editor-first1= Thomas J. \| editor-last1= Bergin, Jr. \| editor-first2 = Richard G. \| editor-last2 = Gibson, Jr.
\| publisher = ACM Press \| location = New York \| via = Addison-Wesley (Reading, Mass) \| year = 1996 \| isbn = 0-201-89502-1 }}</ref>

This had some influence on CPU [[instruction set]] design. Some CPUs in the 1970s and 1980s, such as the [[Zilog Z80]] and the [[Digital Equipment Corporation\|DEC]] [[VAX]], had dedicated instructions for handling length-prefixed strings. However, as the null-terminated string gained traction, CPU designers began to take it into account, as seen for example in IBM's decision to add the "Logical String Assist" instructions to the [[IBM ES/9000 family\|ES/9000]] 520 in 1992 and the vector string instructions to the [[IBM z13 (microprocessor)\|IBM z13]] in 2015.<ref name=pop>[http://publibfp.dhe.ibm.com/epubs/pdf/a227832c.pdf IBM z/Architecture Principles of Operation]</ref>

[[FreeBSD]] developer [[Poul-Henning Kamp]], writing in ''[[ACM Queue]]'', referred to the victory of null-terminated strings over a 2-byte (not one-byte) length as "the most expensive one-byte mistake" ever.<ref>{{citation \|last=Kamp \|first=Poul-Henning \|date=25 July 2011 \|title=The Most Expensive One-byte Mistake \|journal=ACM Queue \|volume=9 \|number=7 \|pages=40–43 \|doi=10.1145/2001562.2010365 \|s2cid=30282393 \|issn=1542-7730\|doi-access=free }}</ref>

== Limitations ==

== Character encodings ==
Null-terminated strings require that the encoding does not use a zero byte (0x00) anywhere; therefore it is not possible to store every possible [[ASCII]] or [[UTF-8]] string.<ref>{{cite web\|title=UTF-8, a transformation format of ISO 10646\|date=November 2003 \|url=http://tools.ietf.org/html/rfc3629#section-3\|access-date=19 September 2013 \|last1=Yergeau \|first1=François }}</ref><ref><!-- This is the encoding table provided as a resource by the Unicode consortium: http://www.unicode.org/resources/utf8.html -->{{cite web\|title=Unicode/UTF-8-character
table\|url=http://www.utf8-chartable.de/\|access-date=13 September 2013}}</ref><ref>{{cite web\|last=Kuhn\|first=Markus\|title=UTF-8 and Unicode FAQ\|url=http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8\|access-date=13 September 2013}}</ref> However, it is common to store the subset of ASCII or UTF-8 – every character except NUL – in null-terminated strings. Some systems use "[[modified UTF-8]]", which encodes NUL as two non-zero bytes (0xC0, 0x80) and thus allows all possible strings to be stored. This is not allowed by the UTF-8 standard, because it is an [[UTF-8#Overlong encodings\|overlong encoding]], and it is seen as a security risk. Some other byte may be used as end of string instead, like 0xFE or 0xFF, which are not used in UTF-8.

[[UTF-16]] uses 2-byte integers, and because either byte may be zero (and in fact ''every other'' byte is, when representing ASCII text), it cannot be stored in a null-terminated byte string. However, some languages implement a string of 16-bit [[UTF-16]] characters, terminated by a 16-bit NUL (0x0000).
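
The effect of these encodings on a byte-wise NUL search can be sketched in C. This is a minimal illustration, not taken from any particular implementation: the function names <code>byte_string_length</code> and <code>utf16_string_length</code> and the sample arrays are hypothetical, and the UTF-16 bytes are written out in little-endian order by assumption.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* "Hi" as UTF-16 little-endian bytes, followed by a 16-bit NUL.
   Every other byte is zero because the text is pure ASCII. */
const unsigned char hi_utf16le[] = { 0x48, 0x00, 0x69, 0x00, 0x00, 0x00 };

/* The same text as 16-bit code units, terminated by 0x0000. */
const uint16_t hi_utf16[] = { 0x0048, 0x0069, 0x0000 };

/* "A", NUL, "B" in modified UTF-8: the embedded NUL is stored as the
   two non-zero bytes 0xC0 0x80, so only the final byte is zero. */
const unsigned char a_nul_b_mutf8[] = { 0x41, 0xC0, 0x80, 0x42, 0x00 };

/* Hypothetical helper: length found by searching for the first zero byte. */
size_t byte_string_length(const unsigned char *s)
{
    return strlen((const char *)s);
}

/* Hypothetical helper: length in 16-bit code units, found by searching
   for the first 16-bit NUL rather than the first zero byte. */
size_t utf16_string_length(const uint16_t *s)
{
    size_t n = 0;
    while (s[n] != 0x0000)
        n++;
    return n;
}
```

Under these assumptions, <code>byte_string_length(hi_utf16le)</code> stops at the zero byte inside the first character and returns 1, while <code>utf16_string_length(hi_utf16)</code> returns 2. <code>byte_string_length(a_nul_b_mutf8)</code> returns 4, because the overlong pair keeps the embedded NUL from ending the byte search.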