Null-terminated string: Difference between revisions

Revision as of 18:15, 14 May 2018 by 65.23.129.243 (→Character encodings: Clarifications.)
Latest revision as of 01:23, 25 March 2025 by Eniagrom (→History: removed dead link)
(49 intermediate revisions by 36 users not shown)
{{Short description|Data structure}}
{{Redirect|CString||C string (disambiguation)}}
{{see also|String (computer science)#Null-terminated}}
{{Use dmy dates|date=August 2021}}

In [[computer programming]], a '''null-terminated string''' is a [[character string]] stored as an [[Array data structure|array]] containing the characters and terminated with a ''[[null character]]'' (a character with an internal value of zero, called "NUL" in this article, not the same as the [[glyph]] zero). Alternative names are '''[[C string handling|C string]]''', which refers to the [[C (programming language)|C programming language]], and '''ASCIIZ'''<ref>{{Cite web |title=Chapter 15 - MIPS Assembly Language |url=https://people.scs.carleton.ca/~sivarama/org_book/org_book_web/solution_manual/org_soln_one/arch_book_solution_ch15.pdf |access-date=2023-10-09 |website=Carleton University}}</ref> (although C can use encodings other than [[ASCII]]).

The length of a C string is found by searching for the (first) NUL. This can be slow as it takes O(''n'') ([[linear time]]) with respect to the string length.
It also means that a string cannot contain a NUL (there is a NUL in memory, but it is after the last character, not {{em|in}} the string).

== History ==
Null-terminated strings were produced by the <code>.ASCIZ</code> directive of the [[PDP-11]] [[assembly language]]s and the <code>ASCIZ</code> directive of the [[MACRO-10]] macro assembly language for the [[PDP-10]]. These predate the development of the C programming language, but other forms of strings were often used.

At the time C (and the languages that it was derived from) was developed, memory was extremely limited, so using only one byte of overhead to store the length of a string was attractive. The only popular alternative at that time, usually called a "Pascal string" (a more modern term is "[[String (computer science)#Length-prefixed|length-prefixed]]"), used a leading ''byte'' to store the length of the string. This allowed the string to contain NUL and meant that finding the length needed only one memory access (O(1) [[constant time|(constant) time]]), but limited string length to 255 characters. C designer [[Dennis Ritchie]] chose to follow the convention of null-termination to avoid the limitation on the length of a string and because maintaining the count seemed, in his experience, less convenient than using a terminator.<ref>{{cite book | first = Dennis M. | last = Ritchie | article = The development of the C language | title = History of Programming Languages | edition = 2 | editor-first1= Thomas J. | editor-last1= Bergin, Jr. | editor-first2 = Richard G. | editor-last2 = Gibson, Jr. | publisher = ACM Press | location = New York | via = Addison-Wesley (Reading, Mass) | year = 1996 | isbn = 0-201-89502-1 }}</ref>

This had some influence on CPU [[instruction set]] design.
Some CPUs in the 1970s and 1980s, such as the [[Zilog Z80]] and the [[Digital Equipment Corporation|DEC]] [[VAX]], had dedicated instructions for handling length-prefixed strings. However, as the null-terminated string gained traction, CPU designers began to take it into account, as seen for example in IBM's decision to add the "Logical String Assist" instructions to the [[IBM ES/9000 family|ES/9000]] 520 in 1992 and the vector string instructions to the [[IBM z13 (microprocessor)|IBM z13]] in 2015.<ref name=pop>[http://publibfp.dhe.ibm.com/epubs/pdf/a227832c.pdf IBM z/Architecture Principles of Operation]</ref>

[[FreeBSD]] developer [[Poul-Henning Kamp]], writing in ''[[ACM Queue]]'', referred to the victory of null-terminated strings over a 2-byte (not one-byte) length as "the most expensive one-byte mistake" ever.<ref>{{citation |last=Kamp |first=Poul-Henning |date=25 July 2011 |title=The Most Expensive One-byte Mistake |journal=ACM Queue |volume=9 |number=7 |pages=40–43 |doi=10.1145/2001562.2010365 |s2cid=30282393 |issn=1542-7730 |doi-access=free }}</ref>

== Implementations ==
{{Main|C string handling#Functions}}
The [[C (programming language)|C programming language]] supports null-terminated strings as the primary string type.<ref>{{cite web |title=The Development of the C Language |url=http://cm.bell-labs.com/cm/cs/who/dmr/chist.html |access-date=9 November 2011 |first=Dennis |last=Ritchie |year=2003 }}</ref> There are many [[C string handling|functions for string handling]] in the [[C standard library]]. Operations supported include:
* Determining the length of a string
* Copying one string to another
* Appending (concatenating) one string to another
* Finding the first (or last) occurrence of a character within a string
* Finding within a string the first occurrence of a character in (or not in) a given set
* Finding the first occurrence of a substring within a string
* Comparing two strings lexicographically
* Splitting a string into multiple substrings
* Formatting numeric or string values into a printable output string
* Parsing a printable string into numeric values
* Converting between [[SBCS|single-byte]] and [[wide character]] string encodings
* Converting single-byte or wide character strings to and from [[variable-width encoding|multi-byte]] character strings

== Limitations ==
While simple to implement, this representation has been prone to errors and performance problems.
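The simplicity of the representation, and the linear-time length search described in the introduction, can be illustrated with a minimal sketch in C. The function name <code>my_strlen</code> is hypothetical and chosen only for illustration; real programs would normally call the standard <code>strlen</code>.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not a library function): counts the bytes that
 * precede the first NUL. Every byte of the string is visited once,
 * so the cost grows linearly with the string length.
 */
static size_t my_strlen(const char *s)
{
    assert(s != NULL);

    size_t n = 0;
    while (s[n] != '\0')   /* stop at the terminator, which is not counted */
        n++;
    return n;
}
```

For the literal <code>"hello"</code> this sketch returns 5, while the array holding the literal occupies 6 bytes because of the terminating NUL.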
Null-termination has historically created [[computer insecurity|security problems]].<ref>{{cite journal|url= http://insecure.org/news/P55-07.txt |author=Rain Forest Puppy |title=Perl CGI problems |journal=Phrack Magazine |publisher=artofhacking.com |date=9 September 1999 |volume=9 |issue=55 |page=7 |access-date=3 January 2016}}</ref> A NUL inserted into the middle of a string will truncate it unexpectedly.<ref>{{Cite web|url=https://security.stackexchange.com/questions/48187/null-byte-injection-on-php|title = Null byte injection on PHP?}}</ref> A common bug was to not allocate the additional space for the NUL, so it was written over adjacent memory. Another was to not write the NUL at all, which was often not detected during testing because the block of memory already contained zeros. Due to the expense of finding the length, many programs did not check it before copying a string to a fixed-size [[Data buffer|buffer]], causing a [[buffer overflow]] if it was too long.

The inability to store a zero requires that text and binary data be kept distinct and handled by different functions (with the latter requiring the length of the data to also be supplied). This can lead to code redundancy and errors when the wrong function is used.

The speed problems with finding the length can usually be mitigated by combining it with another operation that is O(''n'') anyway, such as in <code>[[strlcpy]]</code>. However, this does not always result in an intuitive [[API]].
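Two of the problems above can be sketched in C under stated assumptions. <code>copy_bounded</code> is a hypothetical helper modelled on the documented behaviour of <code>strlcpy</code> (it is not that function's actual implementation): it never writes past the destination buffer and reports the source length from the same pass, so the caller can detect truncation without a separate length search. <code>has_embedded_nul</code>, also hypothetical, reports whether a buffer of known length contains a zero byte, which is the case in which C string functions would silently truncate the data.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper modelled on the documented behaviour of strlcpy:
 * copy at most dst_size - 1 bytes, always terminate dst when dst_size > 0,
 * and return the full length of src so the caller can detect truncation.
 */
static size_t copy_bounded(char *dst, size_t dst_size, const char *src)
{
    assert(dst != NULL && src != NULL);

    size_t i = 0;

    /* Copy while there is room, reserving one byte for the terminator. */
    for (; src[i] != '\0' && i + 1 < dst_size; i++)
        dst[i] = src[i];
    if (dst_size > 0)
        dst[i] = '\0';

    /* Finish measuring src; together with the copy this is one O(n) pass. */
    while (src[i] != '\0')
        i++;
    return i;
}

/*
 * Hypothetical helper: returns nonzero if the n-byte buffer contains a
 * zero byte, meaning C string functions would see only part of the data.
 */
static int has_embedded_nul(const char *data, size_t n)
{
    assert(data != NULL);
    return memchr(data, '\0', n) != NULL;
}
```

Copying <code>"hello"</code> into a 4-byte buffer with this sketch stores <code>"hel"</code> and returns 5; a return value greater than or equal to the buffer size signals that the result was truncated.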
== Character encodings ==
Null-terminated strings require that the encoding does not use a zero byte (0x00) anywhere; therefore it is not possible to store every possible [[ASCII]] or [[UTF-8]] string.<ref>{{cite web|title=UTF-8, a transformation format of ISO 10646|date=November 2003 |url=http://tools.ietf.org/html/rfc3629#section-3|access-date=19 September 2013 |last1=Yergeau |first1=François }}</ref><ref><!-- This is the encoding table provided as a resource by the Unicode consortium: http://www.unicode.org/resources/utf8.html -->{{cite web|title=Unicode/UTF-8-character table|url=http://www.utf8-chartable.de/|access-date=13 September 2013}}</ref><ref>{{cite web|last=Kuhn|first=Markus|title=UTF-8 and Unicode FAQ|url=http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8|access-date=13 September 2013}}</ref> However, it is common to store the subset of ASCII or UTF-8 – every character except NUL – in null-terminated strings. Some systems use "[[modified UTF-8]]" which encodes NUL as two non-zero bytes (0xC0, 0x80) and thus allows all possible strings to be stored. This is not allowed by the UTF-8 standard, because it is an [[UTF-8#Overlong encodings|overlong encoding]], and it is seen as a security risk. Some other byte may be used as end of string instead, like 0xFE or 0xFF, which are not used in UTF-8.

[[UTF-16]] uses 2-byte integers and, as either byte may be zero (and in fact ''every other'' byte is, when representing ASCII text), cannot be stored in a null-terminated byte string. However, some languages implement a string of 16-bit [[UTF-16]] characters, terminated by a 16-bit NUL (0x0000).

== Improvements ==
Many attempts to make C string handling less error-prone have been made. One strategy is to add safer functions such as <code>[[strdup]]</code> and <code>[[strlcpy]]</code>, whilst [[C standard library#Buffer overflow vulnerabilities|deprecating the use of unsafe functions]] such as <code>[[gets()|gets]]</code>. Another is to add an object-oriented wrapper around C strings so that only safe calls can be done. However, it is possible to call the unsafe functions anyway.

Most modern libraries replace C strings with a structure containing a 32-bit or larger length value (far more than were ever considered for length-prefixed strings), and often add another pointer, a reference count, and even a NUL to speed up conversion back to a C string. Memory is far larger now, such that if the addition of 3 (or 16, or more) bytes to each string is a real problem the software will have to be dealing with so many small strings that some other storage method will save even more memory (for instance there may be so many duplicates that a [[hash table]] will use less memory). Examples include the [[C++]] [[Standard Template Library]] <code>[[String (C++)|std::string]]</code>, the [[Qt (toolkit)|Qt]] <code>QString</code>, the [[Microsoft Foundation Class Library|MFC]] <code>CString</code>, and the C-based implementation <code>CFString</code> from [[Core Foundation]] as well as its [[Objective-C]] sibling <code>NSString</code> from [[Foundation Kit|Foundation]], both by Apple. More complex structures may also be used to store strings such as the [[rope (computer science)|rope]].

==See also==
* [[Empty string]]
* [[Sentinel value]]

==References==

{{CProLang}}
{{Data types}}

[[Category:String data structures]]