Null-terminated string: Difference between revisions

Content deleted Content added
m Disambiguating links to C string (link changed to C string handling) using DisamAssist.
Citation bot (talk | contribs)
Add: date, authors 1-1. Removed URL that duplicated identifier. | Use this bot. Report bugs. | Suggested by Dominic3203 | Category:String data structures | #UCB_Category 1/16
Line 15:
This had some influence on CPU [[instruction set]] design. Some CPUs in the 1970s and 1980s, such as the [[Zilog Z80]] and the [[Digital Equipment Corporation|DEC]] [[VAX]], had dedicated instructions for handling length-prefixed strings. However, as the null-terminated string gained traction, CPU designers began to take it into account, as seen for example in IBM's decision to add the "Logical String Assist" instructions to the [[IBM ES/9000 family|ES/9000]] 520 in 1992 and the vector string instructions to the [[IBM z13 (microprocessor)|IBM z13]] in 2015.<ref name=pop>[http://publibfp.dhe.ibm.com/epubs/pdf/a227832c.pdf IBM z/Architecture Principles of Operation]</ref>
 
[[FreeBSD]] developer [[Poul-Henning Kamp]], writing in ''[[ACM Queue]]'', referred to the victory of null-terminated strings over a 2-byte (not one-byte) length as "the most expensive one-byte mistake" ever.<ref>{{citation |last=Kamp |first=Poul-Henning |date=25 July 2011 |title=The Most Expensive One-byte Mistake |journal=ACM Queue |volume=9 |number=7 |pages=40–43 |doi=10.1145/2001562.2010365 |s2cid=30282393 |issn=1542-7730|url=http://queue.acm.org/detail.cfm?id=2010365 |doi-access=free }}</ref>
 
== Limitations ==
Line 27:
 
== Character encodings ==
Null-terminated strings require that the encoding does not use a zero byte (0x00) anywhere; therefore it is not possible to store every possible [[ASCII]] or [[UTF-8]] string.<ref>{{cite web|title=UTF-8, a transformation format of ISO 10646|date=November 2003 |url=http://tools.ietf.org/html/rfc3629#section-3|access-date=19 September 2013 |last1=Yergeau |first1=François }}</ref><ref><!-- This is the encoding table provided as a resource by the Unicode consortium: http://www.unicode.org/resources/utf8.html -->{{cite web|title=Unicode/UTF-8-character table|url=http://www.utf8-chartable.de/|access-date=13 September 2013}}</ref><ref>{{cite web|last=Kuhn|first=Markus|title=UTF-8 and Unicode FAQ|url=http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8|access-date=13 September 2013}}</ref> However, it is common to store the subset of ASCII or UTF-8 – every character except NUL – in null-terminated strings. Some systems use "[[modified UTF-8]]" which encodes NUL as two non-zero bytes (0xC0, 0x80) and thus allow all possible strings to be stored. This is not allowed by the UTF-8 standard, because it is an [[UTF-8#Overlong encodings|overlong encoding]], and it is seen as a security risk. Some other byte may be used as end of string instead, like 0xFE or 0xFF, which are not used in UTF-8.
 
[[UTF-16]] uses 2-byte integers and as either byte may be zero (and in fact ''every other'' byte is, when representing ASCII text), cannot be stored in a null-terminated byte string. However, some languages implement a string of 16-bit [[UTF-16]] characters, terminated by a 16-bit NUL (0x0000).