String (computer science): Difference between revisions

Content deleted Content added
Fixed typos
Tags: Reverted references removed Visual edit Mobile edit Mobile web edit
top: cf. talk
 
(25 intermediate revisions by 21 users not shown)
Line 1:
{{Short description|Sequence of characters, data type}}
[[File:String example.png|alt=Diagram of String data in computing. Shows the word "example" with each letter in a separate box. The word "String" is above, referring to the entire sentence. The label "Character" is below and points to an individual box.|thumb|Strings are typically made up of [[character (computing)|characters]], and are often used to store human-readable data, such as words or sentences.]]
 
In [[computer programming]], a '''string''' is traditionally a [[sequence]] of [[character (computing)|characters]], either as a [[literal (computer programming)|literal constant]] or as some kind of [[Variable (computer science)|variable]]. The latter may allow its elements to be [[Immutable object|mutated]] and the length changed, or it may be fixed (after creation). A string is generally considered as a data type and is often implemented as an [[array data structure]] of bytes[[byte]]s (or words[[word (computer architecture)|word]]s) that stores a sequence of elements, typically characters, using some [[character encoding]]. More general, ''Stringstring'' may also denote morea general arrayssequence (or other sequence[[List (orabstract data type)|list]]) of data typesother andthan structuresjust characters.
 
Depending on the programming language and precise data type used, a [[variable (programming)|variable]] declared to be a string may either cause storage in memory to be statically allocated for a predetermined maximum length or employ [[dynamic allocation]] to allow it to hold a variable number of elements.
 
When a string appears literally in [[source code]], it is known as a [[string literal]] or an anonymous string.<ref>{{cite web|url=http://www.acsu.buffalo.edu/~fineberg/mfc158/week10lecture.htm|title=Introduction To Java – MFC 158 G|quote=String literals (or constants) are called ‘anonymous strings’|url-status=live|archive-url=https://web.archive.org/web/20160303233357/http://www.acsu.buffalo.edu/~fineberg/mfc158/week10lecture.htm|archive-date=2016-03-03}}</ref>
 
In [[formal languageslanguage]]s, which are used in [[mathematical logic]] and [[theoretical computer science]], a string is a finite sequence of symbols[[symbol (formal)|symbol]]s that are chosen from a [[set (mathematics)|set]] called an [[alphabet (computer science)|alphabet]].
 
== Purpose ==
 
A primary purpose of strings is to store human-readable text, like words and sentences. Strings are used to communicate information from a computer program to the user of the program.<ref>{{cite web |url=https://users.cs.utah.edu/~germain/PPS/Topics/strings.html |website=University of Utah, Kahlert School of Computing |title=Strings
|first=H. James |last=de St. Germain }}</ref> A program may also accept string input from its user. Further, strings may store data expressed as characters yet not intended for human reading.
 
Example strings and their purposes:
 
* A message like "<code>file upload complete</code>" is a string that software shows to [[end usersuser]]s. In the program's [[source code]], this message would likely appear as a [[string literal]].
* User-entered text, like "<code>I got a new job today</code>" as a status update on a [[social media]] service. Instead of a string literal, the software would likely store this string in a [[database]].
* Alphabetical data, like "<code>AGATGCCGT</code>" representing nucleic acid sequences of [[DNA]].<ref>{{cite web |url=https://plant-breeding-genomics.extension.org/dna-as-a-biochemical-entity-and-data-string/ |title=DNA as a Biochemical Entity and Data String |date=November 14, 2019 |first1=David M. |last1=Francis |first2=Heather L. |last2=Merk }}</ref>
* Computer settings or parameters, like "<code>?action=edit</code>" as a URL [[query string]]. Often these are intended to be somewhat human-readable, though their primary purpose is to communicate to computers.
 
The term string may also designate a sequence of data or computer records other than characters {{mdash}} like a "string of bits[[bit]]s" {{mdash}} but when used without qualification it refers to strings of characters.<ref name=Burchfield1986 />
 
==History==
 
Use of the word "string" to mean any items arranged in a line, series or succession dates back centuries.<ref>{{cite encyclopedia |encyclopedia=The Oxford English Dictionary |volume=X |publisher=Oxford at the Clarendon Press |year=1933 |title=string }}</ref><ref>{{cite web |title=string (n.) |url=https://www.etymonline.com/search?q=string |website=Online Etymology Dictionary }}</ref> In 19th-Centurycentury typesetting, [[Compositor (typesetting)|compositors]] used the term "string" to denote a length of type printed on paper; the string would be measured to determine the compositor's pay.<ref>{{cite encyclopedia |encyclopedia=The Century Dictionary |author-link1=William Dwight Whitney |author-link2=Benjamin Eli Smith |first1=William Dwight |last1=Whitney |first2=Benjamin E. |last2=Smith |publisher=The Century Company |___location=New York |page=5994 |title=string }}</ref><ref name=Burchfield1986 /><ref>{{cite news |newspaper=[[Milwaukee Journal Sentinel|Milwaukee Sentinel]] |date=January 11, 1898 |title=Old Union's Demise |page=3 }}</ref>
 
Use of the word "string" to mean "a sequence of symbols or linguistic elements in a definite order" emerged from mathematics, [[symbolic logic]], and [[linguistic theory]] to speak about the [[formal system|formal]] behavior of symbolic systems, setting aside the symbols' meaning.<ref name=Burchfield1986>{{cite encyclopedia |title=string |encyclopedia=A Supplement to the Oxford English Dictionary |year=1986 |last=Burchfield |first=R.W. |publisher=Oxford at the Clarendon Press |author-link=Robert Burchfield }}</ref>
 
For example, logician [[C. I. Lewis]] wrote in 1918:<ref>{{cite book |title=A survey of symbolic logic |___location=Berkeley |publisher=University of California Press |year=1918 |page=355 |last=Lewis |first=C.I. |author-link=C.I. Lewis |url=https://archive.org/details/asurveyofsymboli00lewiuoft/page/355/mode/1up }}</ref>
For example, logician C. I. Lewis wrote in 1918:
<blockquote>
A mathematical system is any set of strings of recognisable marks in which some of the strings are taken initially and the remainder derived from these by operations performed according to rules which are independent of any meaning assigned to the marks. That a system should consist of 'marks' instead of sounds or odours is immaterial.
</blockquote>
 
According to [[Jean E. Sammet]], "the first realistic string handling and pattern matching language" for computers was [[COMIT]] in the 1950s, followed by the [[SNOBOL]] language of the early 1960s.<ref>{{cite journal |url=https://redirect.cs.umbc.edu/courses/undergraduate/331/resources/papers/sammet1972.pdf |journal=Communications of the ACM |first=Jean E. |last=Sammet |title=Programming Languages: History and Future |date=July 1972 |volume=15 |number=7 |doi=10.1145/361454.361485 |s2cid=2003242 }}</ref>
 
== String datatypes ==
Line 42 ⟶ 43:
 
=== String length ===
Although formal strings can have an arbitrary finite length, the length of strings in real languages is often constrained to an artificial maximum. In general, there are two types of string datatypes: ''fixed-length strings'', which have a fixed maximum length to be determined at [[compile time]] and which use the same amount of memory whether this maximum is needed or not, and ''variable-length strings'', whose length is not arbitrarily fixed and which can use varying amounts of memory depending on the actual requirements at run time (see [[Memory management]]). Most strings in modern [[programming languages]] are variable-length strings. Of course, even variable-length strings are limited in length by the sizeamount of available [[computer memory]]. The string length can be stored as a separate integer (which may put another artificial limit on the length) or implicitly through a termination character, usually a character value with all bits zero such as in C programming language. See also "[[#Null-terminated|Null-terminated]]" below.
 
=== Character encoding ===
Line 57 ⟶ 58:
Strings are typically implemented as [[array data type|arrays]] of bytes, characters, or code units, in order to allow fast access to individual units or substrings—including characters when they have a fixed length. A few languages such as [[Haskell (programming language)|Haskell]] implement them as [[linked list]]s instead.
 
A lot of high-level languages provide strings as a primitive data type, such as [[JavaScript]] and [[PHP]], while most others provide them as a composite data type, some with special language support in writing literals, for example, [[Java (programming language)|Java]] and [[C Sharp (programming language)|C#]].
Some languages, such as [[Prolog]] and [[Erlang (programming language)|Erlang]], avoid implementing a dedicated string datatype at all, instead adopting the convention of representing strings as lists of character codes.
 
Some languages, such as [[C (programming language)|C]], [[Prolog]] and [[Erlang (programming language)|Erlang]], avoid implementing a dedicated string datatype at all, instead adopting the convention of representing strings as lists of character codes. Even in programming languages having a dedicated string type, string can usually be iterated as a sequence character codes, like lists of integers or other values.
 
=== Representations ===
Line 193 ⟶ 196:
While character strings are very common uses of strings, a string in computer science may refer generically to any sequence of homogeneously typed data. A [[bit string]] or [[byte string]], for example, may be used to represent non-textual [[binary data]] retrieved from a communications medium. This data may or may not be represented by a string-specific datatype, depending on the needs of the application, the desire of the programmer, and the capabilities of the programming language being used. If the programming language's string implementation is not [[8-bit clean]], data corruption may ensue.
 
C programmers draw a sharp distinction between a "string", aka a "string of characters", which by definition is always null terminated, vs. a "bytearray string"of or "pseudo stringcharacters" which may be stored in the same array but is often not null terminated.
Using [[C string handling]] functions on such aan "bytearray string"of characters often seems to work, but later leads to [[#Security concerns|security problems]].<ref>
[https://www.sudo.ws/todd/papers/strlcpy.html "strlcpy and strlcat - consistent, safe, string copy and concatenation."] {{webarchive|url=https://web.archive.org/web/20160313163321/https://www.sudo.ws/todd/papers/strlcpy.html |date=2016-03-13 }}
</ref><ref>
Line 251 ⟶ 254:
{{See also|Tuple}}
 
Let Σ be a [[finite set]] of distinct, unambiguous symbols (alternatively called characters), called the [[Alphabet (formal languages)|alphabet]]. A '''string''' (or '''word'''<ref>{{cite book |last1=Fletcher |first1=Peter |last2=Hoyle |first2=Hughes |last3=Patty |first3=C. Wayne |year=1991 |title=Foundations of Discrete Mathematics |publisher=PWS-Kent |isbn= 0-53492-373-9 |page=114 |quote=Let Σ be an alphabet. A '''nonempty word over Σ''' is a finite sequence with ___domain ''I<sub>n</sub>'' (for some ''n'' ∈ ℕ) and codomain Σ.}}</ref> or '''expression'''<ref>{{cite book |last=Shoenfield |first=Joseph R. |authorlink=Joseph R. Shoenfield |year=2010 |origyearorig-date=1967 |title=Mathematical Logic |edition=Reprint |publisher=CRC Press |isbn=978-156881135-2 |page=2 |quote=Any finite sequence of symbols of a language is called an ''expression'' of that language.}}</ref>) over Σ is any finite [[sequence]] of symbols from Σ.<ref name="partee">{{cite book |author1=Barbara H. Partee |author2=Alice ter Meulen|author2-link=Alice ter Meulen |author3=Robert E. Wall |title=Mathematical Methods in Linguistics |publisher=Kluwer |year=1990}}</ref> For example, if Σ = {0,&nbsp;1}, then ''01011'' is a string over Σ.
 
The ''[[length]]'' of a string ''s'' is the number of symbols in ''s'' (the length of the sequence) and can be any [[non-negative integer]]; it is often denoted as |''s''|. The ''[[empty string]]'' is the unique string over Σ of length 0, and is denoted ''ε'' or ''λ''.<ref name="partee"/><ref>{{cite book| author=John E. Hopcroft, Jeffrey D. Ullman| title=Introduction to Automata Theory, Languages, and Computation| year=1979| publisher=Addison-Wesley| isbn=0-201-02988-X| url-access=registration| url=https://archive.org/details/introductiontoau00hopc}} Here: sect.1.1, p.1</ref>