String (computer science)

A literal string (or string of characters) is an aggregate data type used in most programming languages to represent text.

The term string in a broader sense refers to a sequence of entities; for example, tokens in a language grammar, a sequence of states in automata or a representation of DNA.

Representation in programming languages

A common representation is an array of characters. The length can be stored implicitly, by using a special terminating character (often NUL, ASCII code 0), as the C programming language does, or explicitly, for example by prefixing the string with an integer value (the convention used in Pascal).

Here is an example of a NUL-terminated string stored in a 10-byte buffer, along with the ASCII code of each byte in hexadecimal:

F	R	A	N	K	NUL	k	f	f	w
46	52	41	4E	4B	00	6B	66	66	77

The length of the string in the above example is 5 characters, but note that it occupies 6 bytes, because the terminator must be stored as well. Characters after the terminator do not form part of the representation; they may be either part of another string or just garbage.
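
The two length conventions can be sketched in Python using the bytes of the example buffer. This is only an illustration: the function names are made up for this sketch, and the one-byte length prefix is an assumed width, not the layout of any particular Pascal implementation.

```python
# The 10-byte buffer from the example, written as hexadecimal ASCII codes.
buffer = bytes.fromhex("46 52 41 4E 4B 00 6B 66 66 77")


def nul_terminated_length(buf: bytes) -> int:
    """Implicit length: count the bytes before the first NUL (0x00)."""
    return buf.index(0)


def encode_length_prefixed(text: bytes) -> bytes:
    """Explicit length: store the length in one leading byte (assumed width)."""
    if len(text) > 255:
        raise ValueError("a one-byte prefix cannot describe more than 255 bytes")
    return bytes([len(text)]) + text


def decode_length_prefixed(buf: bytes) -> bytes:
    """Read the prefix, then take exactly that many bytes after it."""
    length = buf[0]
    return buf[1:1 + length]


length = nul_terminated_length(buffer)    # 5 characters
text = buffer[:length]                    # b"FRANK"; the bytes after NUL are ignored
stored_size = length + 1                  # 6 bytes including the terminator
prefixed = encode_length_prefixed(text)   # b"\x05FRANK"
```

Finding the length of the NUL-terminated form requires scanning the buffer, while the prefixed form reads it directly but limits the maximum length to what the prefix can hold.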

Of course, other representations are possible. Using trees and lists makes certain string operations, such as character insertions or deletions, more efficient.
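
As one possible illustration of a tree-based representation, the sketch below stores a string as a binary tree whose leaves hold short pieces of text (a structure often called a rope). The class and function names are assumptions of this sketch, and it does no rebalancing, so it shows the idea rather than a tuned implementation.

```python
class Leaf:
    """A leaf holds a contiguous piece of text."""

    def __init__(self, text):
        self.text = text
        self.length = len(text)


class Node:
    """An inner node joins two subtrees and caches their total length."""

    def __init__(self, left, right):
        self.left = left
        self.right = right
        self.length = left.length + right.length


def split(rope, index):
    """Return two ropes: the first `index` characters and the remainder."""
    if isinstance(rope, Leaf):
        return Leaf(rope.text[:index]), Leaf(rope.text[index:])
    if index <= rope.left.length:
        head, tail = split(rope.left, index)
        return head, Node(tail, rope.right)
    head, tail = split(rope.right, index - rope.left.length)
    return Node(rope.left, head), tail


def insert(rope, index, text):
    """Insert `text` at `index`; only the leaf at the split point is copied."""
    head, tail = split(rope, index)
    return Node(Node(head, Leaf(text)), tail)


def to_str(rope):
    """Flatten the tree back into an ordinary string."""
    if isinstance(rope, Leaf):
        return rope.text
    return to_str(rope.left) + to_str(rope.right)


rope = Node(Leaf("FRA"), Leaf("NK"))
longer = insert(rope, 3, "-")   # to_str(longer) == "FRA-NK"
```

An insertion into a flat array has to shift every character after the insertion point, whereas here the work is bounded by the depth of the tree plus the size of one leaf, provided the tree is kept reasonably balanced.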

String manipulation

The two most common operations on strings are searching and sorting. Because there are so many practical uses for strings there are many associated algorithms with various tradeoffs.
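
A minimal Python sketch of both operations follows. The function name `naive_find` and the sample data are illustrative assumptions; the search shown is the simplest approach, not the fastest.

```python
def naive_find(text: str, pattern: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1.

    Tries every starting position, so the worst case costs roughly
    len(text) * len(pattern) character comparisons.
    """
    for start in range(len(text) - len(pattern) + 1):
        if text[start:start + len(pattern)] == pattern:
            return start
    return -1


position = naive_find("FRANK", "ANK")   # 2

names = ["frank", "Bob", "alice"]

# Sorting by character code puts uppercase letters before lowercase ones.
by_code = sorted(names)                         # ['Bob', 'alice', 'frank']

# Sorting with a case-insensitive key gives a dictionary-like order.
by_casefold = sorted(names, key=str.casefold)   # ['alice', 'Bob', 'frank']
```

The two sort orders differ for the same input, which is one small example of the tradeoffs involved: comparing raw character codes is simple and fast, while orderings that match human expectations need extra work per comparison.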

Advanced string algorithms often employ complex mechanisms and data structures, among them suffix trees and finite state machines.
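
As a hedged example of the finite-state-machine style of searching, the sketch below follows the Knuth-Morris-Pratt approach, which is chosen here for illustration and is not named in the text above. The variable `k` acts as the current state (how many pattern characters have been matched), and the failure table says which state to fall back to on a mismatch. Function names are assumptions of this sketch.

```python
def build_failure_table(pattern: str) -> list:
    """table[i] is the length of the longest proper prefix of
    pattern[:i + 1] that is also a suffix of it."""
    table = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = table[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        table[i] = k
    return table


def kmp_find(text: str, pattern: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    if not pattern:
        return 0
    table = build_failure_table(pattern)
    k = 0  # current state: number of pattern characters matched so far
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = table[k - 1]  # fall back without re-reading the text
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            return i - k + 1
    return -1


table = build_failure_table("abab")    # [0, 0, 1, 2]
found = kmp_find("abababca", "abca")   # 4
```

Each text character is read once, so the search time grows linearly with the combined length of the text and the pattern, at the cost of building the table first.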

String utilities

Strings are such a useful datatype that several languages have been designed in order to make string processing applications easy to write. Examples include:

Many UNIX utilities perform simple string manipulations and can be used to easily program some powerful string processing algorithms. Files and finite streams may be viewed as strings.
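
The idea of treating a stream as a string can be sketched in Python. The `grep` function below is a hypothetical helper loosely modeled on the UNIX utility of the same name; it matches plain substrings only, not regular expressions, and an in-memory stream stands in for a file.

```python
import io


def grep(pattern: str, stream) -> list:
    """Return the lines of a text stream that contain pattern, without newlines."""
    return [line.rstrip("\n") for line in stream if pattern in line]


stream = io.StringIO("frank\nalice\nfrancis\n")
matches = grep("fran", stream)   # ['frank', 'francis']

# Reading the whole stream at once yields a single ordinary string,
# so any string operation applies to it directly.
stream.seek(0)
whole = stream.read()
occurrences = whole.count("fran")   # 2
```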

Strings in theoretical computer science

In theoretical computer science, one starts with a non-empty finite set called the alphabet; strings are then defined as finite sequences of elements from the alphabet, including the empty sequence. The set of all strings over a given alphabet, together with string concatenation, then forms a monoid, in fact a free monoid. Formal languages, the central objects of study, are defined as subsets of this monoid.
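
The monoid laws can be spot-checked on a small sample of strings. The sketch below assumes a two-letter alphabet and only examines strings of length at most 2, so it illustrates the laws rather than proving them; all names are chosen for this example.

```python
from itertools import product

ALPHABET = ("a", "b")   # an assumed two-symbol alphabet
EMPTY = ""              # the empty string, the identity element


def strings_up_to(max_length: int) -> list:
    """All strings over ALPHABET with length 0 through max_length."""
    return [
        "".join(symbols)
        for n in range(max_length + 1)
        for symbols in product(ALPHABET, repeat=n)
    ]


samples = strings_up_to(2)   # 1 + 2 + 4 = 7 strings

# Identity law: concatenating the empty string changes nothing.
identity_holds = all(EMPTY + s == s == s + EMPTY for s in samples)

# Associativity law: the grouping of concatenations does not matter.
associativity_holds = all(
    (x + y) + z == x + (y + z)
    for x in samples
    for y in samples
    for z in samples
)

# A formal language is any subset of the strings, for example
# the sampled strings of even length.
even_length = {s for s in samples if len(s) % 2 == 0}
```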

See String