Stringa (linguaggi formali)

Per altri significati del termine di veda la voce: stringa

In varie discipline della matematica e dell'informatica, per stringa si intende una sequenza composta da un certo numero di oggetti che ci si aspetta venga sottoposta ad elaborazioni come analisi, composizioni e trasformazioni in altre stringhe o strutture discrete come grafi o configurazioni numeriche, senza modificare gli oggetti componenti.

semplici (come bits, caratteri o simboli) o da trattare come se fossero semplici (come parole, espressioni, frammenti di testo o contrassegni di oggetti compositi ma che non si vogliono analizzare o decomporre).

Dal punto di vista della costituzione si distinguono innanzi tutto le stringhe di oggetti semplici: stringhe di bit (stringhe binarie), stringhe composte da caratteri (stringhe letterali), stringhe di simboli (stringhe simboliche). Tra le stringhe di oggetti compositi si collocano le stringhe composte da unità lessicali (dette anche token), i frammenti di testo (come i titoli dei paragrafi o le citazioni bibliografiche) e le stringhe che rappresentano molecole dotate di struttura complessiva filamentosa, come quelle che rappresentano una proteina o una struttura di DNA.

Tra gli strumenti che possono sottoporre le stringhe ad elaborazioni si distinguono quelli formali, come gli automi, le macchine di Turing, le grammatiche formali o altri sistemi di riscrittura e quelli più concreti costituiti da programmi o routines di sistemi software. In linea di principio gli strumenti del secondo genere si possono considerare implementazioni dei primi.

Trattamento formale

One starts with a non-empty finite set Σ called an alphabet. Elements of this alphabet are called characters. A string (or word) over Σ is any finite sequence of characters from Σ.

A particularly important string is the sequence of no characters, called the empty string. The empty string is often denoted ε or λ. Note that one does not allow infinite sequences of characters.

For example, if Σ = {0, 1}, strings over Σ are of the form

ε, 0, 1, 00, 01, 10, 11, 000, 001, 010, 011, …

The set of all strings over Σ is denoted Σ*. One can define a binary operation on Σ* called string concatenation. If s and t are two strings, their concatenation, denoted st, is defined as the sequence of characters in s followed by the sequence of characters in t.

For example, if s = bear and t = hug then st = bearhug and ts = hugbear.

String concatenation is an associative, but non-commutative operation. The empty string serves as the identity element. In algebraic terms, the set Σ* forms a monoid under string concatenation. In fact, Σ* is the free monoid generated by Σ.

The length of a string is the number of characters in the string. The length can be any natural number. The length of the empty string is 0. Algebraically speaking, the length function defines a monoid homomorphism from Σ* to N (Non-negative integers with addition).

A string s is said to be a substring of t if there exist two strings u and v such that t = usv. One, or both, of u and v can be empty. The relation "is a substring of" defines a partial order on Σ*, the least element of which is the empty string.

More often, especially in computing applications, one is interested in a different kind of ordering on the set of strings. If the alphabet Σ is well-ordered (cf. alphabetical order) one can define a well-ordering on Σ* called lexicographical order. The empty string is also the least element with respect to this ordering.

A set of strings over Σ (i.e. a subset of Σ*) is called a formal language over Σ. Note that while the alphabet is a finite set and every string has finite length, a language may very well have infinitely many member strings. In fact, Σ* itself is always an infinite language. Important examples of formal languages include regular expressions and formal grammars.

Tipi di dati string

A string datatype is a datatype modeled on the idea of a formal string. Strings are such an important and useful datatype that they are implemented in nearly every programming language. In some languages they are available as primitive types and in others as composite types.

Although formal strings can have an arbitrary (but finite) length, strings in real languages have limited length. In general, there are two types of string datatypes: fixed length strings which have a fixed maximum length, and variable length strings whose length is not arbitrarly fixed. Most strings in modern programming languages are variable length strings. Despite the name, even variable length strings are limited in length; although, generally, the limit depends only on the amount of memory available.

String datatypes also differ by the alphabet over which they are defined. Some strings are designed to work over a single fixed alphabet (such as the Latin alphabet or the characters representable by ASCII) and others over an arbitrary alphabet. For example, in the C++ standard library strings are designed as templates definable over any alphabet of characters.

Rappresentazioni

Representations of strings depend heavily on the choice of character set (such as an alphabet) and the method of character encoding. Older string implementations were designed to work with the character set and encoding defined by ASCII, or more recent extensions like the ISO 8859 series. Modern implementations often use the extensive character set defined by Unicode along with a variety of complex encodings such as UTF-8 and UTF-16.

Most string implementations are very similar to variable-length arrays with the entries storing the character codes of corresponding characters. The principal diffence is that, with certain encodings, a single logical character may take up more than one entry in the array. This happens for example with UTF-8, where single characters can take anywhere from one to four bytes. In these cases the logical length of the string differs from the logical length of the array.

The length of a string can be stored implicitly by using a special terminating character; often this is the null character having value zero, a convention used and perpetuated by the popular C programming language. The length of a string can also be stored explicitly, for example by prefixing the string with integer value (convention used in Pascal). Note that with terminated strings the terminating character is not an allowable character in any string.

Here is an example of a null-terminated string stored in a 10 byte buffer, along with its ASCII representation:

F	R	A	N	K	NUL	k	e	f	w
46	52	41	4E	4B	00	6B	65	66	77

The length of a string in the above example 5 characters, but note that it occupies 6 bytes. Characters after the terminator do not form part of the representation; they may be either part of another string or just garbage.

Here is the equivalent Pascal string:

length	F	R	A	N	K	k	f	f	w
05	46	52	41	4E	4B	6B	66	66	77

While these representations are common, others are possible. Using ropes makes certain string operations, such as insertions, deletions, and concatenations more efficient.

Algoritmi per le stringhe

Sono stati studiati molti algoritmi per la manipolazione delle stringhe i quali si distinguono per finalità e per diverse scelte di compromesso di fronte ad esigenze contrastanti come l'ampiezza della portata e la efficienza.

Questi algoritmi vengono collocati a categorie come le seguenti:

algoritmi di ricerca di stringhe aventi il compito di trovare una data sottostringa o una data configurazione di caratteri,
algoritmi di ordinamento (sorting),
algoritmi per espressioni regolari,
scansori e parser di stringhhe.

Gli algoritmi più avanzati per l'elaborazione di stringhe spesso impiegano elaborati meccanismi formali e complesse strutture di dati: tra questi ricordiamo gli alberi di suffissi e le macchine a stati finiti.

Linguaggi e programmi di utilità orientati all'elaborazione di stringhe

Le stringhe costituiscono un tipo di dati tanto ampiamente utilizzato da indurre lo sviluppo di numerosi linguaggi finalizzati alla facilitazione delle elaborazioni delle stringhe più richieste dalle applicazioni. Tra questi vi sono:

Molti programmi di utilità dell'ambito UNIX effettuano manipolazioni di stringhe relativamente semplici e possono essere utilizzati per programmare algoritmi di elaborazione di stringhe piuttosto efficaci. Questi strumentipossono trattare come stringhe anche i files e gli streams finiti.

I recenti linguaggi di scripting, come Perl, Python, Ruby e Tcl si servono di espressioni regolari per facilitare le manovre sui testi.

[ [ Categoria:Terminologia del computer ] ] ?