String operations: Difference between revisions

Browse history interactively

← Previous edit

Content deleted Content added

VisualWikitext

Revision as of 22:04, 11 January 2017 edit 128.189.133.101 (talk) →String homomorphism ← Previous edit		Latest revision as of 15:18, 12 May 2025 edit undo Kvng (talk \| contribs) Extended confirmed users, New page reviewers 116,075 edits Reverted good faith edits by Dominus (talk): Not clearly improving Tags: Twinkle Undo
(39 intermediate revisions by 26 users not shown)
Line 1: In [[computer science]], in the area of [[formal language theory]], frequent use is made of a variety of [[string functions]]; however, the notation used is different from that used onfor [[computer programming]], and some commonly used functions in the theoretical realm are rarely used when programming. This article defines some of these basic terms. ==Strings and languages== Line 6: The concatenation of two string <math>s</math> and <math>t</math> is denoted by <math>s \cdot t</math>, or shorter by <math>s t</math>. Concatenating with the empty string makes no difference: <math>s \cdot \varepsilon = s = \varepsilon \cdot s</math>. Concatenation of strings is [[associative]]: <math>s \cdot (t \cdot u) = (s \cdot t) \cdot u</math>. For example, <math>(\langle b \rangle \cdot \langle l \rangle) \cdot (\varepsilon \cdot \langle ah \rangle) = \langle bl \rangle \cdot \langle ah \rangle = \langle blah \rangle</math>. Line 13: Besides the usual set operations like union, intersection etc., concatenation can be applied to languages: if both <math>S</math> and <math>T</math> are languages, their concatenation <math>S \cdot T</math> is defined as the set of concatenations of any string from <math>S</math> and any string from <math>T</math>, formally <math>S \cdot T = \{ s \cdot t \mid s \in S \land t \in T \}</math>. Again, the concatenation dot <math>\cdot</math> is often omitted for ~~shortness~~brevity. The language <math>\{\varepsilon\}</math> consisting of just the empty string is to be distinguished from the empty language <math>\{\}</math>. Line 34: ==String substitution== Let ''L'' be a [[language (computer science)\|language]], and let Σ be its alphabet. A '''string substitution''' or simply a '''substitution''' is a mapping ''f'' that maps ~~letters~~characters in Σ to languages (possibly in a different alphabet). Thus, for example, given a ~~letter~~character ''a'' ∈ Σ, one has ''f''(''a'')=''L''<sub>''a''</sub> where ''L''<sub>''a''</sub> ⊆ Δ[[Kleene star\|<sup></sup>]] is some language whose alphabet is Δ. This mapping may be extended to strings as :''f''(ε)=ε Line 42: :''f''(''sa'')=''f''(''s'')''f''(''a'') for string ''s'' ∈ ''L'' and character ''a'' ∈ Σ. String substitutions may be extended to entire languages as <ref>Hopcroft, Ullman (1979), Sect.3.2, p.60</ref> :<math>f(L)=\bigcup_{s\in L} f(s)</math> [[Regular language]]s are closed under string substitution. That is, if each ~~letter~~character in the alphabet of a regular language is substituted by another regular language, the result is still a regular language.<ref>Hopcroft, Ullman (1979), Sect.3.2, Theorem 3.4, p.60</ref> Similarly, [[context-free language]]s are closed under string substitution.<ref>Hopcroft, Ullman (1979), Sect.6.2, Theorem 6.2, p.131</ref><ref group="note">Although every regular language is also context-free, the previous theorem is not implied by the current one, since the former yields a shaper result for regular languages.</ref> A simple example is the conversion ''f''<sub>uc</sub>(.) to ~~upper case~~uppercase, which may be defined e.g. as follows: {\| class="wikitable" \|- ! ~~letter~~character !! mapped to language !! remark \|- ! ''x'' !! ''f''<sub>uc</sub>(''x'') !! \|- \| ‹''a''› \|\| { ‹''A''› } \|\| map ~~lower-case~~lowercase char to corresponding ~~upper-case~~uppercase char \|- \| ‹''A''› \|\| { ‹''A''› } \|\| map ~~upper-case~~uppercase char to itself \|- \| ‹''ß''› \|\| { ‹''SS''› } \|\| no ~~upper-case~~uppercase char available, map to two-char string \|- \| ‹0› \|\| { ε } \|\| map digit to empty string Line 76: For the extension of ''f''<sub>uc</sub> to languages, we have e.g. ''f''<sub>uc</sub>({ ‹Straße›, ‹u2›, ‹Go!› }) = { ‹STRASSE› } ∪ { ‹U› } ∪ { } = { ‹STRASSE›, ‹U› }. Another example is the conversion of an [[EBCDIC]]-encoded string to [[ASCII]].▼ ==String homomorphism== A '''string homomorphism''' (often referred to simply as a [[Homomorphism#~~Homomorphisms_and_e-free_homomorphisms_in_formal_language_theory~~Formal language theory\|homomorphism]] in [[formal language theory]]) is a string substitution such that each ~~letter~~character is replaced by a single string. That is, ''<math>f''(''a'')=''s''</math>, where ''<math>s''</math> is a string, for each ~~letter~~character ''<math>a''</math>.<ref group="note" name="singleton sets">Strictly formally, a homomorphism yields a language consisting of just one string, i.e. ''<math>f''(''a'') = \{''s''\}</math>.</ref><ref>Hopcroft, Ullman (1979), Sect.3.2, p.60-61</ref> String homomorphisms are [[monoid morphism]]s on the [[free monoid]], preserving the empty string and the [[binary operation]] of [[string concatenation]]. Given a language ''<math>L''</math>, the set ''<math>f''(''L'')</math> is called the '''homomorphic image''' of ''<math>L''</math>. The '''inverse homomorphic image''' of a string ''<math>s''</math> is defined as ~~:''f''~~<~~sup>−1</sup~~math>f^{-1}(''s'') = \{ ''w'' \|\mid ''f''(''w'') = ''s'' \}</math> while the inverse homomorphic image of a language ''<math>L''</math> is defined as ~~:''f''~~<~~sup>−1</sup~~math>f^{-1}(''L'') = \{ ''s'' \|\mid ''f''(''s'') ∈\in ''L'' \}</math> In general, ''<math>f''(''f~~''<sup>−1</sup>~~^{-1}(''L'')) ≠\neq ''L''</math>, while one does have <math>f(f^{-1}(L)) \subseteq L</math> ~~:''f''(''f''<sup>−1</sup>(''L'')) ⊆ ''L''~~ and <math>L \subseteq f^{-1}(f(L))</math> ~~:''L'' ⊆ ''f''<sup>−1</sup>(''f''(''L''))~~ for any language ''<math>L''</math>. The class of regular languages is closed under homomorphisms and inverse homomorphisms.<ref>Hopcroft, Ullman (1979), Sect.3.2, Theorem 3.5, p.61</ref> Similarly, the context-free languages are closed under homomorphisms<ref group="note">This follows from the [[#String substitution\|above-mentioned]] closure under arbitrary substitutions.</ref> and inverse homomorphisms.<ref>Hopcroft, Ullman (1979), Sect.6.2, Theorem 6.3, p.132</ref> A string homomorphism is said to be ε-free (or e-free) if ''<math>f''(''a'') ≠\neq ε\varepsilon</math> for all ''a'' in the alphabet Σ<math>\Sigma</math>. Simple single-letter [[substitution cipher]]s are examples of (ε-free) string homomorphisms. An example string homomorphism ''g''<sub>uc</sub> can also be obtained by defining similar to the [[#String_substitution\|above]] substitution: ''g''<sub>uc</sub>(‹a›) = ‹A›, ..., ''g''<sub>uc</sub>(‹0›) = ε, but letting ''g''<sub>uc</sub> be undefined on punctuation chars. <!---omitted, since sloppy notation, anyway, see above--- Besides this restriction of its input ___domain, ''g''<sub>uc</sub> differs from ''f''<sub>uc</sub> by returning strings,<ref group=note name="singleton sets"/> while the latter returned singleton sets of strings. Line 114 ⟶ 112: For the latter language, ''g''<sub>uc</sub>(''g''<sub>uc</sub><sup>−1</sup>({ ‹A›, ‹bb› })) = ''g''<sub>uc</sub>({ ‹a› }) = { ‹A› } ≠ { ‹A›, ‹bb› }. The homomorphism ''g''<sub>uc</sub> is not ε-free, since it maps e.g. ‹0› to ε. ▲~~Another~~A very simple string homomorphism example that maps each character to just a character is the conversion of an [[EBCDIC]]-encoded string to [[ASCII]]. ==String projection== If ''s'' is a string, and <math>\Sigma</math> is an alphabet, the '''string projection''' of ''s'' is the string that results by removing all ~~letters~~characters ~~which~~that are not in <math>\Sigma</math>. It is written as <math>\pi_\Sigma(s)\,</math>. It is formally defined by removal of ~~letters~~characters from the right hand side: :<math>\pi_\Sigma(s) = \begin{cases} Line 124: \end{cases}</math> Here <math>\varepsilon</math> denotes the [[empty string]]. The projection of a string is essentially the same as a [[projection in relational algebra]]. String projection may be promoted to the '''projection of a language'''. Given a [[formal language]] ''L'', its projection is given by :<math>\pi_\Sigma (L)=\{\pi_\Sigma(s)\ \vert\ s\in L \}</math>{{citation needed\|date=August 2017}} ==Right quotient==▼ The '''right quotient''' of a letter ''a'' from a string ''s'' is the truncation of the letter ''a'' in the string ''s'', from the right hand side. It is denoted as <math>s/a</math>. If the string does not have ''a'' on the right hand side, the result is the empty string. Thus:▼ ▲== Right and left quotient == :<math>(sa)/ b = \begin{cases} ▼ ▲The '''right quotient''' of a ~~letter~~character ''a'' from a string ''s'' is the truncation of the ~~letter~~character ''a'' in the string ''s'', from the right hand side. It is denoted as <math>s/a</math>. If the string does not have ''a'' on the right hand side, the result is the empty string. Thus: <!---This definition deviates from Hopcroft.Ullman.1979, as remarked below. I guess the former doesn't have widespread use, if it has a source at all.---> ▲: <math>(sa)/ b = \begin{cases} s & \mbox{if } a=b \\ \varepsilon & \mbox{if } a \ne b Line 139: The quotient of the empty string may be taken: : <math>\varepsilon / a = \varepsilon</math>▼ Similarly, given a subset <math>S\subset M</math> of a monoid <math>M</math>, one may define the quotient subset as ▼ ▲:<math>\varepsilon / a = \varepsilon</math> : <math>S/a=\{s\in M\ \vert\ sa\in S\}</math>▼ '''Left quotients''' may be defined similarly, with operations taking place on the left of a string.{{citation needed\|date=August 2017}}▼ ▲Similarly, given a subset <math>S\subset M</math> of a monoid <math>M</math>, one may define the quotient subset as Hopcroft and Ullman (1979) define the quotient ''L''<sub>1</sub>/''L''<sub>2</sub> of the languages ''L''<sub>1</sub> and ''L''<sub>2</sub> over the same alphabet as {{nowrap\|1=''L''<sub>1</sub>/''L''<sub>2</sub> = {{mset\| ''s'' {{!}} ∃''t''∈''L''<sub>2</sub>. ''st''∈''L''<sub>1</sub> }}}}.<ref>Hopcroft, Ullman (1979), Sect.3.2, p.62</ref> ▲:<math>S/a=\{s\in M\ \vert\ sa\in S\}</math> This is not a generalization of the above definition, since, for a string ''s'' and distinct characters ''a'', ''b'', Hopcroft's and Ullman's definition implies {{nowrap\|\|{{mset\|''sa''}} / {{mset\|''b''}}}} yielding {{mset\|}}, rather than {{mset\| ε }}. The left quotient (when defined similar to Hopcroft and Ullman 1979) of a singleton language ''L''<sub>1</sub> and an arbitrary language ''L''<sub>2</sub> is known as [[Brzozowski derivative]]; if ''L''<sub>2</sub> is represented by a [[regular expression]], so can be the left quotient.<ref>{{cite journal\| author=Janusz A. Brzozowski\| authorlink=Janusz Brzozowski (computer scientist)\|title=Derivatives of Regular Expressions\| journal=J ACM\| year=1964\| volume=11\| issue=4\| pages=481–494\| doi=10.1145/321239.321249\| s2cid=14126942\| doi-access=free}}</ref> ▲Left quotients may be defined similarly, with operations taking place on the left of a string. ==Syntactic relation== Line 157 ⟶ 160: :<math>\{S/m\ \vert\ m\in M\}</math> is finite. In ~~this~~the case that ''M'' is the monoid of words over some alphabet, ''S'' is then a [[~~recognizable~~regular language]], that is, a language that can be recognized by a [[finite -state automaton]]. This is discussed in greater detail in the article on [[syntactic monoid]]s.{{citation needed\|date=August 2017}} ==Right cancellation== The '''right cancellation''' of a ~~letter~~character ''a'' from a string ''s'' is the removal of the first occurrence of the ~~letter~~character ''a'' in the string ''s'', starting from the right hand side. It is denoted as <math>s\div a</math> and is recursively defined as :<math>(sa)\div b = \begin{cases} Line 173 ⟶ 176: Clearly, right cancellation and projection [[Commutative property\|commute]]: :<math>\pi_\Sigma(s)\div a = \pi_\Sigma(s \div a )</math>{{citation needed\|date=August 2017}} ==Prefixes== Line 180 ⟶ 183: :<math>\operatorname{Pref}_L(s) = \{t\ \vert\ s=tu \mbox { for } t,u\in \operatorname{Alph}(L)^\}</math> ~~here~~where <math>s\in L</math>. The '''prefix closure of a language''' is :<math>\operatorname{Pref} (L) = \bigcup_{s\in L} \operatorname{Pref}_L(s) = \left\{ t\ \vert\ s=tu; s\in L; t,u\in \operatorname{Alph}(L)^ \right\}</math> Line 189 ⟶ 192: <math>L=\left\{abc\right\}\mbox{ then } \operatorname{Pref}(L)=\left\{\varepsilon, a, ab, abc\right\}</math> A language is called '''prefix closed''' if <math>\operatorname{Pref} (L) = L</math>. The prefix closure operator is [[idempotent]]: Line 195 ⟶ 198: :<math>\operatorname{Pref} (\operatorname{Pref} (L)) =\operatorname{Pref} (L)</math> The '''prefix relation''' is a [[binary relation]] <math>\sqsubseteq</math> such that <math>s\sqsubseteq t </math> if and only if <math>s \in \operatorname{Pref}_L(t)</math>. This relation is a particular example of a [[prefix order]].{{citation needed\|date=August 2017}} ==See also == Line 207 ⟶ 210: == References == * {{cite book \| first1=John E. \| last1=Hopcroft \| first2=Jeffrey D. \| last2=Ullman \| title=Introduction to Automata Theory, Languages and Computation \| publisher=Addison-Wesley Publishing \| ___location=Reading, Massachusetts \| year=1979 \| isbn=978-0-201-02988-X8 \| zbl=0426.68001 \| url-access=registration \| url=https://archive.org/details/introductiontoau00hopc }} ''(See chapter 3.)'' {{reflist}} {{Strings}} [[Category:Formal languages]]