Parsing expression grammar: Difference between revisions

Browse history interactively

← Previous edit

Content deleted Content added

VisualWikitext

Revision as of 13:50, 13 January 2024 edit 37.197.59.228 (talk) →Compared to regular expressions: Simplified example. Elaborated point about nondeterminism. ← Previous edit		Latest revision as of 20:06, 19 June 2025 edit undo OAbot (talk \| contribs) Bots 643,717 edits m Open access bot: url-access updated in citation with #oabot.
(21 intermediate revisions by 11 users not shown)
Line 29: == Definition == {{No footnotes\|section\|date=December 2022}} A '''parsing expression''' is a kind of pattern that each string may either '''match''' or '''not match'''. In case of a match, there is a unique prefix of the string (which may be the whole string, the empty string, or something in between) which has been ''consumed'' by the parsing expression; this prefix is what one would usually think of as having matched the expression. However, whether a string matches a parsing expression ''may'' (because of look-ahead predicates) depend on parts of it which come after the consumed part. A '''parsing expression language''' is a set of all strings that match some specific parsing expression.<ref name="For04" />{{rp\|Sec.3.4}} A '''parsing expression grammar''' is a collection of named parsing expressions, which may reference each other. The effect of one such reference in a parsing expression is as if the whole referenced parsing expression was given in place of the reference. A parsing expression grammar also has a designated '''starting expression'''; a string matches the grammar if it matches its starting expression. An element of a string matched is called a ''[[terminal symbol]]'', or '''terminal''' for short. Likewise the names assigned to parsing expressions are called ''[[nonterminal symbol]]s'', or '''nonterminals''' for short. These terms would be descriptive for [[Chomsky hierarchy\|generative grammars]], but in the case of parsing expression grammars they are merely terminology, kept mostly because of being near ubiquitous in discussions of [[parsing]] algorithms. === Syntax === Both ''abstract'' and ''concrete'' syntaxes of parsing expressions are seen in the literature, and in this article. The abstract syntax is essentially a [[expression (mathematics)\|mathematical formula]] and primarily used in theoretical contexts, whereas concrete syntax parsing expressions could be used directly to control a [[parser]]. The primary concrete syntax is that defined by Ford,<ref name="For04"/>{{rp\|Fig.1}} although many tools have their own dialect of this. Other tools<ref>{{cite web \|last1=Sirthias \|first1=Mathias \|title=Parboiled: Rule Construction in Java \|website=[[GitHub]] \|url=https://github.com/sirthias/parboiled/wiki/Rule-Construction-in-Java \|access-date=13 January 2024}}</ref> can be closer to using a programming-language native encoding of abstract syntax parsing expressions as their concrete syntax. ~~Formally, a parsing expression grammar consists of:~~ * A finite set ''N'' of ''[[nonterminal symbol]]s''. * A finite set Σ of ''[[terminal symbol]]s'' that is [[disjoint sets\|disjoint]] from ''N''. * A finite set ''P'' of ''parsing rules''. * An expression ''e<sub>S</sub>'' termed the ''starting expression''. ==== Atomic parsing expressions ==== Each parsing rule in ''P'' has the form ''A'' ← ''e'', where ''A'' is a nonterminal symbol and ''e'' is a ''parsing expression''. A parsing expression is a hierarchical [[expression (mathematics)\|expression]] similar to a [[regular expression]], which is constructed in the following fashion: The two main kinds of parsing expressions not containing another parsing expression are individual terminal symbols and nonterminal symbols. In concrete syntax, terminals are placed inside quotes (single or double), whereas identifiers not in quotes denote nonterminals: ~~# An ''atomic parsing expression'' consists of:~~ <syntaxhighlight lang="peg"> ~~#* any terminal symbol,~~ "terminal" Nonterminal 'another terminal' ~~#* any nonterminal symbol, or~~ </syntaxhighlight> ~~#* the empty string ε.~~ In the abstract syntax there is no formalised distinction, instead each symbol is supposedly defined as either terminal or nonterminal, but a common convention is to use upper case for nonterminals and lower case for terminals. ~~# Given any existing parsing expressions ''e'', ''e''<sub>1</sub>, and ''e''<sub>2</sub>, a new parsing expression can be constructed using the following operators:~~ ~~#* ''Sequence'': ''e''<sub>1</sub> ''e''<sub>2</sub>~~ The concrete syntax also has a number of forms for classes of terminals: ~~#* ''Ordered choice'': ''e''<sub>1</sub> / ''e''<sub>2</sub>~~ * A <code>.</code> (period) is a parsing expression matching any single terminal. #* ''Zero-or-more'': ''e''* * Brackets around a list of characters <code>[abcde]</code> form a parsing expression matching one of the numerated characters. As in [[regular expression]]s, these classes may also include ranges <code>[0-9A-Za-z]</code> written as a hyphen with the range endpoints before and after it. (Unlike the case in regular expressions, bracket character classes do not have <code>^</code> for negation; that end can instead be had via not-predicates.) ~~#* ''One-or-more'': ''e''+~~ * Some dialects have further notation for predefined classes of characters, such as letters, digits, punctuation marks, or spaces; this is again similar to the situation in regular expressions. ~~#* ''Optional'': ''e''?~~ In abstract syntax, such forms are usually formalised as nonterminals whose exact definition is elided for brevity; in Unicode, there are tens of thousands of characters that are letters. Conversely, theoretical discussions sometimes introduce atomic abstract syntax for concepts that can alternatively be expressed using composite parsing expressions. Examples of this include: ~~#* ''And-predicate'': &''e''~~ * the empty string ε (as a parsing expression, it matches every string and consumes no characters), ~~#* ''Not-predicate'': !''e''~~ * end of input ''E'' (concrete syntax equivalent is <code>!.</code>), and ~~#* ''Group'': (''e'')~~ * failure <math>\bot</math> (matches nothing). ~~# Operator priorities are as follows, based on Table 1 in:<ref name="For04" />~~ In the concrete syntax, quoted and bracketed terminals have backslash escapes, so that "[[line feed]] or [[carriage return]]" may be written <code>[\n\r]</code>. The abstract syntax counterpart of a quoted terminal of length greater than one would be the sequence of those terminals; <code>"bar"</code> is the same as <code>"b" "a" "r"</code>. The primary concrete syntax assigns no distinct meaning to terminals depending on whether they use single or double quotes, but some dialects treat one as case-sensitive and the other as case-insensitive. ==== Composite parsing expressions ==== Given any existing parsing expressions ''e'', ''e''<sub>1</sub>, and ''e''<sub>2</sub>, a new parsing expression can be constructed using the following operators: * ''Sequence'': ''e''<sub>1</sub> ''e''<sub>2</sub> * ''Ordered choice'': ''e''<sub>1</sub> / ''e''<sub>2</sub> * ''Zero-or-more'': ''e''* * ''One-or-more'': ''e''+ * ''Optional'': ''e''? * ''And-predicate'': &''e'' * ''Not-predicate'': !''e'' * ''Group'': (''e'') Operator priorities are as follows, based on Table 1 in:<ref name="For04" /> {\| class="wikitable" ! Operator !! Priority Line 64 ⟶ 80: \| ''e''<sub>1</sub> / ''e''<sub>2</sub> \|\| 1 \|} ==== Grammars ==== In the concrete syntax, a parsing expression grammar is simply a sequence of nonterminal definitions, each of which has the form <syntaxhighlight lang="peg"> Identifier LEFTARROW Expression </syntaxhighlight> The <code>Identifier</code> is the nonterminal being defined, and the <code>Expression</code> is the parsing expression it is defined as referencing. The <code>LEFTARROW</code> varies a bit between dialects, but is generally some left-pointing arrow or assignment symbol, such as <code><-</code>, <code>←</code>, <code>:=</code>, or <code>=</code>. One way to understand it is precisely as making an assignment or definition of the nonterminal. Another way to understand it is as a contrast to the right-pointing arrow → used in the rules of a [[context-free grammar]]; with parsing expressions the flow of information goes from expression to nonterminal, not nonterminal to expression. As a mathematical object, a parsing expression grammar is a tuple <math>(N,\Sigma,P,e_S)</math>, where <math>N</math> is the set of nonterminal symbols, <math>\Sigma</math> is the set of terminal symbols, <math>P</math> is a [[Function (mathematics)\|function]] from <math>N</math> to the set of parsing expressions on <math>N \cup \Sigma</math>, and <math>e_S</math> is the starting parsing expression. Some concrete syntax dialects give the starting expression explicitly,<ref name="ptKupries">{{cite web \|last1=Kupries \|first1=Andreas \|title=pt::peg_language - PEG Language Tutorial \|url=https://core.tcl-lang.org/tcllib/doc/tcllib-1-21/embedded/md/tcllib/files/modules/pt/pt_peg_language.md \|website=Tcl Library Source Code \|access-date=14 January 2024}}</ref> but the primary concrete syntax instead has the implicit rule that the first nonterminal defined is the starting expression. It is worth noticing that the primary dialect of concrete syntax parsing expression grammars does not have an explicit definition terminator or separator between definitions, although it is customary to begin a new definition on a new line; the <code>LEFTARROW</code> of the next definition is sufficient for finding the boundary, if one adds the constraint that a nonterminal in an <code>Expression</code> must not be followed by a <code>LEFTARROW</code>. However, some dialects may allow an explicit terminator, or outright require<ref name="ptKupries"/> it. === Example === This is a PEG that recognizes mathematical formulas that apply the basic five operations to non-negative integers. <syntaxhighlight lang="peg"> Expr ← Sum Sum ← Product (('+' / '-') Product)* Product ← Power (('' / '/') Power) Power ← Value ('^' Power)? Value ← [0-9]+ / '(' Expr ')' </syntaxhighlight> In the above example, the terminal symbols are characters of text, represented by characters in single quotes, such as <code>'('</code> and <code>')'</code>. The range <code>[0-9]</code> is a shortcut for the ten characters from <code>'0'</code> to <code>'9'</code>. (This range syntax is the same as the syntax used by [[regular expression]]s.) The nonterminal symbols are the ones that expand to other rules: ''Value'', ''Power'', ''Product'', ''Sum'', and ''Expr''. Note that rules ''Sum'' and ''Product'' don't lead to desired left-associativity of these operations (they don't deal with associativity at all, and it has to be handled in post-processing step after parsing), and the ''Power'' rule (by referring to itself on the right) results in desired right-associativity of exponent. Also note that a rule like {{code\|Sum ← Sum (('+' / '-') Product)?\|peg}} (with intention to achieve left-associativity) would cause infinite recursion, so it cannot be used in practice even though it can be expressed in the grammar. === Semantics === Line 92 ⟶ 130: The '''not-predicate''' expression !''e'' succeeds if ''e'' fails and fails if ''e'' succeeds, again consuming no input in either case. === ~~Examples~~More examples === ~~This is a PEG that recognizes mathematical formulas that apply the basic five operations to non-negative integers.~~ ~~<syntaxhighlight lang="peg">~~ ~~Expr ← Sum~~ Sum ← Product (('+' / '-') Product)* Product ← Power (('' / '/') Power) ~~Power ← Value ('^' Power)?~~ ~~Value ← [0-9]+ / '(' Expr ')'~~ ~~</syntaxhighlight>~~ In the above example, the terminal symbols are characters of text, represented by characters in single quotes, such as <code>'('</code> and <code>')'</code>. The range <code>[0-9]</code> is a shortcut for the ten characters from <code>'0'</code> to <code>'9'</code>. (This range syntax is the same as the syntax used by [[regular expression]]s.) The nonterminal symbols are the ones that expand to other rules: ''Value'', ''Power'', ''Product'', ''Sum'', and ''Expr''. Note that rules ''Sum'' and ''Product'' don't lead to desired left-associativity of these operations (they don't deal with associativity at all, and it has to be handled in post-processing step after parsing), and the ''Power'' rule (by referring to itself on the right) results in desired right-associativity of exponent. Also note that a rule like {{code\|Sum ← Sum (('+' / '-') Product)?\|peg}} (with intention to achieve left-associativity) would cause infinite recursion, so it cannot be used in practice even though it can be expressed in the grammar. The following recursive rule matches standard C-style if/then/else statements in such a way that the optional "else" clause always binds to the innermost "if", because of the implicit prioritization of the '/' operator. (In a [[context-free grammar]], this construct yields the classic [[dangling else\|dangling else ambiguity]].) <syntaxhighlight lang="peg"> Line 108 ⟶ 136: </syntaxhighlight> The following recursive rule matches Pascal-style nested comment syntax, {{code\|(* which can (* nest ) like this )}}. ~~The~~Recall ~~comment~~that ~~symbols~~{{code\|.\|peg}} ~~appear~~matches inany single ~~quotes to distinguish them from PEG operators~~character. <syntaxhighlight lang="peg"> C ← Begin N* End Begin ← '(' End ← ')' CN ← ~~Begin~~C N/ (!Begin !End .) ~~N ← C / (!Begin !End Z)~~ ~~Z ← any single character~~ </syntaxhighlight> Line 160 ⟶ 187: \|class=cs.PL \| eprint = 2005.06444 }}</ref> uses dynamic programming to apply PEG rules bottom-up and right to left, which is the inverse of the normal recursive descent order of top-down, left to right. Parsing in reverse order solves the left recursion problem, allowing left-recursive rules to be used directly in the grammar without being rewritten into non-left-recursive form, and also ~~conveys~~confers optimal error recovery capabilities upon the parser, which historically proved difficult to achieve for recursive descent parsers. == Advantages == Line 190 ⟶ 217: PEGs can comfortably be given in terms of characters, whereas [[context-free grammar]]s (CFGs) are usually given in terms of tokens, thus requiring an extra step of tokenisation in front of parsing proper.<ref>CFGs ''can'' be used to describe the syntax of common programming languages down to the character level, but doing so is rather cumbersome, because the standard tokenisation rule that a token consists of the longest consecutive sequence of characters of the same kind does not mesh well with the nondeterministic side of CFGs. To formalise that whitespace between two adjacent tokens is mandatory if the characters on both sides of the token boundary are letters, but optional if they are non-letters, a CFG needs multiple variants of most nonterminals, to keep track of what kind of character has to be at the boundary. If there are <math>3</math> different kinds of non-whitespace characters, that adds up to <math>3^2 = 9</math> possible variants per nonterminal — significantly bloating the grammar.</ref> An advantage of not having a separate tokeniser is that different parts of the language (for example embedded [[Domain-specific language\|mini-language]]s) can easily have different tokenisation rules. In the strict formal sense, PEGs are likely incomparable to CFGs, but practically there are many things that PEGs can do which pure CFGs cannot, whereas it is difficult to come up with examples of the contrary. In particular PEGs can be crafted to natively resolve ambiguities, such as the "[[dangling else]]" problem in C, C++, and Java, whereas CFG-based parsing often needs a rule outside of the grammar to resolve them. Moreover any PEG can be parsed in linear time by using a packrat parser, as described above, whereas parsing according to a general CFG is asymptotically equivalent<ref>{{cite journal \|last1=Lee \|first1=Lillian \|title=Fast Context-free Grammar Parsing Requires Fast Boolean Matrix Multiplication \|journal=J. ACM \|date=January 2002 \|volume=49 \|issue=1 \|page=1–15 \|doi=10.1145/505241.505242\|arxiv=cs/0112018 }}</ref> to [[logical matrix\|boolean matrix]] multiplication (thus likely between quadratic and cubic time). One classical example of a formal language which is provably not context-free is the language <math> \{a^n b^n c^n\}_{n \geqslant 0} </math>: an arbitrary number of <math>a</math>s are followed by an equal number of <math>b</math>s, which in turn are followed by an equal number of <math>c</math>s. This, too, is a parsing expression language, matched by the grammar Line 221 ⟶ 248: For recursive grammars and some inputs, the depth of the parse tree can be proportional to the input size,<ref>for example, the LISP expression (x (x (x (x ....))))</ref> so both an LR parser and a packrat parser will appear to have the same worst-case asymptotic performance. However in many domains, for example hand-written source code, the expression nesting depth has an effectively constant bound quite independent of the length of the program, because expressions nested beyond a certain depth tend to get [[refactor]]ed. When it is not necessary to keep the full parse tree, a more accurate analysis would take the depth of the parse tree into account separately from the input size.<ref>This is similar to a situation which arises in [[graph algorithms]]: the [[Bellman–Ford algorithm]] and [[Floyd–Warshall algorithm]] appear to have the same running time (<math>O(\|V\|^3)</math>) if only the number of vertices is considered. However, a more precise analysis which accounts for the number of edges as a separate parameter assigns the [[Bellman–Ford algorithm]] a time of <math>O(\|V\|\|E\|)</math>, which is quadratic for sparse graphs with <math>\|E\| \in O(\|V\|)</math>.</ref> === Computational model === In order to attain linear overall complexity, the storage used for memoization must furthermore provide [[amortized analysis\|amortized constant time]] access to individual data items memoized. In practice that is no problem — for example a dynamically sized [[hash table]] attains this – but that makes use of [[Pointer (computer programming)\|pointer]] arithmetic, so it presumes having a [[random-access machine]]. Theoretical discussions of data structures and algorithms have an unspoken tendency to presume a more restricted model (possibly that of [[lambda calculus]], possibly that of [[Scheme (programming language)\|Scheme]]), where a sparse table rather has to be built using trees, and data item access is not constant time. Traditional parsing algorithms such as the [[LL parser]] are not affected by this, but it becomes a penalty for the reputation of packrat parsers: they rely on operations of seemingly ill repute. Viewed the other way around, this says packrat parsers tap into computational power readily available in real life systems, that older parsing algorithms do not understand to employ. === Indirect left recursion === Line 235 ⟶ 267: <syntaxhighlight lang="peg"> Sum ← Term ( '+' Term / '-' Term )* Args →← Arg ( ',' Arg )* </syntaxhighlight> A difference lies in the [[abstract syntax tree]]s generated: with recursion each <code>Sum</code> or <code>Args</code> can have at most two children, but with repetition there can be arbitrarily many. If later stages of processing require that such lists of children are recast as trees with bounded [[degree (graph theory)\|degree]], for example microprocessor instructions for addition typically only allow two operands, then properties such as [[left-associative\|left-associativity]] would be imposed after the PEG-directed parsing stage. Line 315 ⟶ 347: ↑ Pos.1: First branch of S succeeds, yielding match of length 3. Matching against a parsing expression is [[greedy algorithm\|greedy]], in the sense that the first success encountered is the only one considered. Even if locally the choices are ordered longest first, there is no guarantee that this greedy matching will find the globally longest match. It is an open problem to give a concrete example of a context-free language which cannot be recognized by a parsing expression grammar.<ref name="For04" /> In particular, it is open whether a parsing expression grammar can recognize the language of palindromes.<ref>{{cite arXiv \|last1=Loff \|first1=Bruno \|last2=Moreira \|first2=Nelma \|last3=Reis \|first3=Rogério \|date=2020-02-14 \|title=The computational power of parsing expression grammars \|class=cs.FL \|eprint=1902.08272 }}</ref> === Ambiguity detection and influence of rule order on language that is matched === Line 336 ⟶ 366: S → start(G1) \| start(G2) </syntaxhighlight> == Theory of parsing expression grammars == It is an open problem to give a concrete example of a context-free language which cannot be recognized by a parsing expression grammar.<ref name="For04" /> In particular, it is open whether a parsing expression grammar can recognize the language of palindromes.<ref>{{cite arXiv \|last1=Loff \|first1=Bruno \|last2=Moreira \|first2=Nelma \|last3=Reis \|first3=Rogério \|date=2020-02-14 \|title=The computational power of parsing expression grammars \|class=cs.FL \|eprint=1902.08272 }}</ref> The class of parsing expression languages is closed under set intersection and complement, thus also under set union.<ref name="For04" />{{rp\|Sec.3.4}} === Undecidability of emptiness === In stark contrast to the case for context-free grammars, it is not possible to generate elements of a parsing expression language from its grammar. Indeed, it is algorithmically [[undecidable problem\|undecidable]] whether the language recognised by a parsing expression grammar is empty! One reason for this is that any instance of the [[Post correspondence problem]] reduces to an instance of the problem of deciding whether a parsing expression language is empty. Recall that an instance of the Post correspondence problem consists of a list <math> (\alpha_1,\beta_1), (\alpha_2,\beta_2), \dotsc, (\alpha_n,\beta_n) </math> of pairs of strings (of terminal symbols). The problem is to determine whether there exists a sequence <math>\{k_i\}_{i=1}^m</math> of indices in the range <math>\{1,\dotsc,n\}</math> such that <math> \alpha_{k_1} \alpha_{k_2} \dotsb \alpha_{k_m} = \beta_{k_1} \beta_{k_2} \dotsb \beta_{k_m} </math>. To [[reduction (complexity)\|reduce]] this to a parsing expression grammar, let <math> \gamma_0, \gamma_1, \dotsc, \gamma_n </math> be arbitrary pairwise distinct equally long strings of terminal symbols (already with <math>2</math> distinct symbols in the terminal symbol alphabet, length <math>\lceil \log_2(n+1) \rceil</math> suffices) and consider the parsing expression grammar <math display="block"> \begin{aligned} S &\leftarrow \&(A \, !.) \&(B \, !.) (\gamma_1/\dotsb/\gamma_n)^+ \gamma_0 \\ A &\leftarrow \gamma_0 / \gamma_1 A \alpha_1 / \dotsb / \gamma_n A \alpha_n \\ B &\leftarrow \gamma_0 / \gamma_1 B \beta_1 / \dotsb / \gamma_n B \beta_n \end{aligned} </math> Any string matched by the nonterminal <math>A</math> has the form <math>\gamma_{k_m} \dotsb \gamma_{k_2} \gamma_{k_1} \gamma_0 \alpha_{k_1} \alpha_{k_2} \dotsb \alpha_{k_m}</math> for some indices <math>k_1,k_2,\dotsc,k_m</math>. Likewise any string matched by the nonterminal <math>B</math> has the form <math>\gamma_{k_m} \dotsb \gamma_{k_2} \gamma_{k_1} \gamma_0 \beta_{k_1} \beta_{k_2} \dotsb \beta_{k_m}</math>. Thus any string matched by <math>S</math> will have the form <math>\gamma_{k_m} \dotsb \gamma_{k_2} \gamma_{k_1} \gamma_0 \rho</math> where <math> \rho = \alpha_{k_1} \alpha_{k_2} \dotsb \alpha_{k_m} = \beta_{k_1} \beta_{k_2} \dotsb \beta_{k_m}</math>. == Practical use == * [[Python (programming language)\|Python]] reference implementation [[CPython]] introduced a PEG parser in version 3.9 as an alternative to the [[LL parser\|LL(1) parser]] and uses just PEG from version 3.10.<ref>{{Cite web \|title=PEP 617 – New PEG parser for CPython {{!}} peps.python.org \|url=https://peps.python.org/pep-0617/ \|access-date=2023-01-16 \|website=peps.python.org}}</ref> ~~The [[Jq_(programming_language)#Parsing_Expression_Grammars\|jq programming language]] uses formalism closely related to PEG.~~ * The [[Jq_(programming_language)#Parsing_Expression_Grammars\|jq programming language]] uses a formalism closely related to PEG. * The [[Lua (programming language)\|Lua]] authors created [https://www.inf.puc-rio.br/~roberto/lpeg/ LPeg], a [[pattern-matching]] library that uses PEG instead of [[regular expressions]],<ref>{{cite journal \|last1=Ierusalimschy \|first1=Roberto \|title=A text pattern-matching tool based on Parsing Expression Grammars \|journal=Software: Practice and Experience \|date=10 March 2009 \|volume=39 \|issue=3 \|pages=221–258 \|doi=10.1002/spe.892 \|url=https://onlinelibrary.wiley.com/doi/10.1002/spe.892 \|language=en \|issn=0038-0644\|url-access=subscription }}</ref> as well as the [https://www.inf.puc-rio.br/~roberto/lpeg/re.html re] module which implements a regular-expression-like syntax utilizing the LPeg library.<ref>{{cite web \|last1=Ierusalimschy \|first1=Roberto \|title=LPeg.re - Regex syntax for LPEG \|url=https://www.inf.puc-rio.br/~roberto/lpeg/re.html \|website=www.inf.puc-rio.br}}</ref> == See also ==