Parsing expression grammar: Difference between revisions

Browse history interactively

← Previous edit

Content deleted Content added

VisualWikitext

Revision as of 09:09, 19 January 2024 edit WikiCleanerBot (talk \| contribs) Bots 1,007,735 edits m v2.05b - Bot T20 CW#61 - Fix errors for CW project (Reference before punctuation) Tag: WPCleaner ← Previous edit		Latest revision as of 20:06, 19 June 2025 edit undo OAbot (talk \| contribs) Bots 643,717 edits m Open access bot: url-access updated in citation with #oabot.
(10 intermediate revisions by 9 users not shown)
Line 33: A '''parsing expression grammar''' is a collection of named parsing expressions, which may reference each other. The effect of one such reference in a parsing expression is as if the whole referenced parsing expression was given in place of the reference. A parsing expression grammar also has a designated '''starting expression'''; a string matches the grammar if it matches its starting expression. An element of a string matched is called a ''[[terminal symbol]]'', or '''terminal''' for short. Likewise the names assigned to parsing expressions are called ''[[nonterminal symbol]]s'', or '''nonterminals''' for short. These terms would be descriptive for [[Chomsky hierarchy\|generative grammars]], but in the case of parsing expression grammars they are merely terminology, kept mostly because of being near ~~ubiquous~~ubiquitous in discussions of [[parsing]] algorithms. === Syntax === Both ''abstract'' and ''concrete'' syntaxes of parsing expressions are seen in the literature, and in this article. The abstract syntax is essentially a [[expression (mathematics)\|mathematical formula]] and primarily used in theoretical contexts, whereas concrete syntax parsing expressions could be used directly to control a [[parser]]. The primary concrete syntax is that defined by Ford,<ref name="For04"/>{{rp\|Fig.1}} although many tools have their own dialect of this. Other tools<ref>{{cite web \|last1=Sirthias \|first1=Mathias \|title=Parboiled: Rule Construction in Java \|website=[[GitHub]] \|url=https://github.com/sirthias/parboiled/wiki/Rule-Construction-in-Java \|access-date=13 January 2024}}</ref> can be closer to using a programming-language native encoding of abstract syntax parsing expressions as their concrete syntax. ==== Atomic parsing expressions ==== Line 187: \|class=cs.PL \| eprint = 2005.06444 }}</ref> uses dynamic programming to apply PEG rules bottom-up and right to left, which is the inverse of the normal recursive descent order of top-down, left to right. Parsing in reverse order solves the left recursion problem, allowing left-recursive rules to be used directly in the grammar without being rewritten into non-left-recursive form, and also ~~conveys~~confers optimal error recovery capabilities upon the parser, which historically proved difficult to achieve for recursive descent parsers. == Advantages == Line 217: PEGs can comfortably be given in terms of characters, whereas [[context-free grammar]]s (CFGs) are usually given in terms of tokens, thus requiring an extra step of tokenisation in front of parsing proper.<ref>CFGs ''can'' be used to describe the syntax of common programming languages down to the character level, but doing so is rather cumbersome, because the standard tokenisation rule that a token consists of the longest consecutive sequence of characters of the same kind does not mesh well with the nondeterministic side of CFGs. To formalise that whitespace between two adjacent tokens is mandatory if the characters on both sides of the token boundary are letters, but optional if they are non-letters, a CFG needs multiple variants of most nonterminals, to keep track of what kind of character has to be at the boundary. If there are <math>3</math> different kinds of non-whitespace characters, that adds up to <math>3^2 = 9</math> possible variants per nonterminal — significantly bloating the grammar.</ref> An advantage of not having a separate tokeniser is that different parts of the language (for example embedded [[Domain-specific language\|mini-language]]s) can easily have different tokenisation rules. In the strict formal sense, PEGs are likely incomparable to CFGs, but practically there are many things that PEGs can do which pure CFGs cannot, whereas it is difficult to come up with examples of the contrary. In particular PEGs can be crafted to natively resolve ambiguities, such as the "[[dangling else]]" problem in C, C++, and Java, whereas CFG-based parsing often needs a rule outside of the grammar to resolve them. Moreover any PEG can be parsed in linear time by using a packrat parser, as described above, whereas parsing according to a general CFG is asymptotically equivalent<ref>{{cite journal \|last1=Lee \|first1=Lillian \|title=Fast Context-free Grammar Parsing Requires Fast Boolean Matrix Multiplication \|journal=J. ACM \|date=January 2002 \|volume=49 \|issue=1 \|page=1–15 \|doi=10.1145/505241.505242\|arxiv=cs/0112018 }}</ref> to [[logical matrix\|boolean matrix]] multiplication (thus likely between quadratic and cubic time). One classical example of a formal language which is provably not context-free is the language <math> \{a^n b^n c^n\}_{n \geqslant 0} </math>: an arbitrary number of <math>a</math>s are followed by an equal number of <math>b</math>s, which in turn are followed by an equal number of <math>c</math>s. This, too, is a parsing expression language, matched by the grammar Line 386: == Practical use == * [[Python (programming language)\|Python]] reference implementation [[CPython]] introduced a PEG parser in version 3.9 as an alternative to the [[LL parser\|LL(1) parser]] and uses just PEG from version 3.10.<ref>{{Cite web \|title=PEP 617 – New PEG parser for CPython {{!}} peps.python.org \|url=https://peps.python.org/pep-0617/ \|access-date=2023-01-16 \|website=peps.python.org}}</ref> * The [[Jq_(programming_language)#Parsing_Expression_Grammars\|jq programming language]] uses a formalism closely related to PEG. * The [[Lua (programming language)\|Lua]] authors created [https://www.inf.puc-rio.br/~roberto/lpeg/ LPeg], a [[pattern-matching]] library that uses PEG instead of [[regular expressions]],<ref>{{cite journal \|last1=Ierusalimschy \|first1=Roberto \|title=A text pattern-matching tool based on Parsing Expression Grammars \|journal=Software: Practice and Experience \|date=10 March 2009 \|volume=39 \|issue=3 \|pages=221–258 \|doi=10.1002/spe.892 \|url=https://onlinelibrary.wiley.com/doi/10.1002/spe.892 \|language=en \|issn=0038-0644\|url-access=subscription }}</ref> as well as the [https://www.inf.puc-rio.br/~roberto/lpeg/re.html re] module which implements a regular-expression-like syntax utilizing the LPeg library.<ref>{{cite web \|last1=Ierusalimschy \|first1=Roberto \|title=LPeg.re - Regex syntax for LPEG \|url=https://www.inf.puc-rio.br/~roberto/lpeg/re.html \|website=www.inf.puc-rio.br}}</ref> == See also ==