Parsing expression grammar: Difference between revisions

Content deleted Content added
Syntax: Abstract versus concrete syntax
Definition: Terminals and nonterminals. Subsections for atomic and composite parsing expressions, as well as grammars.
Line 32:
 
A '''parsing expression grammar''' is a collection of named parsing expressions, which may reference each other. The effect of one such reference in a parsing expression is as if the whole referenced parsing expression was given in place of the reference. A parsing expression grammar also has a designated '''starting expression'''; a string matches the grammar if it matches its starting expression.
 
An element of a string matched is called a ''[[terminal symbol]]'', or '''terminal''' for short. Likewise the names assigned to parsing expressions are called ''[[nonterminal symbol]]s'', or '''nonterminals''' for short. These terms would be descriptive for [[Chomsky hierarchy|generative grammars]], but in the case of parsing expression grammars they are merely terminology, kept mostly because of being near ubiquous in discussions of [[parsing]] algorithms.
 
=== Syntax ===
Both ''abstract'' and ''concrete'' syntaxes of parsing expressions are seen in the literature, and in this article. The abstract syntax is essentially a [[expression (mathematics)|mathematical formula]] and primarily used in theoretical contexts, whereas concrete syntax parsing expressions could be used directly to control a [[parser]]. The primary concrete syntax is that defined by Ford<ref name="For04"/>{{rp|Fig.1}}, although many tools have their own dialect of this. Other tools<ref>{{cite web |last1=Sirthias |first1=Mathias |title=Parboiled: Rule Construction in Java |url=https://github.com/sirthias/parboiled/wiki/Rule-Construction-in-Java |access-date=13 January 2024}}</ref> can be closer to using a programming-language native encoding of abstract syntax parsing expressions as their concrete syntax.
 
==== Atomic parsing expressions ====
Formally, a parsing expression grammar consists of:
The two main kinds of parsing expressions not containing another parsing expression are individual terminal symbols and nonterminal symbols. In concrete syntax, terminals are placed inside quotes (single or double), whereas identifiers not in quotes denote nonterminals:
* A finite set ''N'' of ''[[nonterminal symbol]]s''.
<syntaxhighlight lang="peg">
* A finite set Σ of ''[[terminal symbol]]s'' that is [[disjoint sets|disjoint]] from ''N''.
"terminal" Nonterminal 'another terminal'
* A finite set ''P'' of ''parsing rules''.
</syntaxhighlight>
* An expression ''e<sub>S</sub>'' termed the ''starting expression''.
In the abstract syntax there is no formalised distinction, instead each symbol is supposedly defined as either terminal or nonterminal, but a common convention is to use upper case for nonterminals and lower case for terminals.
 
The concrete syntax also has a number of forms for classes of terminals:
Each parsing rule in ''P'' has the form ''A'' ← ''e'', where ''A'' is a nonterminal symbol and ''e'' is a ''parsing expression''. A parsing expression is a hierarchical [[expression (mathematics)|expression]] similar to a [[regular expression]], which is constructed in the following fashion:
#* AnA ''atomic<code>.</code> (period) is a parsing expression'' consistsmatching any single of:terminal.
* Brackets around a list of characters <code>[abcde]</code> form a parsing expression matching one of the numerated characters. As in [[regular expression]]s, these classes may also include ranges <code>[0-9A-Za-z]</code> written as a hyphen with the range endpoints before and after it. (Unlike the case in regular expressions, bracket character classes do not have <code>^</code> for negation; that end can instead be had via not-predicates.)
#* any terminal symbol,
* Some dialects have further notation for predefined classes of characters, such as letters, digits, punctuation marks, or spaces; this is again similar to the situation in regular expressions.
#* any nonterminal symbol, or
In abstract syntax, such forms are usually formalised as nonterminals whose exact definition is elided for brevity; in Unicode, there are tens of thousands of characters that are letters. Conversely, theoretical discussions sometimes introduce atomic abstract syntax for concepts that can alternatively be expressed using composite parsing expressions. Examples of this include:
#* the empty string ε.
* the empty string ε (as a parsing expression, it matches every string and consumes no characters),
# Given any existing parsing expressions ''e'', ''e''<sub>1</sub>, and ''e''<sub>2</sub>, a new parsing expression can be constructed using the following operators:
* end of input ''E'' (concrete syntax equivalent is <code>!.</code>), and
#* ''Sequence'': ''e''<sub>1</sub> ''e''<sub>2</sub>
* failure <math>\bot</math> (matches nothing).
#* ''Ordered choice'': ''e''<sub>1</sub> / ''e''<sub>2</sub>
 
#* ''Zero-or-more'': ''e''*
In the concrete syntax, quoted and bracketed terminals have backslash escapes, so that "[[line feed]] or [[carriage return]]" may be written <code>[\n\r]</code>. The abstract syntax counterpart of a quoted terminal of length greater than one would be the sequence of those terminals; <code>"bar"</code> is the same as <code>"b" "a" "r"</code>. The primary concrete syntax assigns no distinct meaning to terminals depending on whether they use single or double quotes, but some dialects treat one as case-sensitive and the other as case-insensitive.
#* ''One-or-more'': ''e''+
 
#* ''Optional'': ''e''?
==== Composite parsing expressions ====
#* ''And-predicate'': &''e''
# Given any existing parsing expressions ''e'', ''e''<sub>1</sub>, and ''e''<sub>2</sub>, a new parsing expression can be constructed using the following operators:
#* ''Not-predicate'': !''e''
#* ''GroupSequence'': (''e'')<sub>1</sub> ''e''<sub>2</sub>
#* ''SequenceOrdered choice'': ''e''<sub>1</sub> / ''e''<sub>2</sub>
# Operator priorities are as follows, based on Table 1 in:<ref name="For04" />
#* ''Zero-or-more'': ''e''*
#* ''One-or-more'': ''e''+
#* ''Optional'': ''e''?
#* ''And-predicate'': &''e''
#* ''Not-predicate'': !''e''
* ''Group'': (''e'')
# Operator priorities are as follows, based on Table 1 in:<ref name="For04" />
{| class="wikitable"
! Operator !! Priority
Line 70 ⟶ 80:
| ''e''<sub>1</sub> / ''e''<sub>2</sub> || 1
|}
 
==== Grammars ====
As a mathematical object, a parsing expression grammar is a tuple <math>(N,\Sigma,P,e_S)</math>, where <math>N</math> is the set of nonterminal symbols, <math>\Sigma</math> is the set of terminal symbols, <math>P</math> is a [[Function (mathematics)|function]] from <math>N</math> to the set of parsing expressions on <math>N \cup \Sigma</math>, and <math>e_S</math> is the starting parsing expression.
 
<!--- old text:
Formally, a parsing expression grammar consists of:
* A finite set ''N'' of ''[[nonterminal symbol]]s''.
* A finite set Σ of ''[[terminal symbol]]s'' that is [[disjoint sets|disjoint]] from ''N''.
* A finite set ''P'' of ''parsing rules''.
* An expression ''e<sub>S</sub>'' termed the ''starting expression''.
 
Each parsing rule in ''P'' has the form ''A'' ← ''e'', where ''A'' is a nonterminal symbol and ''e'' is a ''parsing expression''. A parsing expression is a hierarchical [[expression (mathematics)|expression]] similar to a [[regular expression]], which is constructed in the following fashion:
--->
 
=== Semantics ===