Operator-precedence parser: Difference between revisions

Browse history interactively

← Previous edit

Content deleted Content added

VisualWikitext

Revision as of 05:24, 8 May 2008 edit Gulliveig (talk \| contribs) Extended confirmed users 738 edits No edit summary ← Previous edit		Latest revision as of 20:16, 15 August 2025 edit undo InternetArchiveBot (talk \| contribs) Bots, Pending changes reviewers 5,696,205 edits Rescuing 1 sources and tagging 0 as dead.) #IABot (v2.0.9.5
(108 intermediate revisions by 83 users not shown)
Line 1: {{Short description\|Bottom-up parser that interprets an operator-precedence grammar}} An '''operator precedence parser''' is a computer program that interprets an [[operator-precedence grammar]]. For example, most [[calculator]]s use operator precedence parsers to convert from [[infix notation]] with [[order of operations]] (the usual format humans use for mathematical expressions) into a different format they use internally to compute the result. In [[computer science]], an '''operator-precedence parser''' is a [[Bottom-up parsing\|bottom-up parser]] that interprets an [[operator-precedence grammar]]. For example, most [[calculator]]s use operator-precedence parsers to convert from the human-readable [[infix notation]] relying on [[order of operations]] to a format that is optimized for evaluation such as [[Reverse Polish notation]] (RPN). [[Edsger ~~Dijkstra\|~~Dijkstra]]'s [[shunting yard algorithm]] ~~(named after the [[classification yard\|shunting yard]])~~ is commonly used to implement operator -precedence parsers ~~which convert from infix notation to [[Reverse Polish notation]] (RPN)~~. == Relationship to other parsers == An operator-precedence parser is a simple [[shift-reduce parser]] that is capable of parsing a subset of [[LR parser\|LR(1)]] grammars. More precisely, the operator-precedence parser can parse all LR(1) grammars where two consecutive [[nonterminal]]s and [[empty string\|epsilon]] never appear in the right-hand side of any rule. Operator-precedence parsers are not used often in practice,; however they do have some properties that make them useful within a larger design. First, they are simple enough to write by hand, which is not generally the case with more sophisticated right shift-reduce parsers. Second, they can be written to consult an operator table at [[~~runtime~~Run time (program lifecycle phase)\|run time]], which makes them suitable for languages that can add to or change their operators while parsing. (An example is [[Haskell (programming language)\|Haskell]], which allows user-defined infix operators with custom associativity and precedence; consequently, an operator-precedence parser must be run on the program ''after'' parsing of all referenced modules.) [[~~Perl~~Raku 6(programming language)\|Raku]] "sandwiches" an operator-precedence parser in between two [[~~Recursive~~recursive descent parser]]s in order to achieve a balance of speed and dynamism~~. This is expressed in the virtual machine for Perl 6, [[Parrot virtual machine\|Parrot]] as the [[Parser Grammar Engine]] (PGE)~~. [[GNU Compiler Collection\|GCC]]'s C and C++ parsers, which are hand-coded [[recursive descent ~~parser]]s~~parsers, are both sped up by an operator-precedence parser that can quickly examine arithmetic expressions. Operator-precedence parsers are also embedded within [[compiler-compiler]]-generated parsers to noticeably speed up the recursive descent approach to expression parsing.<ref name="Harwell2008">{{cite web\| last=Harwell \| first=Sam \| date=2008-08-29 \| title=Operator precedence parser \| url=https://theantlrguy.atlassian.net/wiki/spaces/ANTLR3/pages/2687077/Operator+precedence+parser \| publisher=ANTLR3 Wiki \| access-date=2017-10-25}}</ref> == Precedence climbing method == ~~Bottom -up parsers are divided into 2 categories.~~ The precedence climbing method is a compact, efficient, and flexible algorithm for parsing expressions that was first described by Martin Richards and Colin Whitby-Strevens.<ref name="Richards1979">{{cite book \|last1=Richards \|first1=Martin \|last2=Whitby-Strevens \|first2=Colin \|title=BCPL — the language and its compiler \|year=1979 \|publisher=Cambridge University Press \|isbn=9780521219655}}</ref> ~~# Operator-precedence parser~~ ~~# LR Parsers~~ An infix-notation expression grammar in [[EBNF]] format will usually look like this: ~~To show that one grammar is '''operator precedence''', first it should be '''operator grammar.~~ ~~'''~~ ~~Operator precedence grammar is the only grammar which can construct the parse tree even though the given grammar is ambiguous.~~ <syntaxhighlight lang="abnf"> ~~== Example algorithm to parse infix notation ==~~ expression ::= equality-expression equality-expression ::= additive-expression ( ( '==' \| '!=' ) additive-expression ) * additive-expression ::= multiplicative-expression ( ( '+' \| '-' ) multiplicative-expression ) * multiplicative-expression ::= primary ( ( '' \| '/' ) primary ) primary ::= '(' expression ')' \| NUMBER \| VARIABLE \| '-' primary </syntaxhighlight> With many levels of precedence, implementing this grammar with a predictive recursive-descent parser can become inefficient. Parsing a number, for example, can require five function calls: one for each non-terminal in the grammar until reaching ''primary''. ~~An [[EBNF]] grammar that parses infix notation will usually look like this:~~ An operator-precedence parser can do the same more efficiently.<ref name="Harwell2008"/> The idea is that we can left associate the arithmetic operations as long as we find operators with the same precedence, but we have to save a temporary result to evaluate higher precedence operators. The algorithm that is presented here does not need an explicit stack; instead, it uses recursive calls to implement the stack. ~~<pre>~~ ~~expression ::= equality-expression~~ equality-expression ::= additive-expression ( ( '==' \| '!=' ) additive-expression ) * additive-expression ::= multiplicative-expression ( ( '+' \| '-' ) multiplicative-expression ) * multiplicative-expression ::= primary ( ( '' \| '/' ) primary ) ~~primary ::= '(' expression ')' \| NUMBER \| VARIABLE \| '-' primary~~ ~~</pre>~~ The algorithm is not a pure operator-precedence parser like the Dijkstra shunting yard algorithm. It assumes that the ''primary'' nonterminal is parsed in a separate subroutine, like in a recursive descent parser. With many levels of precedence, implementing this grammar with a [[Recursive descent parser\|predictive recursive-descent parser]] can become inefficient. Parsing a number, for example, can require five function calls (one for each non-terminal in the grammar, until we reach <em>primary</em>). === Pseudocode === An operator-precedence parser can do the same more efficiently. The idea is that we can left associate the arithmetic operations as long as we find operators with the same precedence, but we have to save a temporary result to evaluate higher precedence operators. The algorithm that is presented here does not need an explicit stack: instead, it uses recursive calls to implement the stack. The pseudocode for the algorithm is as follows. The parser starts at function ''parse_expression''. Precedence levels are greater than or equal to 0. The algorithm is not a pure operator-precedence parser like the Dijkstra [[shunting yard algorithm]]. It assumes that the <em>primary</em> nonterminal is parsed in a separate subroutine, like in a [[recursive descent parser]]. <u>parse_expression()</u> ~~=== Pseudo-code ===~~ '''return''' parse_expression_1(parse_primary(), 0) <u>parse_expression_1(lhs, min_precedence)</u> ~~The pseudo-code for the algorithm is as follows. The parser starts at function <em>parse_expression</em>. Precedence levels are greater or equal to 0.~~ ''lookahead'' := peek next token '''while''' ''lookahead'' is a binary operator whose precedence is >= ''min_precedence'' ~~''parse_expression'' ()~~ ~~return~~ ''~~parse_expression_1~~op'' (:= ''~~parse_primary~~lookahead'' ~~(), 0)~~ advance to next token ~~''parse_expression_1 (lhs, min_precedence)~~ ~~while the next token is a binary operator whose precedence is >= ''min_precedence''~~ ~~''op'' := next token~~ ''rhs'' := ''parse_primary'' () ~~while~~''lookahead'' ~~the~~:= peek next token ~~is a binary operator whose precedence is greater~~ '''while''' ''lookahead'' is a binary operator whose precedence is greater than ''op''<nowiki>'</nowiki>s, or a right-associative operator whose precedence is equal to ''op'''s ''rhs'' := ''parse_expression_1'' (''rhs'', precedence of ''op'' + (1 if ''lookahead'' precedence is greater, else 0)) ~~''lookahead'' := next token~~ ''~~rhs~~lookahead'' := ~~''parse_expression_1''~~peek ~~(''rhs'', lookahead's~~next ~~precedence)~~token ''lhs'' := the result of applying ''op'' with operands ''lhs'' and ''rhs'' '''return''' ''lhs'' Note that in the case of a production rule like this (where the operator can only appear once): <syntaxhighlight lang="abnf"> equality-expression ::= additive-expression ( '==' \| '!=' ) additive-expression </syntaxhighlight> the algorithm must be modified to accept only binary operators whose precedence is > ''min_precedence''. === Example execution of the algorithm === Line 61 ⟶ 65: An example execution on the expression 2 + 3 * 4 + 5 == 19 is as follows. We give precedence 0 to equality expressions, 1 to additive expressions, 2 to multiplicative expressions. ~~<em>~~''parse_expression_1~~</em>~~'' (~~<em>~~''lhs~~</em>~~'' = 2, ~~<em>~~''min_precedence~~</em>~~'' = 0) * the ~~next~~lookahead token is +, with precedence 1. the outer while loop is entered. * ~~<em>~~''op~~</em>~~'' is + (precedence 1) and the input is advanced * ~~<em>~~''rhs~~</em>~~'' is 3 * the ~~next~~lookahead token is , with precedence 2. ~~recursive~~the ~~invocation~~inner while loop is entered.<br>~~<em>~~''parse_expression_1~~</em>~~'' (~~<em>~~''lhs~~</em>~~'' = 3, ~~<em>~~''min_precedence~~</em>~~'' = 2) : the ~~next~~lookahead token is , with precedence 2. the outer while loop is entered. :: ~~<em>~~''op~~</em>~~'' is * (precedence 2) and the input is advanced ::* ~~<em>~~''rhs~~</em>~~'' is 4 ::* the next token is +, with precedence 1. nothe ~~recursive~~inner ~~invocation~~while loop is not entered. ::* ~~<em>~~''lhs~~</em>~~'' is assigned 34 = 12 :: the next token is +, with precedence 1. the outer while loop is left. :* 12 is returned. * the ~~next~~lookahead token is +, with precedence 1. nothe ~~recursive~~inner ~~invocation~~while loop is not entered. * ~~<em>~~''lhs~~</em>~~'' is assigned 2+12 = 14 * the ~~next~~lookahead token is +, with precedence 1. the outer while loop is not left. * ~~<em>~~''op~~</em>~~'' is + (precedence 1) and the input is advanced * ~~<em>~~''rhs~~</em>~~'' is 5 * the next token is ==, with precedence 0. nothe ~~recursive~~inner ~~invocation~~while loop is not entered. * ~~<em>~~''lhs~~</em>~~'' is assigned 14+5 = 19 * the next token is ==, with precedence 0. the outer while loop is not left. * ~~<em>~~''op~~</em>~~'' is == (precedence 0) and the input is advanced * ~~<em>~~''rhs~~</em>~~'' is 19 * the next token is ~~<em>~~''end-of-line~~</em>~~'', which is not an operator. nothe ~~recursive~~inner ~~invocation~~while loop is not entered. * ~~<em>~~''lhs~~</em>~~'' is assigned the result of evaluating 19 == 19, for example 1 (as in the C standard). * the next token is ~~<em>~~''end-of-line~~</em>~~'', which is not an operator. the outer while loop is left. 1 is returned. == Pratt parsing == ~~== Alternatives to Dijkstra's Algorithm ==~~ {{external links\|section\|date=August 2023}} Another precedence parser known as Pratt parsing was first described by [[Vaughan Pratt]] in the 1973 paper "Top Down Operator Precedence",<ref>Pratt, Vaughan. "[https://web.archive.org/web/20151223215421/http://hall.org.ua/halls/wizzard/pdf/Vaughan.Pratt.TDOP.pdf Top Down Operator Precedence]." ''Proceedings of the 1st Annual ACM SIGACT-SIGPLAN Symposium on Principles of Programming Languages'' (1973).</ref> based on [[Recursive descent parser\|recursive descent]]. Though it predates precedence climbing, it can be viewed as a generalization of precedence climbing.<ref>{{cite web \|last1=Norvell \|first1=Theodore \|title=Parsing Expressions by Recursive Descent \|url=http://www.engr.mun.ca/~theo/Misc/pratt_parsing.htm \|website=www.engr.mun.ca \|quote=The purpose of this post is to [... start] with precedence climbing and refactoring it to use the command pattern until we arrive at a Pratt parser. [This is the author who coined the term "precedence climbing".]}}</ref> Pratt designed the parser originally to implement the [[CGOL]] programming language, and it was treated in much more depth in a Masters Thesis under his supervision.<ref>Van De Vanter, Michael L. "[http://publications.csail.mit.edu/lcs/specpub.php?id=715 A Formalization and Correctness Proof of the CGOL Language System]." (Master's Thesis). MIT Laboratory for Computer Science Technical Report MIT-LCS-TR-147 (Cambridge, Massachusetts). 1975.</ref> ~~There are alternative ways to apply operator precedence rules which do not involve the Shunting Yard algorithm.~~ Tutorials and implementations: ~~One is to build a tree of the original expression, then apply tree rewrite rules to it.~~ * [[Douglas Crockford]] based the JavaScript parser in [[JSLint]] on Pratt parsing.<ref>{{cite web\|author=Crockford, D\|title=Top Down Operator Precedence\|date=2007-02-21\|url=http://crockford.com/javascript/tdop/tdop.html}}</ref> * Comparison between Python implementations of precedence climbing and Pratt parsing: [http://www.oilshell.org/blog/2016/11/01.html "Pratt Parsing and Precedence Climbing Are the Same Algorithm" (2016) by Andy Chu] * Tutorial using [[Rust (programming language)\|Rust]]: [https://matklad.github.io/2020/04/13/simple-but-powerful-pratt-parsing.html "Simple but Powerful Pratt Parsing" (2020) by Aleksey Kladov] * Tutorial using [[Rust (programming language)\|Rust]]: [https://williamr.dev/posts/pratt-parsing/ "The Pratt Parsing Technique" (2024) by William Rågstad] * Tutorial using [[Python (programming language)\|Python]]: [http://effbot.org/zone/simple-top-down-parsing.htm "Simple Top-Down Parsing in Python" (2008) by Fredrik Lundh] {{Webarchive\|url=https://web.archive.org/web/20150228044653/http://effbot.org/zone/simple-top-down-parsing.htm \|date=2015-02-28 }} * Tutorial using [[Java (programming language)\|Java]]: [http://journal.stuffwithstuff.com/2011/03/19/pratt-parsers-expression-parsing-made-easy/ "Pratt Parsers: Expression Parsing Made Easy" (2011) by Bob Nystrom, author of Crafting Interpreters] * Implementation in [[C Sharp (programming language)\|C#]]: [https://github.com/atifaziz/Gratt "Gratt: A Generic Vaughn Pratt's top-down operator precedence parser for .NET Standard"] (a [[Generic programming\|generic]] version inspired by the Java implementation presented by Bob Nystrom in [http://journal.stuffwithstuff.com/2011/03/19/pratt-parsers-expression-parsing-made-easy/ "Pratt Parsers: Expression Parsing Made Easy"]) == Alternative methods == Such trees do not necessarily need to be implemented using data structures conventionally used for trees. Instead, tokens can be stored in flat structures, such as tables, by simultaneously building a priority list which states what elements to process in which order. As an example, such an approach is used in the ''Mathematical Formula Decomposer'' <ref>http://herbert.gandraxa.com/herbert/mfd.asp</ref>. There are other ways to apply operator precedence rules. One is to build a tree of the original expression and then apply tree rewrite rules to it. Another is the algorithm used in the early FORTRAN I compiler, which is to first fully parenthesise the expression by a trivial macro replacement — inserting a number of parentheses around each operator, such that they lead to the correct precedence when parsed with a 'flat' grammar. (A hack which takes longer to explain properly than to write code for - see below!) Such trees do not necessarily need to be implemented using data structures conventionally used for trees. Instead, tokens can be stored in flat structures, such as tables, by simultaneously building a priority list which states what elements to process in which order. ~~<source lang="c">~~ === Full parenthesization === Another approach is to first fully parenthesize the expression, inserting a number of parentheses around each operator, such that they lead to the correct precedence even when parsed with a linear, left-to-right parser. This algorithm was used in the early [[Fortran#FORTRAN\|FORTRAN I]] compiler:<ref name="Padua2000">{{cite journal \|last1=Padua \|first1=David \|title=The Fortran I Compiler \|year=2000 \|journal=Computing in Science & Engineering \|volume=2 \|issue=1 \|pages=70–75 \|url=http://polaris.cs.uiuc.edu/publications/c1070.pdf \|doi=10.1109/5992.814661 \|bibcode=2000CSE.....2a..70P \|archive-date=2020-06-17 \|access-date=2016-03-29 \|archive-url=https://web.archive.org/web/20200617113640/http://polaris.cs.uiuc.edu/publications/c1070.pdf \|url-status=dead }}</ref> <blockquote> The Fortran I compiler would expand each operator with a sequence of parentheses. In a simplified form of the algorithm, it would * replace <code>+</code> and <code>–</code> with <code>))+((</code> and <code>))-((</code>, respectively; * replace <code></code> and <code>/</code> with <code>)(</code> and <code>)/(</code>, respectively; * add <code>((</code> at the beginning of each expression and after each left parenthesis in the original expression; and * add <code>))</code> at the end of the expression and before each right parenthesis in the original expression. Although not obvious, the algorithm was correct, and, in the words of [[Donald Knuth\|Knuth]], “The resulting formula is properly parenthesized, believe it or not.”<ref name="Knuth1962">{{cite journal \|last1=Knuth \|first1=Donald E. \|title=A HISTORY OF WRITING COMPILERS \|year=1962 \|publisher=Edmund C. Berkeley \|journal=Computers and Automation \|volume=11 \|issue=12 \|pages=8–14 \|url=https://archive.org/details/bitsavers_computersA_13990695}}</ref> </blockquote> {{missing information\|section\|why parenthesization works\|date=May 2023}} Example code of a simple C application that handles parenthesisation of basic math operators (<code>+</code>, <code>-</code>, <code></code>, <code>/</code>, <code>^</code>, <code>(</code> and <code>)</code><!-- parentheses -->): <syntaxhighlight lang="c"> #include <stdio.h> #include <string.h> ~~int main(int argc, char argv[]){~~ ~~int i;~~ // The command-line argument boundary is our lexer. int main(int argc, char argv[]) { int i; printf("(((("); for (i=1; i!=argc; i++) { ~~if(~~ // ~~strcmp~~strlen(argv[i],) ~~"^")~~==0) ~~printf(")^(");~~2 ~~else~~ if~~(strcmp~~ (argv[i], ~~"")==0~~&& !argv[i][1]) ~~printf("))((");~~{ ~~else~~ ~~if(strcmp~~ switch (argv[i]~~, "/")==0~~) ~~printf("))/((");~~{ ~~else~~ ~~if(strcmp(argv[i],~~ ~~"+")==0)~~ case '(': printf("~~)))+~~(((("); continue; ~~else~~ ~~if(strcmp(argv[i],~~ ~~"-")==0~~ case ')': printf(")))~~-(((~~)"); continue; ~~else~~ case '^': printf("%s)^(",); ~~argv[i])~~ continue; case '': printf("))(("); continue; case '/': printf("))/(("); continue; case '+': // unary check: either first or had an operator expecting secondary argument if (i == 1 \|\| strchr("(^/+-", argv[i-1])) printf("+"); else printf(")))+((("); continue; case '-': if (i == 1 \|\| strchr("(^/+-", argv[i-1])) printf("-"); else printf(")))-((("); continue; } } printf("%s", argv[i]); } printf("))))\n"); return 0; } </syntaxhighlight> ~~</source>~~ ~~Invoke it as: <pre>~~ First, you need to compile your program. Assuming your program is written in C and the source code is in a file named program.c, you would use the following command: ~~$ cc -o parenthesise parenthesise.c~~ <pre>gcc program.c -o program</pre> ~~$ ./parenthesise a \* b + c ^ d / e~~ The above command tells gcc to compile program.c and create an executable named program. ~~((((a))((b)))+(((c)^(d))/((e))))~~ ~~</pre>~~ Command to run the program with parameters, For example; a b + c ^ d / e <pre>./program a '' b + c '^' d / e</pre> it produces <pre>((((a))((b)))+(((c)^(d))/((e))))</pre> as output on the console. A limitation to this strategy is that unary operators must all have higher precedence than infix operators. The "negative" operator in the above code has a higher precedence than exponentiation. Running the program with this input <pre>- a ^ 2</pre> produces this output <pre>((((-a)^(2))))</pre> which is probably not what is intended. ==References== {{Reflist}} ~~<references/>~~ ==External links== * {{cite web\| last=Clarke \| first=Keith \| date=1992-05-26 \| title=Re: compact recursive-descent parsing of expressions \| url=http://groups.google.com/group/comp.compilers/browse_thread/thread/aee24b9504d527ca/1210ef8ae529accd?#1210ef8ae529accd \| access-date=2012-01-24}} [http://www.engr.mun.ca/~theo/Misc/exp_parsing.htm Parsing Expressions by Recursive Descent] Theodore Norvell (C) 1999-2001. Access data September 14, 2006. [http://groups.google.com/group/comp.compilers/browse_thread/thread/442ca111b2dfb38d/5390237263b3625b?#5390237263b3625b Example C++ code by Keith Clarke for parsing infix expressions using the precedence climbing method] * {{cite journal\| last=Samelson \| first=Klaus \| author-link=Klaus Samelson \|author2=Friedrich L. Bauer \|author2-link=Friedrich L. Bauer \|date=February 1960 \| title=Sequential formula translation \| journal=Communications of the ACM \| volume=3 \| issue=2 \| pages=76–83 \| doi=10.1145/366959.366968\| doi-access=free }} * [https://web.archive.org/web/20160817143616/http://radovan-augustin.info/techpapers/parser/Parser.pdf Parser for expression with infix notation using precedence lists] {{Parsers}} {{DEFAULTSORT:Operator-Precedence Parser}} [[Category:Parsing algorithms]] [[Category:Articles with example C code]]