Content deleted Content added
By 'terminal' they seem to have meant 'nonterminal'? |
m →top: HTTP to HTTPS for SourceForge |
||
(104 intermediate revisions by 63 users not shown) | |||
Line 1:
{{Short description|Algorithm used to analyze and process programming languages}}
In [[computer science]], a '''canonical LR parser''' or '''LR(1) parser''' is an [[LR(k)]] parser for ''k=1'', i.e. with a single [[Parsing#Lookahead|lookahead]] [[terminal symbol|terminal]]. The special attribute of this parser is that all LR(k) parsers with ''k>1'' can be transformed into a LR(1) parser.<ref name="Knuth 1965">{{cite journal | last=Knuth | first=Donald | title=On the Translation of Languages from Left to Right | work=Information and Control | volume=8 | pages=707 - 639 | year=1965 | month=July | url=http://www.cs.dartmouth.edu/~mckeeman/cs48/mxcom/doc/knuth65.pdf | accessdate=29 May 2011 }}</ref> It can handle all [[deterministic context-free language]]s <ref name="Knuth 1965"/> but is mostly avoided because of its high memory requirements in favor of less powerful but more efficient alternatives such as the [[LALR]] and the [[LL(k)]] parser.▼
A '''canonical LR parser''' (also called a '''LR(1) parser''') is a type of [[bottom-up parsing]] algorithm used in [[computer science]] to analyze and process [[programming language]]s. It is based on the [[LR parsing]] technique, which stands for "left-to-right, rightmost derivation in reverse."
▲
Like most parsers, this parser is automatically generated by [[compiler compiler]]s like [[GNU Bison]] and Menhir.<ref>{{cite web|title=What is Menhir?|url=http://cristal.inria.fr/~fpottier/menhir/|publisher=INRIA, CRISTAL project|accessdate=29 June 2012}}</ref>▼
▲Like most parsers,
== History ==
In 1965 [[Donald Knuth]] invented the
Canonical LR(1) parsers have the practical disadvantage of having enormous memory requirements for their internal parser-table representation. In 1969, Frank DeRemer suggested two simplified versions of the LR parser called [[LALR parser|LALR]] and [[Simple LR parser|SLR]]. These parsers require much less memory than Canonical LR(1) parsers, but have slightly less language-recognition power.<ref name="DeRemer 1969">{{cite web|title=Practical Translators for LR(k) languages |url=http://computer-refuge.org/bitsavers/pdf/mit/lcs/tr/MIT-LCS-TR-065.pdf |author=Franklin L. DeRemer |publisher=MIT, PhD Dissertation |year=1969 |url-status=dead |archive-url=https://web.archive.org/web/20120405124425/http://computer-refuge.org/bitsavers/pdf/mit/lcs/tr/MIT-LCS-TR-065.pdf |archive-date=April 5, 2012 }}</ref> LALR(1) parsers have been the most common implementations of the LR Parser.
▲In 1965 [[Donald Knuth]] invented the [[LR parser|LR(k) parser]] ('''L'''eft to right, [[Rightmost_derivation#Derivations_and_syntax_trees|'''R'''ightmost derivation]] parser) a type of [[shift-reduce parser]], as a generalization of existing [[precendence parser]]s. This parser has the potential of recognizing all deterministic context-free languages and can produce both left and right derivations of grammar rules. Knuth proved that it reaches its maximum language recognition power for k=1 and provided a method for transforming LR(k), k > 1 grammars into a LR(1) grammar. <ref name="Knuth 1965"/>
== Overview ==
The LR(1) parser is a [[deterministic automaton]] and as such its operation is based on static [[state transition table]]s. These codify the grammar of the language it recognizes and are typically called "parsing tables".
The parsing tables of the LR(1) parser are parameterized with a lookahead terminal. Simple parsing tables, like those used by the [[LR(0)]] parser represent grammar rules of the form
: A1 → A
which means that if we go
: A1 → A
which means that the
: A1 → A
: A2 → A
: A3 → A
: A4 → A
The same would not be true if a lookahead terminal was not being taken into account. Parsing errors can be identified without the parser having to read the whole input by declaring some rules as errors. For example,
: E1 → B
can be declared an error, causing the parser to stop. This means that the lookahead information can also be used to catch errors, as in the following example:
: A1 → A
: A1 → A
: A1 → A
: E1 → A
In this case A
The lookahead can also be helpful in deciding when to reduce a rule. The lookahead can help avoid reducing a specific rule if the lookahead is not valid, which would probably mean that the current state should be combined with the following instead of the previous state. That means
*
* Rules:
:: A1 → A
:: A2 → B
the
: A
instead of
: A1
if the lookahead after the parser went to state B wasn't acceptable, i.e. no transition rule existed.
: X → y
which allows for
LR(1) parsers have the requirement that each rule should be expressed in a complete LR(1) manner, i.e. a sequence of two states with a specific lookahead. That makes simple rules such as
: X → y
requiring a great many artificial rules that essentially enumerate the combinations of all the possible states and lookahead terminals that can follow. A similar problem appears for implementing non-lookahead rules such as
: A1 → A
where all the possible lookaheads must be enumerated. That is the reason why LR(1) parsers cannot be practically implemented without significant memory optimizations.<ref name="Pager 1977"/>
== Constructing LR(1) parsing tables ==
LR(1) parsing tables are constructed in the same way as [[
=== Parser items ===
Starting from the [[
For example, assume a language consisting of the terminal symbols 'n', '+', '(', ')', the nonterminals 'E', 'T', the starting rule 'S' and the following production rules:
: S → E
: E → T
: E → ( E )
: T → n
: T → + T
: T → T + n
Items sets will be generated by analog to the procedure for LR(0) parsers. The item set 0 which represents the initial state will be created from the starting rule:
: [S → • E, $]
The dot '•' denotes the marker of the current parsing position within this rule. The expected lookahead terminal to apply this rule is noted after the comma. The '$' sign is used to denote 'end of input' is expected, as is the case for the starting rule.
This is not the complete item set 0, though. Each item set must be 'closed', which means all production rules for each nonterminal following a '•' have to be recursively included into the item set until all of those nonterminals are dealt with. The resulting item set is called the closure of the item set we began with.
For LR(1) for each production rule an item has to be included for each possible lookahead terminal following the rule. For more complex languages this usually results in very large item sets, which is the reason for the large memory requirements of LR(1) parsers.
In our example, the starting symbol requires the nonterminal '
: [S → • E]
: [E → • T]
: [E → • ( E )]
: [T → • n]
: [T → • + T]
: [T → • T + n]
=== FIRST and FOLLOW sets ===
To determine
FIRST(A) is the set of terminals which can appear as the first element of any chain of rules matching nonterminal A. FOLLOW(I) of an Item I [A → α • B β, x] is the set of terminals that can appear immediately after [[nonterminal]] B, where α, β are arbitrary symbol strings, and x is an arbitrary lookahead terminal. FOLLOW(k,B) of an item set k and
In our example, as one can verify from the full list of item sets below, the first sets are:
: FIRST(
: FIRST(
: FIRST(T) = { n, '+' }
=== Determining lookahead terminals ===
Within item set 0 the follow sets can be found to be:
: FOLLOW(0,S) = { $ }
: FOLLOW(0,E) = { $, ')'}
: FOLLOW(0,T) = { $, '+', ')' }
From this the full item set 0 for an LR(1) parser can be created, by creating for each item in the LR(0) item set one copy for each terminal in the follow set of the LHS nonterminal. Each element of the follow set may be a valid lookahead terminal:
: [S → • E, $]
: [E → • T, $]
: [E → •
: [
: [
: [T → •
: [T → •
: [T → •
: [T → •
: [T → • + T, +]
: [T → • + T, )]
: [T → • T + n, $]
: [T → • T + n, +]
: [T → • T + n, )]
=== Creating new item sets ===
The rest of the item sets can be created by the following algorithm
: 1. For each terminal and nonterminal symbol A appearing after a '•' in each already existing item set k, create a new item set m by adding to m all the rules of k where '•' is followed by A, but only if m will not be the same as an already existing item set after step 3.
: 2. shift all the '•'s for each rule in the new item set one symbol to the right
: 3. create the closure of the new item set
: 4. Repeat from step 1 for all newly created item sets, until no more new sets appear
In the example we get 5 more sets from item set 0, item set 1 for nonterminal E, item set 2 for nonterminal T, item set 3 for terminal n, item set 4 for terminal '+' and item set 5 for '('.
Line 136 ⟶ 145:
Item set 1 (E):
: [S → E •, $]
Item set 2 (T):
: [E → T •, $]
: [T → T • + n, $]
: [T → T • + n, +]
: ·
: ·
: ·
Item set 3 (n):
: [T → n •, $]
: [T → n •, +]
: [T → n •, )]
Item set 4 ('+'):
: [T → + • T, $]
: [T → + • T, +]
: [T → • n, $]
: [T → • n, +]
: [T → • + T, $]
: [T → • + T, +]
: [T → • T + n, $]
: [T → • T + n, +]
: ·
: ·
: ·
Item set 5 ('('):
: [E → ( • E ), $]
: [E → • T, )]
: [E → • ( E ), )]
: [T → • n, )]
: [T → • n, +]
: [T → • + T, )]
: [T → • + T, +]
: [T → • T + n, )]
: [T → • T + n, +]
: ·
: ·
: ·
From item sets 2, 4 and 5 several
=== Goto ===
The lookahead of an LR(1) item is used directly only when considering reduce actions (i.e., when the • marker is at the right end).
The '''core''' of an LR(1) item [S → a A • B e, c] is the LR(0) item S → a A • B e.
For example, in item set 2
: [E → T •, $]
: [T → T • + n, $]
: [T → T • + n, +]
the parser is required to perform the reduction [E → T] if the next symbol is '$', but to do a shift if the next symbol is '+'. Note that a LR(0) parser would not be able to make this decision, as it only considers the core of the items, and would thus report a shift/reduce conflict.
A state containing [A → α • X β, a] will move to a state containing [A → α X • β, a] with label X.
Line 192 ⟶ 211:
=== Shift actions ===
If [A → α • b β, a] is in state I<sub>k</sub> and I<sub>k</sub> moves to state I<sub>m</sub> with label b, then we add the action
: action[I<sub>k</sub>, b] = "shift m"
=== Reduce actions ===
If [A→α •, a] is in state I<sub>k</sub>, then we add the action
== References ==
{{Reflist}}▼
== External links ==
▲{{Reflist}}
* [http://david.tribble.com/text/lrk_parsing.html Practical LR(k) Parser Construction] HTML page, David Tribble
* [https://code.activestate.com/recipes/579140-follow-sets-construction/?in=user-4174427 First & Follow Sets Construction in Python (Narayana Chikkam)]
{{Parsers}}
[[Category:Parsing algorithms]]
|