Stanford Compilers (CS143) course notes
These notes were taken while I was studying Stanford's CS143 course and refreshing my rusty knowledge of compilers. I decided to write them down because, every now and then, I need to recall some specific detail of a parsing algorithm, and having a reference at hand comes in handy. Hopefully, Stanford students and compiler geeks will find the following helpful when solving their assignments.
Feel free to mail me with any corrections or suggestions.
Top-down parsing
Graph-search with backtracking
BFS (Breadth First Search) parsing
- Maintain a worklist of sentential forms, initially holding just the start symbol
- While the worklist isn't empty:
  - Remove a sentential form a from the worklist
  - If a matches the target string, we're done. Otherwise, for each string b that can be derived from a in one step (i.e. a => b), add b to the worklist. Notice that a sentential form may consist of both terminals and non-terminals.
The parse tree is tracked alongside each sentential form in the worklist.
BFS is very slow:
- Wasted effort in expanding sentential forms that will never be matched
- High branching factor
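To make the above concrete, here is a minimal Python sketch of the BFS parser. The grammar encoding (a dict mapping each non-terminal to its right-hand sides, with upper-case letters as non-terminals) is my own illustrative choice, not something from the course material:

```python
from collections import deque

# Toy grammar: each non-terminal maps to its right-hand sides. Upper-case
# letters are non-terminals, lower-case letters are terminals.
GRAMMAR = {"S": ["aSb", "ab"]}

def bfs_parse(grammar, start, target):
    """Return True if `target` can be derived from `start`."""
    worklist = deque([start])
    seen = {start}
    while worklist:
        form = worklist.popleft()
        if form == target:
            return True
        # Prune forms longer than the target; safe here because the toy
        # grammar has no epsilon-productions, so forms never shrink.
        if len(form) > len(target):
            continue
        # Expand every non-terminal, one derivation step at a time.
        for i, sym in enumerate(form):
            if sym in grammar:
                for rhs in grammar[sym]:
                    new = form[:i] + rhs + form[i + 1:]
                    if new not in seen:
                        seen.add(new)
                        worklist.append(new)
    return False

print(bfs_parse(GRAMMAR, "S", "aabb"))   # True
print(bfs_parse(GRAMMAR, "S", "abab"))   # False
```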
Leftmost BFS parsing
- Same as above but only considers leftmost derivations that match the input string
- Less wasted effort and a lower branching factor
- Will always find a valid parse of a program if one exists
- Can be modified to detect that a program can't be parsed
- Left-recursion might result in exponential parsing time and memory
Leftmost DFS (Depth First Search) parsing
- Lower memory usage
- High performance
- Easy to implement
- Fails on left-recursive grammars (but left-recursion can be automatically eliminated)
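Leftmost DFS is essentially recursive descent with backtracking. A minimal sketch under the same toy-grammar encoding as in the BFS example; note that a left-recursive production such as S -> Sa would make it recurse forever, matching the caveat above:

```python
# Same toy grammar encoding as in the BFS sketch above.
GRAMMAR = {"S": ["aSb", "ab"]}

def dfs_parse(grammar, form, target):
    """Return True if `target` derives from the sentential form `form`."""
    if form == target:
        return True
    if len(form) > len(target):   # safe: no epsilon-productions here
        return False
    for i, sym in enumerate(form):
        if sym in grammar:
            # Leftmost non-terminal: try each production depth-first.
            return any(
                dfs_parse(grammar, form[:i] + rhs + form[i + 1:], target)
                for rhs in grammar[sym]
            )
        if sym != target[i]:
            return False          # terminal prefix mismatch, prune
    return False

print(dfs_parse(GRAMMAR, "S", "aaabbb"))  # True
```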
Comparison of graph search methods
| BFS | DFS |
|---|---|
| Works on all grammars | Works on grammars without left-recursion |
| Worst-case runtime is exponential | Worst-case runtime is exponential |
| Worst-case memory is exponential | Worst-case memory is linear |
| Rarely used in practice | Used in limited form as recursive descent |
Predictive top-down parsing
- LL(1)
  - L: Left-to-right scan of tokens
  - L: Leftmost derivation
  - 1 token of lookahead
- Predictive top-down parsing is based on FIRST sets
Computing FIRST sets
epsilon-free FIRST
The following fixed-point iteration (a transitive closure algorithm) can be used:
- For all non-terminals A, set FIRST(A) = { t | A -> tw for some w }
- Repeat until no changes occur:
  - For each non-terminal A, for each production A -> Bw, set FIRST(A) = FIRST(A) U FIRST(B)
FIRST with epsilon
When the grammar contains epsilon-productions:
- For all non-terminals A, set FIRST(A) = { t | A -> tw for some w }
- For all non-terminals A where A -> e is a production, add e to FIRST(A)
- Repeat until no changes occur:
  - For each production A -> a, where a is a string of non-terminals whose FIRST sets all contain e, set FIRST(A) = FIRST(A) U { e }
  - For each production A -> atw, where a is a string of non-terminals whose FIRST sets all contain e and t is a terminal, set FIRST(A) = FIRST(A) U { t }
  - For each production A -> aBw, where a is a string of non-terminals whose FIRST sets all contain e and B is a non-terminal, set FIRST(A) = FIRST(A) U (FIRST(B) - {e})
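A sketch of the epsilon-aware fixed-point computation in Python. The grammar encoding is an illustrative choice of my own: right-hand sides are tuples of symbols, the empty tuple stands for an epsilon-production, and "e" denotes epsilon, matching the notes' notation:

```python
EPS = "e"   # stands for the empty string, matching the notes' notation

def first_sets(grammar, terminals):
    """Fixed-point computation of FIRST for every non-terminal."""
    first = {a: set() for a in grammar}
    changed = True
    while changed:
        changed = False
        for a, productions in grammar.items():
            for rhs in productions:
                before = len(first[a])
                nullable_prefix = True   # does the scanned prefix derive e?
                for sym in rhs:
                    if sym in terminals:
                        first[a].add(sym)           # A -> atw case
                        nullable_prefix = False
                        break
                    first[a] |= first[sym] - {EPS}  # A -> aBw case
                    if EPS not in first[sym]:
                        nullable_prefix = False
                        break
                if nullable_prefix:                 # whole rhs derives e
                    first[a].add(EPS)
                if len(first[a]) != before:
                    changed = True
    return first

# Example: S -> Ab | c ; A -> e | a
g = {"S": [("A", "b"), ("c",)], "A": [(), ("a",)]}
print(first_sets(g, {"a", "b", "c"}))
# FIRST(S) = {a, b, c}, FIRST(A) = {e, a} (set order may vary)
```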
LL(1) table construction algorithm
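In brief, the standard construction: for each production A -> a, set T[A, t] = a for every terminal t in FIRST(a); if a can derive e, also set T[A, t] = a for every t in FOLLOW(A). Any entry filled twice means the grammar isn't LL(1). A sketch, reusing first_sets and EPS from the section above; the follow dict is assumed precomputed:

```python
def first_of_string(rhs, terminals, first):
    """FIRST of a string of grammar symbols."""
    out = set()
    for sym in rhs:
        if sym in terminals:
            out.add(sym)
            return out
        out |= first[sym] - {EPS}
        if EPS not in first[sym]:
            return out
    out.add(EPS)          # every symbol in rhs is nullable
    return out

def build_ll1_table(grammar, terminals, first, follow):
    """T[(A, t)] = right-hand side to expand; raises on LL(1) conflicts."""
    table = {}
    for a, productions in grammar.items():
        for rhs in productions:
            fs = first_of_string(rhs, terminals, first)
            lookaheads = (fs - {EPS}) | (follow[a] if EPS in fs else set())
            for t in lookaheads:
                if (a, t) in table:
                    raise ValueError(f"not LL(1): conflict at ({a}, {t})")
                table[(a, t)] = rhs
    return table
```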
LL(1) parsing algorithm
- Initialize the stack to S$ (start symbol atop the end-of-input marker)
- Repeat until the stack is empty:
  - Let the next token of the input be t
  - If the top of the stack is a terminal r:
    - If r and t don't match, report an error
    - Otherwise consume t and pop r from the stack
  - If the top of the stack is a non-terminal A:
    - If T[A, t] is undefined, report an error
    - Otherwise pop A and push the right-hand side of production T[A, t], leftmost symbol on top
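The corresponding table-driven parser, as a sketch; the table format is the one produced by the previous sketch, and "$" is the end-of-input marker from the algorithm above:

```python
END = "$"

def ll1_parse(table, terminals, start, tokens):
    """Table-driven LL(1) parse; `tokens` excludes the end marker."""
    stack = [END, start]              # top of stack is the last element
    tokens = list(tokens) + [END]
    pos = 0
    while stack:
        top = stack.pop()
        t = tokens[pos]
        if top in terminals or top == END:
            if top != t:
                raise SyntaxError(f"expected {top!r}, got {t!r}")
            pos += 1                  # match: consume t (top already popped)
        else:
            rhs = table.get((top, t))
            if rhs is None:
                raise SyntaxError(f"no entry for ({top!r}, {t!r})")
            stack.extend(reversed(rhs))   # leftmost rhs symbol ends up on top

    return True

# With the example grammar above and FOLLOW(A) = {b}, FOLLOW(S) = {$}:
# ll1_parse(table, {"a", "b", "c"}, "S", ["a", "b"]) returns True.
```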
Comparison of graph-search-based vs. predictive top-down parsing
| Graph-search-based | Predictive top-down |
|---|---|
| Can parse all grammars | Can parse only a restricted class of grammars |
Bottom-up parsing
- Bottom-up parsing is the process of building the input's syntax tree in reverse
- Bottom-up parsing is based on handle recognition. Handles are defined as either:
  - The leftmost complete cluster of leaf nodes
  - An expansion of the rightmost non-terminal of the input
- Handles are always found at the top of the parser's stack
- The series of presentations covers four bottom-up shift/reduce parsing algorithms, namely LR(0), SLR(1), LALR(1) and LR(1), sorted in order of parsing power, from least to most powerful. However, the aforementioned algorithms can be better understood when studied in the following order: LR(0), LR(1), SLR(1), LALR(1).
- The aforementioned algorithms are:
  - Directional - Scan the input from left to right (LR)
  - Predictive - Guess which production should be inverted
LR(0) automata construction algorithm
The algorithm for constructing the LR(0) NFA can be found in [02] p.134:
Algorithm #1
- Create a state for each item, i.e. for each production A -> w and each possible way of splitting w = xy, create a state A -> x.y
- Add transitions on c between A -> x.cy and A -> xc.y
- For each state A -> x.By, where B is a non-terminal, add an e-transition from A -> x.By to B -> .w for each production B -> w
Using subset construction, the NFA above may be converted to a DFA. However, a different yet equivalent approach can be followed to directly construct an LR(0) DFA. See [02] p.162.
Algorithm #2
- Begin with a state containing S' -> .S, where S' -> S is a fresh start production
- Compute the closure of the state:
  - If A -> x.By is in the state, add B -> .w to the state for each production B -> w
- Repeat until no new states are added:
  - If a state contains an item A -> x.cy for symbol c, add a transition on c from that state to the state containing the closure of A -> xc.y
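A sketch of Algorithm #2 in Python, with items encoded as (lhs, rhs, dot) triples; the encoding and the grammar dict are my own illustrative choices:

```python
from collections import defaultdict

def closure(grammar, items):
    """LR(0) closure: if A -> x.By is present, add B -> .w for all B -> w."""
    items = set(items)
    changed = True
    while changed:
        changed = False
        for (lhs, rhs, dot) in list(items):
            if dot < len(rhs) and rhs[dot] in grammar:
                for prod in grammar[rhs[dot]]:
                    item = (rhs[dot], prod, 0)
                    if item not in items:
                        items.add(item)
                        changed = True
    return frozenset(items)

def build_lr0_dfa(grammar, start):
    """States are frozensets of items; edges map (state, symbol) to state."""
    init = closure(grammar, {(start + "'", (start,), 0)})   # S' -> .S
    states, edges, work = {init}, {}, [init]
    while work:
        state = work.pop()
        moves = defaultdict(set)
        for (lhs, rhs, dot) in state:
            if dot < len(rhs):            # A -> x.cy: move the dot over c
                moves[rhs[dot]].add((lhs, rhs, dot + 1))
        for sym, kernel in moves.items():
            nxt = closure(grammar, kernel)
            edges[(state, sym)] = nxt
            if nxt not in states:
                states.add(nxt)
                work.append(nxt)
    return states, edges, init

g = {"S": [("a", "S", "b"), ("a", "b")]}
states, edges, init = build_lr0_dfa(g, "S")
print(len(states))    # LR(0) state count for the toy grammar
```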
LR(0) parsing algorithm
- Initialize the stack to hold (?, 1), where 1 is the start state and ? is a placeholder symbol
- While the stack is not empty:
  - Let state be the state atop the stack
  - If action[state] is shift:
    - Let the next symbol of the input be t
    - Shift the input and push (t, goto[state, t])
  - If action[state] is reduce A -> w:
    - Pop |w| symbols off the stack
    - Let top-state be the state atop the stack
    - Push (A, goto[top-state, A])
  - Otherwise report an error
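A generic shift/reduce driver matching the algorithm above, as a sketch. The action/goto table encoding is my own; note that keying action on (state, t) instead of state alone turns this into the SLR(1)/LR(1) driver described later:

```python
def lr_parse(action, goto, start_state, tokens):
    """Generic shift/reduce driver; action maps (state, token) to a tuple:
    ("shift",), ("reduce", A, n) with n = |w| for A -> w, or ("accept",)."""
    stack = [("?", start_state)]   # (symbol, state) pairs; "?" is a placeholder
    tokens = list(tokens) + ["$"]
    pos = 0
    while True:
        state = stack[-1][1]
        t = tokens[pos]
        act = action.get((state, t))
        if act is None:
            raise SyntaxError(f"unexpected {t!r} in state {state}")
        if act[0] == "shift":
            stack.append((t, goto[(state, t)]))
            pos += 1
        elif act[0] == "reduce":
            _, lhs, rhs_len = act
            if rhs_len:
                del stack[-rhs_len:]          # pop |w| symbols
            top_state = stack[-1][1]
            stack.append((lhs, goto[(top_state, lhs)]))
        else:                                 # ("accept",)
            return True
```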
LR(1) automata construction algorithm
LR(1) is sometimes referred to in the literature as CLR(1) (the C stands for Canonical).
The algorithm for constructing the LR(1) NFA can be found in [02] p.325:
- Begin with the state [S' -> .S, $]
- For each production A -> w and each terminal t:
  - Construct states [A -> x.y, t] for all possible ways of splitting w = xy
  - Add transitions on c between [A -> x.cy, t] and [A -> xc.y, t]
- For each state [A -> x.By, t], add an e-transition from [A -> x.By, t] to [B -> .w, s] for each production B -> w and each terminal s in FIRST(yt)
To directly construct the LR(1) DFA:
- Begin in a state containing [S' -> .S, $], where S is the start symbol
- Compute the closure of the state:
  - If [A -> x.By, t] is in the state, add [B -> .w, s] to the state for each production B -> w and for each terminal s in FIRST(yt)
- Repeat until no new states are added:
  - If a state contains an item [A -> x.cy, t], add a transition on c from that state to the state containing the closure of [A -> xc.y, t]
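The only genuinely new machinery relative to LR(0) is the lookahead computation in the closure. A sketch, with LR(1) items encoded as (lhs, rhs, dot, lookahead) tuples; EPS and the first dict come from the FIRST sketch earlier:

```python
def first_of_seq(seq, terminals, first):
    """FIRST of a symbol string that always ends in a terminal lookahead."""
    out = set()
    for sym in seq:
        if sym in terminals or sym == "$":
            out.add(sym)
            return out
        out |= first[sym] - {EPS}
        if EPS not in first[sym]:
            return out
    return out

def lr1_closure(grammar, terminals, first, items):
    """LR(1) closure over items (lhs, rhs, dot, lookahead)."""
    items = set(items)
    changed = True
    while changed:
        changed = False
        for (lhs, rhs, dot, la) in list(items):
            if dot < len(rhs) and rhs[dot] in grammar:    # [A -> x.By, t]
                rest = rhs[dot + 1:] + (la,)              # the string yt
                for prod in grammar[rhs[dot]]:
                    for s in first_of_seq(rest, terminals, first):
                        item = (rhs[dot], prod, 0, s)     # [B -> .w, s]
                        if item not in items:
                            items.add(item)
                            changed = True
    return frozenset(items)
```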
LR(1) parsing algorithm
- Begin with an empty stack and the input set to w$, where w is the string to parse. Set state to the initial state.
- Repeat the following:
  - Let the next token of the input be t
  - If action[state, t] is shift, then shift the input and set state = goto[state, t]
  - If action[state, t] is reduce A -> w:
    - Pop |w| symbols off the stack; replace them with A
    - Let top-state be the state atop the stack
    - Set state = goto[top-state, A]
  - If action[state, t] is accept, then the parse is done
  - If action[state, t] is error, report an error
SLR(1)
- SLR(1) stands for Simple LR(1)
- Modified LR(0) that uses lookahead to avoid shift/reduce conflicts
- Only reduce A -> w if the next token is in FOLLOW(A)
- In SLR(1), a lookahead of t practically means "t could follow A somewhere in the grammar", even if, in the particular state at hand, A couldn't possibly have that symbol after it.
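As a sketch, the SLR(1) rule amounts to filling reduce actions only for tokens in FOLLOW(A). States are the frozensets produced by the LR(0) construction above; the follow dict is assumed precomputed, and shift entries plus conflict checking are omitted:

```python
def slr_reduce_actions(states, follow):
    """Reduce entries only: a completed item A -> w. reduces solely on
    tokens in FOLLOW(A). Shift entries and conflict checks are omitted."""
    actions = {}
    for state in states:
        for (lhs, rhs, dot) in state:
            if dot == len(rhs):               # completed item A -> w.
                for t in follow[lhs]:
                    actions[(state, t)] = ("reduce", lhs, len(rhs))
    return actions
```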
LALR(1)
- LALR(1) is the topic discussed in [03]. Almost all of the slide deck is dedicated to explaining in detail the problems that LALR(1) tries to solve. Understanding LALR(1) is a piece of cake if you follow the author's train of thought.
- LALR(1) stands for Lookahead(1) LR(0)
- Conceptually, an LALR(1) automaton can be constructed by merging LR(1) states with equal cores (see the sketch at the end of this section). However, this is not very practical, as LR(1) automata tend to be large.
- The resulting LALR(1) automaton cannot have more shift/reduce conflicts than the original LR(1) automaton, since both automata share the same cores. However, extra reduce/reduce conflicts might arise.
- Two algorithms for more efficient construction of LALR(1) automata:
  - Lazy merging - Start constructing LR(1) states and maintain them in a worklist. As new states are created, their cores are looked up in the worklist and merging takes place.
  - The elegant way [04] - Discussed in "Simple computation of LALR(1) look-ahead sets" by Manuel E. Bermudez and George Logothetis:
    - Construct the LR(0) automaton
    - Construct an annotated grammar this way: for each item A -> .w in some state q:
      - Trace out the path w takes through the LR(0) automaton starting at q
      - Replace each non-terminal in w with a non-terminal annotated with the states connected by the edge labeled with that non-terminal
      - Replace A with a non-terminal annotated with the start and end state of the transition on A out of q
    - Compute FOLLOW sets of the resulting annotated grammar. The resulting sets are more precise than the original grammar's FOLLOW sets, as they implicitly hold context information.
    - Propagate the lookaheads - For each item A -> .w in a state q:
      - Let A' be the annotated non-terminal corresponding to A, following the transition on A out of q into some state r
      - Trace through the automaton along the path labeled by w. This will lead to a state containing an item A -> w.
      - Add the contents of FOLLOW(A') to the lookahead of A -> w.
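A sketch of the core-merging step mentioned above, with LR(1) items encoded as in the earlier LR(1) sketch. Remapping the automaton's edges onto the merged states is omitted:

```python
from collections import defaultdict

def merge_by_core(lr1_states):
    """Merge LR(1) states whose cores (items minus lookaheads) coincide."""
    merged = defaultdict(set)
    for state in lr1_states:
        core = frozenset((lhs, rhs, dot) for (lhs, rhs, dot, _la) in state)
        merged[core] |= state         # union the per-core lookaheads
    return [frozenset(s) for s in merged.values()]
```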
References
[01] https://web.stanford.edu/class/archive/cs/cs143/cs143.1128/lectures/04/Slides04B.pdf
[02] https://web.stanford.edu/class/archive/cs/cs143/cs143.1128/lectures/05/Slides05.pdf
[03] https://web.stanford.edu/class/archive/cs/cs143/cs143.1128/lectures/06/Slides06.pdf
[04] https://www.sciencedirect.com/science/article/pii/0020019089900793