Stanford Compilers (CS143) course notes
These notes were taken while I was studying Stanford's CS143 course and refreshing my rusty knowledge of compilers. I decided to write them down because, every now and then, I need to recall some specific detail of a parsing algorithm, and having a reference at hand comes in handy. Hopefully, Stanford students and compiler geeks will find the following helpful when solving their assignments.
Feel free to mail me with any corrections or suggestions.
Top-down parsing
Graph-search with backtracking
BFS (Breadth First Search) parsing
- Maintain a worklist of sentential forms, initially holding just the start symbol
- While the worklist isn't empty:
  - Remove a sentential form a from the worklist
  - If a matches the target string, we're done. Otherwise, for each string b that can be derived from a in one step (i.e. a => b), add b to the worklist. Notice that a sentential form may consist of both terminals and non-terminals.
The parse tree is tracked alongside each sentential form in the worklist.
BFS is very slow:
- Wasted effort in expanding sentential forms that will never be matched
- High branching factor
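To make the above concrete, here is a minimal Python sketch of the BFS parser. The grammar encoding (a dict mapping each non-terminal to its right-hand sides, with upper-case letters as non-terminals) is my own illustrative choice, not something from the course material:

```python
from collections import deque

# Toy grammar: each non-terminal maps to its right-hand sides. Upper-case
# letters are non-terminals, lower-case letters are terminals.
GRAMMAR = {"S": ["aSb", "ab"]}

def bfs_parse(grammar, start, target):
    """Return True if `target` can be derived from `start`."""
    worklist = deque([start])
    seen = {start}
    while worklist:
        form = worklist.popleft()
        if form == target:
            return True
        # Prune forms longer than the target; safe here because the toy
        # grammar has no epsilon-productions, so forms never shrink.
        if len(form) > len(target):
            continue
        # Expand every non-terminal, one derivation step at a time.
        for i, sym in enumerate(form):
            if sym in grammar:
                for rhs in grammar[sym]:
                    new = form[:i] + rhs + form[i + 1:]
                    if new not in seen:
                        seen.add(new)
                        worklist.append(new)
    return False

print(bfs_parse(GRAMMAR, "S", "aabb"))   # True
print(bfs_parse(GRAMMAR, "S", "abab"))   # False
```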
Leftmost BFS parsing
- Same as above but only considers leftmost derivations that match the input string
- Less wasted effort and a lower branching factor
- Will always find a valid parse of a program if one exists
- Can be modified to detect that a program can't be parsed
- Left-recursion might result in exponential parsing time and memory
Leftmost DFS (Depth First Search) parsing
- Lower memory usage
- High performance
- Easy to implement
- Fails on left-recursive grammars (but left-recursion can be automatically eliminated)
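Leftmost DFS is essentially recursive descent with backtracking. A minimal sketch under the same toy-grammar encoding as in the BFS example; note that a left-recursive production such as S -> Sa would make it recurse forever, matching the caveat above:

```python
# Same toy grammar encoding as in the BFS sketch above.
GRAMMAR = {"S": ["aSb", "ab"]}

def dfs_parse(grammar, form, target):
    """Return True if `target` derives from the sentential form `form`."""
    if form == target:
        return True
    if len(form) > len(target):   # safe: no epsilon-productions here
        return False
    for i, sym in enumerate(form):
        if sym in grammar:
            # Leftmost non-terminal: try each production depth-first.
            return any(
                dfs_parse(grammar, form[:i] + rhs + form[i + 1:], target)
                for rhs in grammar[sym]
            )
        if sym != target[i]:
            return False          # terminal prefix mismatch, prune
    return False

print(dfs_parse(GRAMMAR, "S", "aaabbb"))  # True
```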
Comparison of graph search methods
| BFS | DFS |
|---|---|
| Works on all grammars | Works on grammars without left-recursion |
| Worst-case runtime is exponential | Worst-case runtime is exponential |
| Worst-case memory is exponential | Worst-case memory is linear |
| Rarely used in practice | Used in limited form as recursive descent |
Predictive top-down parsing
- LL(1)
  - L: Left-to-right scan of tokens
  - L: Leftmost derivation
  - 1 token of lookahead
- Predictive top-down parsing is based on FIRST sets
Computing FIRST sets
epsilon-free FIRST
The following fixed-point iteration (a transitive closure algorithm) can be used:
- For all non-terminals A, set FIRST(A) = { t | A -> tw for some w }
- Repeat until no changes occur:
  - For each non-terminal A, for each production A -> Bw, set FIRST(A) = FIRST(A) U FIRST(B)
FIRST with epsilon
When the grammar contains epsilon-productions:
- For all non-terminals A, set FIRST(A) = { t | A -> tw for some w }
- For all non-terminals A where A -> e is a production, add e to FIRST(A)
- Repeat until no changes occur:
  - For each production A -> a, where a is a string of non-terminals whose FIRST sets all contain e, set FIRST(A) = FIRST(A) U { e }
  - For each production A -> atw, where a is a string of non-terminals whose FIRST sets all contain e and t is a terminal, set FIRST(A) = FIRST(A) U { t }
  - For each production A -> aBw, where a is a string of non-terminals whose FIRST sets all contain e and B is a non-terminal, set FIRST(A) = FIRST(A) U (FIRST(B) - {e})
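A sketch of the epsilon-aware fixed-point computation in Python. The grammar encoding is an illustrative choice of my own: right-hand sides are tuples of symbols, the empty tuple stands for an epsilon-production, and "e" denotes epsilon, matching the notes' notation:

```python
EPS = "e"   # stands for the empty string, matching the notes' notation

def first_sets(grammar, terminals):
    """Fixed-point computation of FIRST for every non-terminal."""
    first = {a: set() for a in grammar}
    changed = True
    while changed:
        changed = False
        for a, productions in grammar.items():
            for rhs in productions:
                before = len(first[a])
                nullable_prefix = True   # does the scanned prefix derive e?
                for sym in rhs:
                    if sym in terminals:
                        first[a].add(sym)           # A -> atw case
                        nullable_prefix = False
                        break
                    first[a] |= first[sym] - {EPS}  # A -> aBw case
                    if EPS not in first[sym]:
                        nullable_prefix = False
                        break
                if nullable_prefix:                 # whole rhs derives e
                    first[a].add(EPS)
                if len(first[a]) != before:
                    changed = True
    return first

# Example: S -> Ab | c ; A -> e | a
g = {"S": [("A", "b"), ("c",)], "A": [(), ("a",)]}
print(first_sets(g, {"a", "b", "c"}))
# FIRST(S) = {a, b, c}, FIRST(A) = {e, a} (set order may vary)
```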
LL(1) table construction algorithm
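In brief, the standard construction: for each production A -> a, set T[A, t] = a for every terminal t in FIRST(a); if a can derive e, also set T[A, t] = a for every t in FOLLOW(A). Any entry filled twice means the grammar isn't LL(1). A sketch, reusing first_sets and EPS from the section above; the follow dict is assumed precomputed:

```python
def first_of_string(rhs, terminals, first):
    """FIRST of a string of grammar symbols."""
    out = set()
    for sym in rhs:
        if sym in terminals:
            out.add(sym)
            return out
        out |= first[sym] - {EPS}
        if EPS not in first[sym]:
            return out
    out.add(EPS)          # every symbol in rhs is nullable
    return out

def build_ll1_table(grammar, terminals, first, follow):
    """T[(A, t)] = right-hand side to expand; raises on LL(1) conflicts."""
    table = {}
    for a, productions in grammar.items():
        for rhs in productions:
            fs = first_of_string(rhs, terminals, first)
            lookaheads = (fs - {EPS}) | (follow[a] if EPS in fs else set())
            for t in lookaheads:
                if (a, t) in table:
                    raise ValueError(f"not LL(1): conflict at ({a}, {t})")
                table[(a, t)] = rhs
    return table
```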
LL(1) parsing algorithm
- Initialize the stack to S$ (start symbol atop the end-of-input marker)
- Repeat until the stack is empty:
  - Let the next token of the input be t
  - If the top of the stack is a terminal r:
    - If r and t don't match, report an error
    - Otherwise consume t and pop r from the stack
  - If the top of the stack is a non-terminal A:
    - If T[A, t] is undefined, report an error
    - Otherwise pop A and push the right-hand side of production T[A, t], leftmost symbol on top
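The corresponding table-driven parser, as a sketch; the table format is the one produced by the previous sketch, and "$" is the end-of-input marker from the algorithm above:

```python
END = "$"

def ll1_parse(table, terminals, start, tokens):
    """Table-driven LL(1) parse; `tokens` excludes the end marker."""
    stack = [END, start]              # top of stack is the last element
    tokens = list(tokens) + [END]
    pos = 0
    while stack:
        top = stack.pop()
        t = tokens[pos]
        if top in terminals or top == END:
            if top != t:
                raise SyntaxError(f"expected {top!r}, got {t!r}")
            pos += 1                  # match: consume t (top already popped)
        else:
            rhs = table.get((top, t))
            if rhs is None:
                raise SyntaxError(f"no entry for ({top!r}, {t!r})")
            stack.extend(reversed(rhs))   # leftmost rhs symbol ends up on top

    return True

# With the example grammar above and FOLLOW(A) = {b}, FOLLOW(S) = {$}:
# ll1_parse(table, {"a", "b", "c"}, "S", ["a", "b"]) returns True.
```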
Comparison of graph-search-based vs. predictive top-down parsing
| Graph-search-based | Predictive top-down |
|---|---|
| Can parse all grammars | Can parse only a restricted class of grammars |
Bottom-up parsing
- Bottom-up parsing is the process of building the input's syntax tree in reverse
- Bottom-up parsing is based on handle recognition. Handles are defined as either:
  - The leftmost complete cluster of leaf nodes
  - An expansion of the rightmost non-terminal of the input
- Handles are always found at the top of the parser's stack
- The series of presentations covers four bottom-up shift/reduce parsing algorithms, namely LR(0), SLR(1), LALR(1) and LR(1), sorted in order of parsing power, from least to most powerful. However, the aforementioned algorithms can be better understood when studied in the following order: LR(0), LR(1), SLR(1), LALR(1).
- The aforementioned algorithms are:
  - Directional - Scan the input from left to right (LR)
  - Predictive - Guess which production should be inverted
LR(0) automata construction algorithm
The algorithm for constructing the LR(0) NFA can be found in [02] p.134:
Algorithm #1
- Create a state for each item, i.e. for each production A -> w and each possible way of splitting w = xy, create a state A -> x.y
- Add transitions on c between A -> x.cy and A -> xc.y
- For each state A -> x.By, where B is a non-terminal, add an e-transition from A -> x.By to B -> .w for each production B -> w
Using subset construction, the NFA above may be converted to a DFA. However, a different yet equivalent approach can be followed to directly construct an LR(0) DFA. See [02] p.162.
Algorithm #2
- Begin with a state containing S' -> .S, where S' -> S is a fresh start production
- Compute the closure of the state:
  - If A -> x.By is in the state, add B -> .w to the state for each production B -> w
- Repeat until no new states are added:
  - If a state contains an item A -> x.cy for symbol c, add a transition on c from that state to the state containing the closure of A -> xc.y
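A sketch of Algorithm #2 in Python, with items encoded as (lhs, rhs, dot) triples; the encoding and the grammar dict are my own illustrative choices:

```python
from collections import defaultdict

def closure(grammar, items):
    """LR(0) closure: if A -> x.By is present, add B -> .w for all B -> w."""
    items = set(items)
    changed = True
    while changed:
        changed = False
        for (lhs, rhs, dot) in list(items):
            if dot < len(rhs) and rhs[dot] in grammar:
                for prod in grammar[rhs[dot]]:
                    item = (rhs[dot], prod, 0)
                    if item not in items:
                        items.add(item)
                        changed = True
    return frozenset(items)

def build_lr0_dfa(grammar, start):
    """States are frozensets of items; edges map (state, symbol) to state."""
    init = closure(grammar, {(start + "'", (start,), 0)})   # S' -> .S
    states, edges, work = {init}, {}, [init]
    while work:
        state = work.pop()
        moves = defaultdict(set)
        for (lhs, rhs, dot) in state:
            if dot < len(rhs):            # A -> x.cy: move the dot over c
                moves[rhs[dot]].add((lhs, rhs, dot + 1))
        for sym, kernel in moves.items():
            nxt = closure(grammar, kernel)
            edges[(state, sym)] = nxt
            if nxt not in states:
                states.add(nxt)
                work.append(nxt)
    return states, edges, init

g = {"S": [("a", "S", "b"), ("a", "b")]}
states, edges, init = build_lr0_dfa(g, "S")
print(len(states))    # LR(0) state count for the toy grammar
```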
LR(0) parsing algorithm
- Initialize the stack to hold (?, 1), where 1 is the start state and ? is a placeholder symbol
- While the stack is not empty:
  - Let state be the state atop the stack
  - If action[state] is shift:
    - Let the next symbol of the input be t
    - Shift the input and push (t, goto[state, t])
  - If action[state] is reduce A -> w:
    - Pop |w| symbols off the stack
    - Let top-state be the state atop the stack
    - Push (A, goto[top-state, A])
  - Otherwise report an error
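A generic shift/reduce driver matching the algorithm above, as a sketch. The action/goto table encoding is my own; note that keying action on (state, t) instead of state alone turns this into the SLR(1)/LR(1) driver described later:

```python
def lr_parse(action, goto, start_state, tokens):
    """Generic shift/reduce driver; action maps (state, token) to a tuple:
    ("shift",), ("reduce", A, n) with n = |w| for A -> w, or ("accept",)."""
    stack = [("?", start_state)]   # (symbol, state) pairs; "?" is a placeholder
    tokens = list(tokens) + ["$"]
    pos = 0
    while True:
        state = stack[-1][1]
        t = tokens[pos]
        act = action.get((state, t))
        if act is None:
            raise SyntaxError(f"unexpected {t!r} in state {state}")
        if act[0] == "shift":
            stack.append((t, goto[(state, t)]))
            pos += 1
        elif act[0] == "reduce":
            _, lhs, rhs_len = act
            if rhs_len:
                del stack[-rhs_len:]          # pop |w| symbols
            top_state = stack[-1][1]
            stack.append((lhs, goto[(top_state, lhs)]))
        else:                                 # ("accept",)
            return True
```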
LR(1) automata construction algorithm
LR(1) is sometimes referred to in the literature as CLR(1) (the C stands for Canonical).
The algorithm for constructing the LR(1) NFA can be found in [02] p.325:
- Begin with the state [S' -> .S, $]
- For each production A -> w and each terminal t:
  - Construct states [A -> x.y, t] for all possible ways of splitting w = xy
  - Add transitions on c between [A -> x.cy, t] and [A -> xc.y, t]
- For each state [A -> x.By, t], add an e-transition from [A -> x.By, t] to [B -> .w, s] for each production B -> w and each terminal s in FIRST(yt)
To directly construct the LR(1) DFA:
- Begin in a state containing [S' -> .S, $], where S is the start symbol
- Compute the closure of the state:
  - If [A -> x.By, t] is in the state, add [B -> .w, s] to the state for each production B -> w and for each terminal s in FIRST(yt)
- Repeat until no new states are added:
  - If a state contains an item [A -> x.cy, t], add a transition on c from that state to the state containing the closure of [A -> xc.y, t]
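The only genuinely new machinery relative to LR(0) is the lookahead computation in the closure. A sketch, with LR(1) items encoded as (lhs, rhs, dot, lookahead) tuples; EPS and the first dict come from the FIRST sketch earlier:

```python
def first_of_seq(seq, terminals, first):
    """FIRST of a symbol string that always ends in a terminal lookahead."""
    out = set()
    for sym in seq:
        if sym in terminals or sym == "$":
            out.add(sym)
            return out
        out |= first[sym] - {EPS}
        if EPS not in first[sym]:
            return out
    return out

def lr1_closure(grammar, terminals, first, items):
    """LR(1) closure over items (lhs, rhs, dot, lookahead)."""
    items = set(items)
    changed = True
    while changed:
        changed = False
        for (lhs, rhs, dot, la) in list(items):
            if dot < len(rhs) and rhs[dot] in grammar:    # [A -> x.By, t]
                rest = rhs[dot + 1:] + (la,)              # the string yt
                for prod in grammar[rhs[dot]]:
                    for s in first_of_seq(rest, terminals, first):
                        item = (rhs[dot], prod, 0, s)     # [B -> .w, s]
                        if item not in items:
                            items.add(item)
                            changed = True
    return frozenset(items)
```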
LR(1) parsing algorithm
- Begin with an empty stack and the input set to w$, where w is the string to parse. Set state to the initial state.
- Repeat the following:
  - Let the next token of the input be t
  - If action[state, t] is shift, then shift the input and set state = goto[state, t]
  - If action[state, t] is reduce A -> w:
    - Pop |w| symbols off the stack; replace them with A
    - Let top-state be the state atop the stack
    - Set state = goto[top-state, A]
  - If action[state, t] is accept, then the parse is done
  - If action[state, t] is error, report an error
SLR(1)
- SLR(1) stands for Simple LR(1)
- Modified LR(0) that uses lookahead to avoid shift/reduce conflicts
- Only reduce A -> w if the next token is in FOLLOW(A)
- In SLR(1), a lookahead of t practically means "t could follow A somewhere in the grammar", even if, in the particular state at hand, A couldn't possibly have that symbol after it.
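As a sketch, the SLR(1) rule amounts to filling reduce actions only for tokens in FOLLOW(A). States are the frozensets produced by the LR(0) construction above; the follow dict is assumed precomputed, and shift entries plus conflict checking are omitted:

```python
def slr_reduce_actions(states, follow):
    """Reduce entries only: a completed item A -> w. reduces solely on
    tokens in FOLLOW(A). Shift entries and conflict checks are omitted."""
    actions = {}
    for state in states:
        for (lhs, rhs, dot) in state:
            if dot == len(rhs):               # completed item A -> w.
                for t in follow[lhs]:
                    actions[(state, t)] = ("reduce", lhs, len(rhs))
    return actions
```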
LALR(1)
- LALR(1) is the topic discussed in [03]. Almost all of the slide deck is dedicated to explaining in detail the problems that LALR(1) tries to solve. Understanding LALR(1) is a piece of cake if you follow the author's train of thought.
- LALR(1) stands for Lookahead(1) LR(0)
- Conceptually, an LALR(1) automaton can be constructed by merging LR(1) states with equal cores (see the sketch at the end of this section). However, this is not very practical, as LR(1) automata tend to be large.
- The resulting LALR(1) automaton cannot have more shift/reduce conflicts than the original LR(1) automaton, since both automata share the same cores. However, extra reduce/reduce conflicts might arise.
- Two algorithms for more efficient construction of LALR(1) automata:
  - Lazy merging - Start constructing LR(1) states and maintain them in a worklist. As new states are created, their cores are looked up in the worklist and merging takes place.
  - The elegant way [04] - Discussed in "Simple computation of LALR(1) look-ahead sets" by Manuel E. Bermudez and George Logothetis:
    - Construct the LR(0) automaton
    - Construct an annotated grammar this way: for each item A -> .w in some state q:
      - Trace out the path w takes through the LR(0) automaton starting at q
      - Replace each non-terminal in w with a non-terminal annotated with the states connected by the edge labeled with that non-terminal
      - Replace A with a non-terminal annotated with the start and end state of the transition on A out of q
    - Compute FOLLOW sets of the resulting annotated grammar. The resulting sets are more precise than the original grammar's FOLLOW sets, as they implicitly hold context information.
    - Propagate the lookaheads - For each item A -> .w in a state q:
      - Let A' be the annotated non-terminal corresponding to A, following the transition on A out of q into some state r
      - Trace through the automaton along the path labeled by w. This will lead to a state containing an item A -> w.
      - Add the contents of FOLLOW(A') to the lookahead of A -> w.
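A sketch of the core-merging step mentioned above, with LR(1) items encoded as in the earlier LR(1) sketch. Remapping the automaton's edges onto the merged states is omitted:

```python
from collections import defaultdict

def merge_by_core(lr1_states):
    """Merge LR(1) states whose cores (items minus lookaheads) coincide."""
    merged = defaultdict(set)
    for state in lr1_states:
        core = frozenset((lhs, rhs, dot) for (lhs, rhs, dot, _la) in state)
        merged[core] |= state         # union the per-core lookaheads
    return [frozenset(s) for s in merged.values()]
```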
References
[01] https://web.stanford.edu/class/archive/cs/cs143/cs143.1128/lectures/04/Slides04B.pdf
[02] https://web.stanford.edu/class/archive/cs/cs143/cs143.1128/lectures/05/Slides05.pdf
[03] https://web.stanford.edu/class/archive/cs/cs143/cs143.1128/lectures/06/Slides06.pdf
[04] https://www.sciencedirect.com/science/article/pii/0020019089900793