# Syntax Analysis
## Specification
### Context-Free Grammars (CFG)
There are four main components of a CFG:

* Terminal Symbols
* Non-terminal Symbols
* Start Symbol $S$
* Productions

The language generated by a CFG is the set of strings of terminals obtained by repeatedly applying productions to the non-terminals; $L(G)$ denotes the language generated by the grammar $G$.

We can use CFGs to express the syntax of the target programming language: the parser checks whether the token stream coming from the lexer is valid or invalid.

**We cannot rely on regular expressions to specify the syntax:** regular expressions are not expressive enough to describe constructs such as nested parentheses.
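
For example, the language of arbitrarily nested parentheses cannot be described by a regular expression, but it has a tiny CFG (an illustrative grammar, not from the original notes; `e` denotes the empty string):

```txt
S -> (S) | e
```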
## Recognition
### Parse Tree
A tree representation of the derivation.

A parse tree has terminals at the leaves and non-terminals at the interior nodes.

An in-order traversal of the leaves yields the original input.

We can apply productions to the non-terminals in any order:

* leftmost derivation
* rightmost derivation
#### Ambiguity
A grammar $G$ is ambiguous if it produces different parse trees for the same string, depending on the order in which productions are applied.

Ambiguity should be **resolved** to construct a useful parser.
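
For instance, the classic expression grammar below (an illustrative example, not from the original notes) is ambiguous: for `num + num * num` there is one parse tree that groups `+` first and another that groups `*` first.

```txt
E -> E + E | E * E | num
```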
**Removing ambiguity**

Two common sources of ambiguity are precedence and associativity:
1. Precedence:

The productions at higher levels of the grammar should carry the operators with lower precedence (and vice versa).

We can insert non-terminals to enforce precedence, as in the sketch below.
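
As an illustrative sketch (not from the original notes), the usual fix for the ambiguous expression grammar above places `+` at the top level and `*` one level below, so `*` binds tighter than `+`:

```txt
E -> E + T | T
T -> T * F | F
F -> num | (E)
```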
2. Associativity:

We should determine where to place the recursion depending on the associativity.

* left associative: place the recursion on the left

```txt
S -> S - T | T
T -> id
```

* right associative: place the recursion on the right

```txt
S -> T ^ S | T
T -> id
```

* non-associative: do not use recursion

```txt
S -> A < A | A
A -> id
```
### AST (Abstract Syntax Tree)
An AST discards information that is no longer needed after syntax analysis: it removes the non-terminals from the parse tree.
### Error Handling
One of the purposes of the compiler is error handling:

- to detect invalid programs
- and to translate the valid ones

Therefore, the error handler should:

* report errors accurately and clearly
* recover from an error quickly
* not slow down the compilation of valid code

Why?

* Back in the day, compilers were extremely slow, so recovering from as many errors as possible in a single run mattered.
* Nowadays, recompilation is quick and users tend to correct only a few errors at once, so we do not need a complex error-recovery procedure.
## Automation
How do we generate a parse tree from a CFG?
### Top-Down Parsing
Construct a leftmost derivation of the string while reading the token stream.

e.g.

```txt
S -> E + S | E
E -> num | (S)
```

We can implement it as **recursive descent parsing**:
recursive descent parsing tries the rules in order and backtracks whenever a production does not match the upcoming tokens.
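
A minimal backtracking sketch for the grammar above (an illustration, not the parser developed later in these notes): each function tries its productions in order and returns the input position after a successful match, or `None` so the caller can backtrack.

```py
# Grammar: S -> E + S | E,  E -> num | (S)
# toks is a list of tokens such as ["num", "+", "(", "num", ")"].

def parse_S(toks, i):
    # Try S -> E + S first; if that fails, fall back to S -> E.
    j = parse_E(toks, i)
    if j is not None and j < len(toks) and toks[j] == "+":
        k = parse_S(toks, j + 1)
        if k is not None:
            return k
    return parse_E(toks, i)                  # backtrack: re-parse E from position i

def parse_E(toks, i):
    if i < len(toks) and toks[i] == "num":   # E -> num
        return i + 1
    if i < len(toks) and toks[i] == "(":     # E -> (S)
        j = parse_S(toks, i + 1)
        if j is not None and j < len(toks) and toks[j] == ")":
            return j + 1
    return None                              # no production matched

def accepts(toks):
    return parse_S(toks, 0) == len(toks)

print(accepts(["num", "+", "(", "num", "+", "num", ")"]))  # True
print(accepts(["num", "+", ")"]))                          # False
```

Note how the fallback re-parses `E` from the same position; this redundant work is exactly what predictive parsing removes.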
#### Predictive Parsing and LL(1)
But plain recursive descent needs **backtracking**, so we introduce **predictive parsing**.

Predictive parsing applies a single production without backtracking. In an LL(1) grammar, **at most a single production** can apply for each combination of non-terminal and lookahead token, which eliminates the multiple matches of backtracking top-down (recursive-descent) parsing.
### Parser Implementation
#### Recursive Descent Parser by LL(1)
```text
S -> ES'
S' -> +ES' | e
E -> num | (S)
```

We can use a table to implement the parser:

| *   | num | +    | (   | )   | $ (EOF) |
| --- | --- | ---- | --- | --- | ------- |
| S   | ES' |      | ES' |     |         |
| S'  |     | +ES' |     | e   | e       |
| E   | num |      | (S) |     |         |

```c
void parse_S() {
  switch (token) {
    case num: parse_E(); parse_S_(); return;   // S -> E S'
    case '(': parse_E(); parse_S_(); return;   // S -> E S'
    default:  ParseError();
  }
}

void parse_S_() {
  switch (token) {
    case '+': token = input.next(); parse_E(); parse_S_(); return;   // S' -> + E S'
    case ')': return;                          // S' -> e, since ')' is in Follow(S')
    case EOF: return;                          // S' -> e, since $ is in Follow(S')
    default:  ParseError();
  }
}

void parse_E() {
  switch (token) {
    case num: token = input.next(); return;    // E -> num
    case '(':                                  // E -> ( S )
      token = input.next();
      parse_S();
      if (token != ')') ParseError();
      token = input.next();
      return;
    default:  ParseError();
  }
}
```
#### Parsing Tables
How do we construct the parsing table? There are three key properties of the grammar used to generate it:

* $X$ is nullable if it can derive the empty string.
* $\text{First}(X)$ is the set of terminals that can appear in the first position of a string derived from $X$.
* $\text{Follow}(X)$ is the set of terminals that can appear immediately after $X$ in at least one derivation.

Computing Nullable

1. Easy: $X$ is nullable if some production $X \to A_1 \dots A_n$ has every $A_i$ nullable (in particular, if $X$ has an $\epsilon$-production).

Computing First

1. $\text{First}(t) = \{t\}$ for every terminal $t$.
2. For a production $X \to A_1 A_2 \dots A_n$, if $A_1 \dots A_{i-1}$ are all nullable, then $\text{First}(X) += \text{First}(A_{i})$.

Computing Follow

1. $\text{Follow}(S) = \{\$\}$, where $S$ is the start symbol.
2. For a production $X \to A_1 A_2 \dots A_n$, if $A_{i+1} \dots A_{n}$ are all nullable, then $\text{Follow}(A_{i}) += \text{Follow}(X)$.
3. For a production $X \to A_1 A_2 \dots A_n$, if $A_{i+1} \dots A_{j-1}$ are all nullable, then $\text{Follow}(A_{i}) += \text{First}(A_{j})$.

Combining them together:
```py
S = symbols.start
for x in symbols:
    First(x) = {}; Follow(x) = {}; Nullable(x) = False
Follow(S) = {$}
for t in terminals:
    First(t) = {t}
Nullable(e) = True                       # e is the empty string

# repeat until a full pass changes no First/Follow/Nullable set (fixed point)
while First.is_changed or Follow.is_changed or Nullable.is_changed:
    for X, A in productions:             # X -> A[1] A[2] ... A[n], for all valid i and j > i
        if A[1..n] are all Nullable:    Nullable(X) = True
        if A[1..i-1] are all Nullable:  First(X) += First(A[i])
        if A[i+1..n] are all Nullable:  Follow(A[i]) += Follow(X)
        if A[i+1..j-1] are all Nullable: Follow(A[i]) += First(A[j])
```
We can then combine the `First`, `Follow`, and `Nullable` tables into the actual **parsing table**: for each production $X \to \alpha$, add the production to entry $(X, t)$ for every $t \in \text{First}(\alpha)$, and, if $\alpha$ is nullable, for every $t \in \text{Follow}(X)$.
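
Below is a self-contained sketch (not the author's code) that computes `Nullable`, `First`, and `Follow` for the running example and combines them into the LL(1) table shown above; `S1` stands for `S'` and `"$"` for EOF.

```py
# Grammar: S -> E S',   S' -> + E S' | e,   E -> num | (S)
GRAMMAR = {
    "S":  [["E", "S1"]],
    "S1": [["+", "E", "S1"], []],        # [] is the e (empty) production
    "E":  [["num"], ["(", "S", ")"]],
}
NONTERMS = set(GRAMMAR)
TERMS = {"num", "+", "(", ")", "$"}

nullable = {x: False for x in NONTERMS}
first = {t: {t} for t in TERMS}          # First(t) = {t} for terminals
for x in NONTERMS:
    first[x] = set()
follow = {x: set() for x in NONTERMS}
follow["S"] = {"$"}                      # S is the start symbol

def null(sym):
    # a symbol counts as nullable only if it is a non-terminal currently marked nullable
    return sym in NONTERMS and nullable[sym]

changed = True
while changed:                           # iterate to a fixed point
    changed = False
    for X, alts in GRAMMAR.items():
        for A in alts:
            if not nullable[X] and all(null(s) for s in A):
                nullable[X] = changed = True
            for i, Ai in enumerate(A):
                if all(null(s) for s in A[:i]) and not first[Ai] <= first[X]:
                    first[X] |= first[Ai]; changed = True
                if Ai in NONTERMS:
                    if all(null(s) for s in A[i + 1:]) and not follow[X] <= follow[Ai]:
                        follow[Ai] |= follow[X]; changed = True
                    for j in range(i + 1, len(A)):
                        if all(null(s) for s in A[i + 1:j]) and not first[A[j]] <= follow[Ai]:
                            follow[Ai] |= first[A[j]]; changed = True

# Combine: production X -> A goes into entry (X, t) for every t in First(A),
# and, if A is nullable, also for every t in Follow(X).
table = {x: {} for x in NONTERMS}
for X, alts in GRAMMAR.items():
    for A in alts:
        look, all_nullable = set(), True
        for s in A:
            look |= first[s]
            if not null(s):
                all_nullable = False
                break
        if all_nullable:
            look |= follow[X]
        for t in look:
            table[X][t] = A

print(table["S1"])   # maps '+' to ['+', 'E', 'S1'] and ')' / '$' to [] (the e-production)
```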
### Bottom-Up Parsing
Bottom-up parsing is more powerful than top-down parsing (it handles a larger class of grammars) by using **LR grammars** (**L**eft-to-right scanning of the input, **R**ightmost derivation in reverse).
It relies on **shift-reduce parsers**.

Shift-reduce parsing:

A bottom-up parser traces a rightmost derivation in reverse, so it scans the input terminals from left to right.
The parser splits the string into two parts: the sequence of symbols to reduce (the **stack**) and the remaining tokens (the **input**).
Shift-reduce parsing uses two actions: **shift** (push the next input token onto the stack) and **reduce** (replace a production's right-hand side on top of the stack with its left-hand side).
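
As a small worked illustration (using a hypothetical grammar `S -> S + E | E`, `E -> num`, not taken from these notes), parsing `num + num` performs the steps below; read from bottom to top, the stack contents trace the rightmost derivation `S => S + E => S + num => E + num => num + num` in reverse:

```txt
Stack      | Input        | Action
           | num + num $  | shift
num        | + num $      | reduce E -> num
E          | + num $      | reduce S -> E
S          | + num $      | shift
S +        | num $        | shift
S + num    | $            | reduce E -> num
S + E      | $            | reduce S -> S + E
S          | $            | accept
```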
The key challenge is the **action selection problem**:
there can be conflicts, i.e. for a given state (stack + input) more than one action may be possible:

* shift-reduce conflict
* reduce-reduce conflict
### LR Grammars
* LR(k): left-to-right scanning, rightmost derivation, and $k$ symbols of lookahead
* LR(0) Grammar

An LR(0) grammar can determine every action without any lookahead.
There are **no shift-reduce or reduce-reduce conflicts**, because each action is determined by the stack alone.
### NFA Representations
We can represent shift-reduce parsing with an **NFA** whose states are productions with a separator '`.`' somewhere in the RHS (items). We also add a dummy production `S' -> S$` to provide the start and accepting states.

There are two types of transitions between the states:

* **shift transition**: the transition taken by a shift action; the dot moves past the next symbol.
* **$\epsilon$ transition**: the parser expands what it expects to see without consuming any input token (from an item whose dot is just before a non-terminal, there is an $\epsilon$ transition to every item for a production of that non-terminal with the dot at the start). See the item listing after this list.
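
As an illustrative listing (not from the original notes), the items generated from the production `S -> E + S` of the earlier grammar are shown below; shift transitions move the dot one symbol to the right (e.g. from `S -> E .+ S` to `S -> E + .S` on `+`), and from an item whose dot precedes `E`, such as `S -> .E + S`, there are $\epsilon$ transitions to `E -> .num` and `E -> .(S)`.

```txt
S -> .E + S
S -> E .+ S
S -> E + .S
S -> E + S.
```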
#### NFA to DFA
The NFA can be converted to a DFA. Using the DFA, the parser decides whether to shift or reduce: the symbols on the stack are used to **traverse the states**, and the **destination state determines whether to shift or reduce**.
### Parsing Table And LR(0)
The DFA traversal is implemented by compiling the DFA into an **LR parsing table**.

The stack stores each state along with its symbol: pairs <`symbol`, `state`>.

There are two different tables: `goto` and `action`.

* `goto`: determines the next state from the state on top of the stack and a **non-terminal**.
* `action`: determines the action from the state on top of the stack and an input terminal.

The table entries consist of four different actions:

* `shift x`: push <`a`, `x`> onto the stack (`a` is the current input terminal and `x` is a state)
* `reduce X -> a`: pop `a` (the production's right-hand side) from the stack and push <`X`, `goto[curr_state, X]`>
* `accept` (on the item `S' -> S$.`) / `error`

The DFA states become the row indices of the table; a sketch of the driver loop follows below.
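
A minimal, self-contained sketch (not from these notes) of the table-driven driver loop, using hypothetical hand-built LR(0) tables for the toy grammar `S' -> S $`, `S -> ( S ) | x`; for recognition it is enough to push only the states, so the <`symbol`, `state`> pairs are represented by their states here.

```py
# ("s", n) = shift and go to state n; ("r", X, k) = reduce a production X with k RHS symbols; "acc" = accept.
ACTION = {
    0: {"(": ("s", 2), "x": ("s", 3)},
    1: {"$": "acc"},
    2: {"(": ("s", 2), "x": ("s", 3)},
    3: {t: ("r", "S", 1) for t in "()x$"},   # S -> x      (LR(0): reduce on every lookahead)
    4: {")": ("s", 5)},
    5: {t: ("r", "S", 3) for t in "()x$"},   # S -> ( S )
}
GOTO = {0: {"S": 1}, 2: {"S": 4}}

def parse(tokens):
    stack = [0]                              # stack of DFA states
    toks = list(tokens) + ["$"]
    i = 0
    while True:
        act = ACTION[stack[-1]].get(toks[i])
        if act is None:                      # empty entry: error
            return False
        if act == "acc":
            return True
        if act[0] == "s":                    # shift: consume the token, push the new state
            stack.append(act[1]); i += 1
        else:                                # reduce X -> A1..Ak: pop k states, then goto on X
            _, lhs, k = act
            del stack[len(stack) - k:]
            stack.append(GOTO[stack[-1]][lhs])

print(parse("((x))"))   # True
print(parse("(x"))      # False
```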
### SLR(1) Parsing
SLR(1) is a simple extension of LR(0).

For each reduction `X -> b`, look at the next input symbol `c` and apply the reduction only if `c` is in `Follow(X)`. In the table, this means a `reduce X -> b` entry is placed only in the `action` columns for terminals in `Follow(X)`, instead of in every column as in LR(0).