
Syntax Analysis

Specification

Context-Free Grammars (CFG)

There are four main components of a CFG:

  • Terminal Symbols
  • Non-terminal Symbols
  • Start Symbol S
  • Productions

The language generated by a CFG is the set of terminal strings obtained by repeatedly applying productions to the non-terminals; L(G) denotes the language generated by the grammar G.

We can use CFGs to express the syntax of the target programming language: the parser decides whether the token stream from the lexer is valid or invalid.

We cannot rely on regular expressions to specify the syntax, because they are not expressive enough to describe valid syntax (e.g., nested parentheses).
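For instance, balanced parentheses require unbounded counting, which a finite automaton cannot do; a sketch of a checker (illustrative Python, not from the notes):

```python
def balanced(s: str) -> bool:
    """Recognize the non-regular language of balanced parentheses."""
    depth = 0
    for ch in s:
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth -= 1
            if depth < 0:  # a ')' with no matching '(' so far
                return False
        else:
            return False   # only parentheses are allowed here
    return depth == 0
```

A regex can only approximate this up to a fixed nesting depth; a CFG (or the counter above) handles arbitrary depth.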

Recognition

Parse Tree

A tree representation of the derivation.

A parse tree has terminals at the leaves and non-terminals at the interior nodes.

An in-order traversal of the leaves is the original input.

We can apply productions to the non-terminals in any order:

  • leftmost derivation
  • rightmost derivation

Ambiguity

A grammar G is ambiguous if some string has more than one parse tree (i.e., more than one leftmost or rightmost derivation).

Ambiguity should be resolved to construct a useful parser.

Removing Ambiguity

Two common sources of ambiguity and how to resolve them:

  1. Precedence: The productions at higher levels of the grammar get operators with lower precedence (and vice versa). We can insert extra non-terminals to enforce precedence.

  2. Associativity: We should determine where to place recursion depending on the associativity.

    • left associative: place the recursion on the left
    S -> S - T | T
    T -> id
    
    • right associative: place the recursion on the right
    S -> T ^ S | T
    T -> id
    
    • non associative: do not use recursion
    S -> A < A | A
    A -> id
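
For example, with the left-recursive grammar S -> S - T | T, the string a - b - c groups as (a - b) - c, while the right-recursive S -> T ^ S | T groups a ^ b ^ c as a ^ (b ^ c). Precedence and associativity combine in the classic expression grammar (a standard textbook example, not from the notes above):

E -> E + T | T
T -> T * F | F
F -> id | (E)

Here + and * are left-associative, and * binds tighter because it sits lower in the grammar.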
    

AST (Abstract Syntax Tree)

The AST discards information that is no longer needed after syntax analysis: it removes redundant non-terminals and punctuation from the parse tree.

Error Handling

One of the purposes of the compiler is error handling.

  • to detect non-valid programs
  • to translate the valid ones

Therefore, the error handler should:

  • report errors accurately and clearly
  • recover quickly from an error
  • not slow down compilation of valid code

Why?

  • Back in the day, compilers were extremely slow, so elaborate error recovery mattered.
  • Nowadays, we do not need a complex error-handling procedure:
    • recompilation is quick
    • users tend to correct only a few errors at once

Automation

How to generate parse tree from CFG?

Top-Down Parsing

Construct a leftmost derivation of string while reading the token stream. e.g.

S -> E + S | E
E -> num | (S)

We can implement it as recursive descent parsing: try the rules in order and backtrack whenever a production fails to match the token stream.
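A minimal sketch of backtracking recursive descent for this grammar in Python (token lists and function names are illustrative):

```python
def parse_S(tokens, i):
    """S -> E + S | E. Return the index after the match, or None."""
    j = parse_E(tokens, i)
    if j is not None and j < len(tokens) and tokens[j] == '+':
        k = parse_S(tokens, j + 1)   # try S -> E + S first
        if k is not None:
            return k
    return parse_E(tokens, i)        # backtrack: retry with S -> E

def parse_E(tokens, i):
    """E -> num | ( S )"""
    if i < len(tokens) and tokens[i] == 'num':
        return i + 1
    if i < len(tokens) and tokens[i] == '(':
        j = parse_S(tokens, i + 1)
        if j is not None and j < len(tokens) and tokens[j] == ')':
            return j + 1
    return None

def accepts(tokens):
    return parse_S(tokens, 0) == len(tokens)
```

The fallback call in parse_S is the backtracking step: when the first alternative fails, parsing restarts from the same position with the next rule.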

Predictive Parsing and LL(1)

But it needs backtracking, so we introduce predictive parsing.

Predictive parsing applies a single production without backtracking. In an LL(1) grammar, at most a single production applies for each (non-terminal, lookahead) pair, which eliminates the multiple matches of backtracking top-down parsing (recursive descent).

Parser Implementation

Recursive Descent Parser by LL(1)

S -> ES'
S' -> +ES' | e
E -> num | (S)

We can use a table to implement the parser (rows: non-terminals, columns: lookahead tokens; blank cells are errors):

        num      +        (        )    $(EOF)
S       E S'              E S'
S'               + E S'            e    e
E       num               ( S )

void parse_S() {
    switch(token) {
        case num: parse_E(); parse_S_(); return;
        case '(': parse_E(); parse_S_(); return;
        default: ParseError();
    }
}
void parse_S_() {
    switch(token) {
        case '+': token = input.next(); parse_E(); parse_S_(); return;
        case ')': return;
        case EOF: return;
        default: ParseError();
    }
}

void parse_E() {
    switch(token) {
        case num: token = input.next(); return;
        case '(': token = input.next(); parse_S(); if(token != ')') ParseError(); token = input.next(); return;
        default: ParseError();
    }
}
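The same LL(1) parser can be sketched in runnable Python (the token stream is a plain list here, and '$' marks EOF):

```python
class LL1Parser:
    def __init__(self, tokens):
        self.tokens = tokens + ['$']
        self.pos = 0

    @property
    def token(self):
        return self.tokens[self.pos]

    def next(self):
        self.pos += 1

    def parse_S(self):           # S -> E S'
        if self.token in ('num', '('):
            self.parse_E()
            self.parse_S_()
        else:
            raise SyntaxError(self.token)

    def parse_S_(self):          # S' -> + E S' | e
        if self.token == '+':
            self.next()
            self.parse_E()
            self.parse_S_()
        elif self.token in (')', '$'):
            return               # apply S' -> e
        else:
            raise SyntaxError(self.token)

    def parse_E(self):           # E -> num | ( S )
        if self.token == 'num':
            self.next()
        elif self.token == '(':
            self.next()
            self.parse_S()
            if self.token != ')':
                raise SyntaxError(self.token)
            self.next()
        else:
            raise SyntaxError(self.token)

def valid(tokens):
    p = LL1Parser(tokens)
    try:
        p.parse_S()
        return p.token == '$'    # all input must be consumed
    except SyntaxError:
        return False
```

Each case in the switch corresponds to one cell of the LL(1) table, so the parser never backtracks.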

Parsing Tables

How do we construct a parsing table? Three properties are needed to generate parsing tables.

  • x is nullable if it can derive the empty string.
  • \text{First}(x) is the set of terminals that can be derived in the first position of x.
  • \text{Follow}(x) is the set of terminals that can appear immediately after x in at least one derivation.

Computing Nullable

  1. A terminal is never nullable; a non-terminal x is nullable if some production x \to A_1 \dots A_n has every A_i nullable (in particular, if x \to \epsilon exists).

Computing First

  1. \text{First}(t) = \set{t}
  2. For a production x \to A_1 A_2 \dots A_n where A_1 \dots A_{i-1} are nullable, then \text{First}(x) += \text{First}(A_{i})

Computing Follow

  1. \text{Follow}(S) = \set{ \$ }, (S is the start symbol)
  2. For a production x \to A_1 A_2 \dots A_n where A_{i+1}\dots A_{n} are nullable, then \text{Follow}(A_{i}) += \text{Follow}(x)
  3. For a production x \to A_1 A_2 \dots A_n where A_{i+1}A_{i+2}\dots A_{j-1} are nullable, then \text{Follow}(A_{i}) += \text{First}(A_{j})

So Combine Them Together:

S = symbols.start
for x in symbols:
    First(x) = {}; Follow(x) = {}; Nullable(x) = false
Follow(S) = {$}
for t in terminals:
    First(t) = {t}
Nullable(e) = true
while First, Follow, or Nullable changed in the last pass:
    for X -> A[1..n] in productions:
        if all of A[1..n] are Nullable: Nullable(X) = true
        if A[1..i-1] are Nullable: First(X) += First(A[i])
        if A[i+1..n] are Nullable: Follow(A[i]) += Follow(X)
        if A[i+1..j-1] are Nullable: Follow(A[i]) += First(A[j])

We can combine the First, Follow, and Nullable tables to build the actual parsing table.
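The fixpoint loop above can be sketched as runnable Python (the grammar encoding and variable names are my own, not from the notes):

```python
# Fixpoint computation of Nullable, First, Follow for the LL(1) grammar
# S -> E S' ; S' -> + E S' | e ; E -> num | ( S ).
# Keys of `grammar` are non-terminals; every other symbol is a terminal.
grammar = {
    'S':  [['E', "S'"]],
    "S'": [['+', 'E', "S'"], []],   # [] encodes the empty production
    'E':  [['num'], ['(', 'S', ')']],
}
nonterms = set(grammar)
terms = {a for rhss in grammar.values() for rhs in rhss
         for a in rhs if a not in nonterms}

nullable = {x: False for x in nonterms}
first = {t: {t} for t in terms} | {x: set() for x in nonterms}
follow = {x: set() for x in nonterms}
follow['S'].add('$')                # S is the start symbol

def sizes():
    # Total size of all sets: grows monotonically, so it detects change.
    return (sum(nullable.values()),
            sum(len(s) for s in first.values()),
            sum(len(s) for s in follow.values()))

while True:
    before = sizes()
    for x, rhss in grammar.items():
        for rhs in rhss:
            if all(nullable.get(a, False) for a in rhs):
                nullable[x] = True
            for i, a in enumerate(rhs):
                if all(nullable.get(b, False) for b in rhs[:i]):
                    first[x] |= first[a]
                if a in nonterms:
                    if all(nullable.get(b, False) for b in rhs[i + 1:]):
                        follow[a] |= follow[x]
                    for j in range(i + 1, len(rhs)):
                        if all(nullable.get(b, False) for b in rhs[i + 1:j]):
                            follow[a] |= first[rhs[j]]
    if sizes() == before:           # no set changed: fixpoint reached
        break
```

For this grammar the loop yields First(S) = {num, (}, First(S') = {+}, and Follow(E) = {+, ), $}, matching the entries in the table earlier.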

Bottom-Up Parsing

Bottom-up parsing is more powerful than top-down parsing: it uses LR grammars (Left-to-right scanning, Rightmost derivation), and in particular handles left-recursive grammars. It relies on shift-reduce parsers.

Shift-Reducing Parsing:

A bottom-up parser traces a rightmost derivation in reverse while scanning the input terminals left to right. The parser splits the string into two parts: a sequence of symbols to reduce (the stack) and the remaining tokens (the input). Shift-reduce parsing uses two actions: Shift and Reduce.
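These two actions can be demonstrated with a toy shift-reduce loop for the left-recursive grammar S -> S + num | num (an illustrative sketch; a real parser drives shift/reduce decisions from a parsing table):

```python
def shift_reduce(tokens):
    """Return the list of actions taken, or None on a parse error."""
    stack, actions = [], []
    tokens = tokens + ['$']
    i = 0
    while True:
        # Reduce greedily when the top of the stack matches a RHS.
        if stack[-3:] == ['S', '+', 'num']:
            stack[-3:] = ['S']              # reduce S -> S + num
            actions.append('reduce S -> S + num')
        elif stack[-1:] == ['num']:
            stack[-1:] = ['S']              # reduce S -> num
            actions.append('reduce S -> num')
        elif tokens[i] == '$':
            return actions if stack == ['S'] else None
        else:
            stack.append(tokens[i])         # shift one input token
            actions.append('shift ' + tokens[i])
            i += 1
```

Reading the recorded actions bottom-up (last reduce first) reproduces the rightmost derivation in reverse.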

The important challenge is the "action selection problem": for a given state (stack + input), there can be multiple possible actions, i.e., conflicts:

  • shift-reduce conflict
  • reduce-reduce conflict

LR Grammars

  • LR(k): left-to-right scanning, right most derivation and k symbol lookahead
  • LR(0): grammars whose actions can be determined without any lookahead. There are no reduce-reduce or shift-reduce conflicts, because every action is determined by the stack alone.

NFA Representations

We can represent shift-reduce parsing using an NFA whose states are productions with a separator '.' somewhere in the RHS (items). We also add a dummy production S' -> S$ to provide the start and accept states. There are two types of transitions between the states:

  • shift transition: move the dot one symbol to the right on a shift action.
  • \epsilon transition: the parser expands the list of expected productions without consuming any input token (from an item whose dot is just before a non-terminal to the items of that non-terminal's productions).

NFA to DFA

The NFA can be converted to an equivalent DFA. Using the DFA, the parser determines whether to shift or reduce: it traverses the states using the symbols on the stack and decides the action from the destination state.

Parsing Table And LR(0)

DFA Traversal is implemented by simplifying DFA to LR Parsing Table.

Store the states along with the symbols in the stack: <symbol, state>

There are two different types of tables: goto, action.

  • goto: determine the next state using the top state and an input non-terminal.
  • action: determine the action using the top state and an input terminal.

And table consists of four different actions:

  • shift x: push <a, x> on the stack (a is the current input token and x is a state)
  • reduce X -> a: pop |a| entries from the stack and push <X, goto[top_state, X]>
  • accept (on S' -> S$.) / Error

Each DFA state is mapped to a row index of the table.

But there is a limitation when multiple entries compete for the same parsing-table cell (a conflict), which should be resolved with lookahead.

SLR(1) Parsing

A simple extension of LR(0).

For each reduction X -> b, look at the next input symbol c and apply the reduction only if c is in Follow(X); c is the one-symbol lookahead.

LR(1) Parsing

LR(1) uses lookahead more precisely. Its states are items of the form X -> a.b,c, which means:

  1. a has already been matched at the top of the stack
  2. we next expect to see b, followed by c

Also, X -> a.b,{x1,x2,...,xn} abbreviates the items X -> a.b,xi for all i in {1,...,n}.

We extend the $\epsilon$-closure and goto operation.

LR(1) closure construction:

  • start with Closure(S) = S
  • for each item [X -> a.Bb,c] in S:
    • add [B -> .y,d] for every d in First(bc)
  • initialize the start state with [S' -> .S,$]

LR(1) goto: Given an Item in the state I: [X -> a.Bb,c], Goto/Shift(I, B) = Closure([X -> aB.b,c])

The LR(1) parsing table is the same as the LR(0) table except for reductions.

LALR(1) Parsing

LR(1) has too many states, which motivates LALR(1) parsing.

LALR(1) stands for LookAhead LR: construct the LR(1) DFA and merge any two LR(1) states whose items have the same production rules but different lookaheads. This reduces the number of parser table entries, but is theoretically less powerful than LR(1).

LALR(1) generally has about the same number of states as SLR(1), and far fewer than LR(1). But we will not dive into the details of LALR(1).

LL/LR Grammars

  1. LL Parsing Tables
    • Table[NT, T] = Production to apply
    • Compute using First, Follow.
  2. LR Parsing Tables
    • Table[LR State, Term] = shift/reduce/error/accept
    • Table[LR State, NT] = goto/err
    • Computed using closure and goto operations on LR states

Automatic Disambiguation

It is highly complex to write unambiguous grammars by hand (precedence, associativity). Instead, we can use an ambiguous grammar without shift-reduce conflicts by declaring precedence between the terminal on top of the stack and the terminal in the input.

AST Data Structure

LL/LR parsing implicitly builds the AST:

  • LL parsing: the AST is built from the productions applied
  • LR parsing: the AST is built from the reductions applied

AST Construction in LL

expr parse_S() {
    switch (token) {
        case num:
        case '(':
            expr child1 = parse_E();
            expr child2 = parse_S_();
            return new S(child1, child2);
        default: ParseError();
    }
}

AST Construction in LR

Construction mechanism:

  • Store parts of the tree on the stack
  • for each non-terminal X on the stack, store the sub-tree rooted at X
  • After reduce operation for a production X -> a, create an AST node for X
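A minimal sketch of this mechanism in Python, for the toy grammar S -> S + num | num (tuple-shaped AST nodes are my own choice):

```python
def lr_ast(tokens):
    """Shift-reduce parse of S -> S + num | num, building tuple ASTs."""
    stack = []   # holds (symbol, subtree) pairs
    tokens = tokens + ['$']
    i = 0
    while True:
        syms = [s for s, _ in stack]
        if syms[-3:] == ['S', '+', 'num']:
            (_, left), _, (_, right) = stack[-3:]
            stack[-3:] = [('S', ('+', left, right))]   # reduce: build node
        elif syms[-1:] == ['num']:
            _, leaf = stack[-1]
            stack[-1:] = [('S', leaf)]                 # reduce: wrap leaf
        elif tokens[i] == '$':
            return stack[0][1] if syms == ['S'] else None
        else:
            stack.append((tokens[i], tokens[i]))       # shift token as leaf
            i += 1
```

Each reduce pops the subtrees of the RHS symbols and pushes a single new node for the LHS, so the finished AST sits on the stack when the parse accepts; note how the left-recursive grammar yields left-associative trees.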