1. Understanding Non-Deterministic CFGs
1.1 What is a Non-Deterministic Context-Free Grammar?
A Context-Free Grammar (CFG) is non-deterministic if a parser cannot uniquely decide which production rule to apply based on the current input and the grammar state.
In simple terms, the parser stands at a fork with multiple roads and no clear signpost — it doesn’t know which rule will lead to a successful parse without trying all options.
1.1.1 Real-World Analogy
Imagine a voice assistant trying to interpret "Book a table". Should it book a restaurant table or reserve a spot at a library? Without additional context (lookahead), it's non-deterministic.
1.2 Why is it Needed?
Languages like natural languages or ambiguous programming syntax often require expressive grammars that cannot always be made deterministic.
- Compact Design: Non-deterministic grammars can describe complex syntax with fewer rules.
- Real Language Support: Some constructs (like the "dangling else") naturally require non-determinism for complete expression.
1.3 Deterministic vs Non-Deterministic CFGs
Aspect | Deterministic CFG | Non-Deterministic CFG |
---|---|---|
Parsing Decision | Single production chosen based on input | Multiple productions may apply |
Parsing Speed | Fast (linear) | Slow (may require backtracking or lookahead) |
Examples | LL(1), LR(1) grammars | Ambiguous grammars, natural languages |
Parser Implementation | Simple and efficient | Complex, needs more logic or power |
2. Identifying and Handling Non-Determinism
2.1 Sources of Non-Determinism
- Ambiguity: More than one parse tree for a string
- Common Prefix: Multiple rules start with the same symbols (left common)
- Left Recursion: Rules like A → Aα cause infinite recursion in top-down parsers
2.2 Eliminating Ambiguity (when possible)
Ambiguity cannot be completely eliminated from all CFGs. However, rewriting helps in special cases:
2.2.1 Example: Dangling Else Ambiguity
stmt → if expr then stmt
| if expr then stmt else stmt
| other
This grammar is ambiguous. Solution:
stmt → matched_stmt | open_stmt
matched_stmt → if expr then matched_stmt else matched_stmt | other
open_stmt → if expr then stmt | if expr then matched_stmt else open_stmt
This ensures every else
pairs with the nearest unmatched then
:contentReference[oaicite:0]{index=0}.
2.3 Left Factoring
Used to delay decision-making until enough input is seen.
stmt → if expr then stmt
| if expr then stmt else stmt
Left-factored:
stmt → if expr then stmt stmt'
stmt' → else stmt | ε
This transformation enables predictive parsing:contentReference[oaicite:1]{index=1}.
2.4 Eliminating Left Recursion
Recursive descent parsers fail on left recursion. General transformation:
2.4.1 Rule Transformation
Given:
A → Aα | β
Transform to:
A → βA'
A' → αA' | ε
2.4.2 Algorithm
Algorithm 4.19 handles indirect and direct left recursion:
1. Order nonterminals A₁, A₂, ..., Aₙ
2. for i = 1 to n
for j = 1 to i-1
Replace Aᵢ → Aⱼγ with Aᵢ → δγ for all Aⱼ → δ
Eliminate immediate left recursion on Aᵢ
2.4.3 Example
Given:
S → Aa | b
A → Ac | Sd | ε
Final grammar after removing left recursion:
S → Aa | b
A → bd A' | A'
A' → c A' | ad A' | ε
Now it's ready for top-down parsing:contentReference[oaicite:2]{index=2}.
3. Step-by-Step Left Factoring
3.1 Purpose of Left Factoring
Left factoring is a transformation applied to grammars to make them suitable for predictive parsing. When two or more productions for a non-terminal begin with the same symbols, we delay decision-making by factoring the common prefix out.
3.2 Left Factoring Algorithm (Book Algorithm 4.21)
For a non-terminal A with productions:
A → αβ₁ | αβ₂ | ... | αβₙ | γ
Where α
is the longest common prefix, the left-factored grammar becomes:
A → αA'
A' → β₁ | β₂ | ... | βₙ
A → γ (if γ doesn't start with α)
Repeat this until no two alternatives for a non-terminal share a prefix:contentReference[oaicite:0]{index=0}.
3.3 Example 1: Dangling-Else Grammar
S → if E then S
| if E then S else S
| a
E → b
Step-by-step left factoring:
- Find common prefix:
if E then S
- Factor it out into a new non-terminal
S'
S → if E then S S' | a
S' → else S | ε
E → b
This form defers the decision until after if E then S
is processed:contentReference[oaicite:1]{index=1}.
3.4 Example 2: Abstract Grammar for Regular Expressions
rexpr → rexpr + rterm
| rterm
rterm → rterm rfactor
| rfactor
rfactor → rfactor *
| rprimary
rprimary → a
| b
Step-by-step Left Factoring for rexpr
- Common prefix:
rexpr
- Rewrite:
rexpr → rterm rexpr'
rexpr' → + rterm rexpr' | ε
For rterm
and rfactor
rterm → rfactor rterm'
rterm' → rfactor rterm' | ε
rfactor → rprimary rfactor'
rfactor' → * rfactor' | ε
Final Left-Factored Grammar
rexpr → rterm rexpr'
rexpr' → + rterm rexpr' | ε
rterm → rfactor rterm'
rterm' → rfactor rterm' | ε
rfactor → rprimary rfactor'
rfactor' → * rfactor' | ε
rprimary → a | b
This grammar is now suitable for top-down parsing (e.g., recursive descent):contentReference[oaicite:2]{index=2}.
4. Practical Applications
4.1 Top-Down Parsing and Backtracking
Top-down parsers build parse trees from the root downward using recursive function calls or parsing tables. The simplest form is the recursive-descent parser.
- With Backtracking: Tries all productions for a non-terminal in order. If parsing fails, it backtracks and tries the next one.
- Without Backtracking (Predictive Parsing): Uses FIRST and FOLLOW sets to pick the correct production immediately. LL(1) grammars support this:contentReference[oaicite:0]{index=0}.
Example (Recursive-Descent Parser):
void A() {
save input pointer;
for each production A → α:
if match(α) then return success;
else reset input pointer;
}
Backtracking is rarely used in compilers due to inefficiency. It may require scanning input multiple times:contentReference[oaicite:1]{index=1}.
4.2 Bottom-Up Parsing (Shift-Reduce)
Bottom-up parsing works in reverse: it builds the parse tree from leaves (tokens) up to the root (start symbol).
Shift-Reduce Parser Actions:
- Shift: Push the next input symbol onto the stack.
- Reduce: Replace a sequence of symbols on the stack with the left-hand side of a production.
- Accept: When the start symbol is on the stack and input is empty.
- Error: If no valid action exists for the stack/input configuration.
Example: Parsing id * id
STACK INPUT ACTION
$ id * id $ shift
$id * id $ reduce F → id
$F * id $ reduce T → F
$T * id $ shift
$T * id $ shift
$T * id $ reduce F → id
$T * F $ reduce T → T * F
$T $ reduce E → T
$E $ accept
It uses a stack and handles are detected and reduced one at a time:contentReference[oaicite:2]{index=2}.
4.3 Tools to Deal with Non-Determinism
Real-world compilers use parser generators to handle non-deterministic CFGs efficiently.
4.3.1 Predictive Parsers (LL)
- Tools: Hand-written or generated using
ANTLR
,JavaCC
. - Requirements: Grammar must be LL(1); needs FIRST/FOLLOW sets and left-factoring, left-recursion removal.
4.3.2 Shift-Reduce Parsers (LR Family)
- Tools:
Yacc
,Bison
,Menhir
,CUP
- Variants: SLR, Canonical LR, LALR
- LALR Parsers: Used in real compilers; compact like SLR, accurate like canonical LR
4.3.3 Dynamic Programming Parsers
- Earley Parser: Handles all CFGs, even ambiguous ones; suitable for natural languages
- CYK Parser: Based on table-filling; cubic time but powerful
These tools manage ambiguity and non-determinism by using lookahead, parse tables, and automaton states, turning impractical grammars into efficiently parsed constructs:contentReference[oaicite:3]{index=3}.