1. Understanding Deterministic CFG
A Deterministic Context-Free Grammar (DCFG) is a context-free grammar that allows parsing without any ambiguity or backtracking. It is the foundation for constructing predictive parsers (like LL(1)) and shift-reduce parsers (like LR(1)), where a single correct parse action exists for each input symbol and parser state.
1.1 What is a Deterministic CFG?
In a deterministic CFG, for any given input symbol and parsing context, the parser should be able to decide the next step uniquely. This contrasts with a general CFG, where multiple choices might exist, requiring backtracking or lookahead.
- Key Property: No ambiguity + unique parsing decision at each step
- Used in: Predictive parsing (LL), Shift-reduce parsing (LR)
For example, consider a grammar:
S → a A | a B
A → b
B → c
This is non-deterministic because after seeing 'a', the parser can't decide whether to choose a A
or a B
without further lookahead.
To make it deterministic, we can left-factor it:
S → a S'
S' → b | c
Now, after reading 'a', we move to a state where the next token will determine whether to go to 'b' or 'c'. This is LL(1) deterministic.
1.2 Why is Determinism Needed?
Real-world compilers must process code in linear time with high efficiency. Determinism ensures:
- No backtracking: Improves speed and predictability
- Error detection: Easier and faster to identify invalid input
- Parser generation: Tools like
YACC
,Bison
, andANTLR
require deterministic CFGs
1.2.1 Real-world Analogy
Imagine an ATM menu:
1. Withdraw
2. Deposit
3. Balance
You press "1", and the ATM takes you to "Withdraw". There's no ambiguity. If the ATM gave two different options after pressing "1", it would confuse the user. Similarly, a deterministic parser should always know the next step.
1.2.2 When is it a Problem?
Consider this CFG used in natural languages:
S → if E then S | if E then S else S | other
This is known as the "dangling else" problem. The ambiguity comes from not knowing which if
the else
belongs to. Real programming languages like C resolve this by associating else
with the nearest if
.
1.2.3 Real Compiler Impact
Without determinism:
- Parsing becomes O(n²) or worse
- Compiler becomes slow and buggy
- Tooling (like syntax highlighting, error messages) becomes unreliable
2. Eliminate Ambiguity
Ambiguity occurs when a grammar generates a string that can have more than one parse tree. This is a fundamental problem in parsing because it leads to confusion in semantic interpretation.
2.1 What is Ambiguity?
If a string derived from a grammar has two or more different leftmost derivations (or parse trees), the grammar is ambiguous.
Example:
E → E + E | E * E | id
String: id + id * id
This string has two parse trees:
- One where
+
is applied after*
: (id + (id * id)) - One where
*
is applied after+
: ((id + id) * id)
Both are valid derivations, but they imply different meanings. Compilers must produce exactly one interpretation — ambiguity breaks this requirement.
2.2 Why Eliminate Ambiguity?
In programming languages, ambiguity can cause:
- Incorrect code generation: Different parse trees = different instructions
- Unpredictable behavior: Compiler may pick arbitrary interpretation
- Parsing failures: Deterministic parsers (LL/LR) cannot handle ambiguity
2.3 Strategies to Eliminate Ambiguity
2.3.1 Enforce Operator Precedence and Associativity
To fix the arithmetic expression grammar, we split it by precedence:
E → E + T | T
T → T * F | F
F → id
*
has higher precedence than+
- Operators are left-associative (e.g., a + b + c = (a + b) + c)
2.3.2 Disambiguate with Language Rules
Classic case: "dangling else"
S → if E then S | if E then S else S | other
Make it unambiguous by associating else
with the nearest unmatched if
:
S → M | U
M → if E then M else M | other
U → if E then S
2.3.3 Rewrite the Grammar
Sometimes, restructuring rules removes ambiguity. Example:
Ambiguous:
S → a S b | ε
This allows multiple ways to generate the same string like aabb
.
Unambiguous version (balanced a's and b's):
S → a S b | ε
In this case, the ambiguity was not real — the original grammar was actually unambiguous. You must prove ambiguity formally before rewriting.
2.4 Real-World Compiler Example
In C/C++, the following expression is parsed differently depending on operator precedence:
a + b * c
The compiler refers to a deterministic grammar that gives *
higher precedence:
b * c
is evaluated first- Then
a + (result)
If grammar allowed both (a + b) * c
and a + (b * c)
, it would cause ambiguity. Therefore, operator tables are translated into disambiguated CFGs.
3. Remove Left Recursion
Left recursion in a grammar occurs when a non-terminal calls itself as the first symbol in one of its productions. Left-recursive grammars are problematic for top-down parsers (like LL(1)) because they can cause infinite recursion.
3.1 What is Left Recursion?
A grammar has left recursion if there exists a non-terminal A
such that:
A → A α | β
where α
and β
are sequences of terminals and/or non-terminals, and β
does not start with A
.
Example of left recursion:
E → E + T | T
Here, E
is calling itself first, directly. For LL parsers, this causes an infinite loop.
3.2 Eliminate Immediate Left Recursion
The standard technique is to refactor the grammar by introducing a new non-terminal to handle recursion.
General Rule:
A → A α₁ | A α₂ | ... | β₁ | β₂ | ...
where each βᵢ
does not start with A. This is transformed into:
A → β₁ A' | β₂ A' | ...
A' → α₁ A' | α₂ A' | ... | ε
3.2.1 Example: Arithmetic Expressions
Original grammar (left-recursive):
E → E + T | T
T → T * F | F
F → ( E ) | id
Step-by-step removal of left recursion:
For E:
- Left-recursive: E → E + T
- Non-left-recursive: E → T
Apply transformation:
E → T E'
E' → + T E' | ε
For T:
T → F T'
T' → * F T' | ε
F remains unchanged:
F → ( E ) | id
This grammar is now free from left recursion and suitable for LL(1) parsing.
3.3 Why Remove Left Recursion?
- Top-down parsers use recursive function calls to simulate derivation.
- Left recursion causes these calls to recurse infinitely.
- Removing it enables predictive parsing.
3.4 Real-world Compiler Scenario
In compilers like javac
or g++
, parser generators (like ANTLR or YACC) reject grammars with direct left recursion if configured for top-down parsing.
To write a parser that understands arithmetic expressions, you must first eliminate left recursion as shown above.
4. Apply Left Factoring
Left factoring is a grammar transformation technique used to remove common prefixes from multiple productions of a non-terminal. It is essential when a parser cannot decide which production to choose based on the next input symbol.
4.1 What is Left Factoring?
Given a non-terminal A
with multiple productions starting with the same prefix:
A → α β₁ | α β₂
This makes the parser unable to choose between β₁
and β₂
just by looking at α
. Left factoring rewrites this as:
A → α A'
A' → β₁ | β₂
Now the decision is delayed until enough input is seen.
4.2 Why Apply Left Factoring?
- Predictive parsing requires a single production per lookahead symbol.
- Reduces ambiguity when common prefixes exist.
- Enables LL(1) parser construction with FIRST/FOLLOW sets.
4.2.1 Real-World Analogy
Imagine a voice menu:
Say "book flight"
Say "book hotel"
The system hears "book" and gets confused. Instead, restructure:
Say "book"
Then say "flight" or "hotel"
Left factoring works the same way — delay the decision.
4.3 Detailed Example
Original Grammar:
S → if E then S else S | if E then S | other
This grammar is ambiguous due to common prefix if E then S
. Parser cannot decide between the two branches without lookahead.
Left-factored version:
S → if E then S S'
S' → else S | ε
S → other
Now, after reading if E then S
, the parser checks whether else
follows. If yes, parse else S
. If not, move on.
4.3.1 Another Example
Original Grammar:
A → int id ; | int id = num ;
Common prefix: int id
Left-factored:
A → int id A'
A' → ; | = num ;
Now the parser reads int id
and decides based on the next token.
4.4 Implementation Note
Left factoring is automated in many parser generators. But for handwritten parsers, this step is manual and necessary to avoid parse conflicts.
After left factoring, the grammar becomes suitable for constructing parsing tables or recursive descent parsers.
5. Construct FIRST and FOLLOW Sets
To build an LL(1) parser or check grammar determinism, we must compute two essential sets for each non-terminal:
- FIRST: What terminals can appear first in a derivation from a non-terminal?
- FOLLOW: What terminals can appear immediately after a non-terminal in any valid string?
These sets ensure deterministic parsing decisions by guiding which production to apply based on lookahead symbols.
5.1 FIRST Set
The FIRST(X) of a symbol X is the set of terminals that begin strings derived from X. It includes:
- The terminal itself if X is a terminal
- FIRST of the symbols in the production if X is a non-terminal
ε
if X can derive ε
Rules to Compute FIRST
- If X is a terminal, then FIRST(X) = {X}
- If X → ε is a production, then ε ∈ FIRST(X)
- If X → Y₁ Y₂ ... Yₙ, then:
- add FIRST(Y₁) excluding ε
- if FIRST(Y₁) contains ε, add FIRST(Y₂), and so on
- if all Yᵢ derive ε, add ε to FIRST(X)
5.2 FOLLOW Set
The FOLLOW(A) of a non-terminal A is the set of terminals that can appear immediately after A in some derivation.
Rules to Compute FOLLOW
- If S is the start symbol, add
$
to FOLLOW(S) - If A → α B β, then add FIRST(β) \ {ε} to FOLLOW(B)
- If A → α B or A → α B β and FIRST(β) contains ε, add FOLLOW(A) to FOLLOW(B)
5.3 Detailed Example
Consider this grammar:
S → A B
A → a A | ε
B → b B | c
Step 1: Compute FIRST
FIRST(A)
- A → a A → starts with
a
- A → ε → includes ε
- ⇒ FIRST(A) = {a, ε}
FIRST(B)
- B → b B → starts with
b
- B → c → starts with
c
- ⇒ FIRST(B) = {b, c}
FIRST(S)
- S → A B
- FIRST(A) = {a, ε}
- If A ⇒ ε, then consider FIRST(B) too
- ⇒ FIRST(S) = FIRST(A) ∪ FIRST(B) = {a, b, c}
Step 2: Compute FOLLOW
FOLLOW(S)
- S is start symbol ⇒
$
∈ FOLLOW(S)
FOLLOW(A)
- S → A B ⇒ FIRST(B) = {b, c} ⊆ FOLLOW(A)
- ⇒ FOLLOW(A) = {b, c}
FOLLOW(B)
- B is at the end of S → A B ⇒ FOLLOW(S) = {$} ⊆ FOLLOW(B)
- ⇒ FOLLOW(B) = {$}
Non-terminal | FIRST | FOLLOW |
---|---|---|
S | {a, b, c} | {$} |
A | {a, ε} | {b, c} |
B | {b, c} | {$} |
5.4 How It Enables Determinism
LL(1) table uses:
- FIRST to choose rule on lookahead
- FOLLOW when ε appears in FIRST, to decide when to apply the ε-production
Example: If input starts with a
, parser chooses A → a A. If input is b
or c
, parser chooses A → ε and proceeds with B.
6. Check LL(1) Conditions
A grammar is said to be LL(1) if it can be parsed by an LL(1) parser — a top-down parser that uses:
- Left-to-right scanning of input (first L)
- Leftmost derivation (second L)
- 1-symbol lookahead (1)
For LL(1) parsing, we need to ensure that at every step, the parser can uniquely determine the correct production using only the next input symbol. This requires checking two specific conditions using the FIRST and FOLLOW sets.
6.1 LL(1) Grammar Conditions
For any non-terminal A
with two or more productions:
A → α | β
The grammar is LL(1) only if:
- Condition 1: FIRST(α) ∩ FIRST(β) = ∅
- Condition 2: If ε ∈ FIRST(α), then FIRST(β) ∩ FOLLOW(A) = ∅
- Similarly, if ε ∈ FIRST(β), then FIRST(α) ∩ FOLLOW(A) = ∅
6.2 Detailed Example
Consider the grammar:
S → A B
A → a A | ε
B → b B | c
Step 1: Use Previously Computed FIRST and FOLLOW
Non-terminal | FIRST | FOLLOW |
---|---|---|
S | {a, b, c} | {$} |
A | {a, ε} | {b, c} |
B | {b, c} | {$} |
Step 2: Check LL(1) Conditions for A → a A | ε
- FIRST(a A) = {a}
- FIRST(ε) = {ε}
- ⇒ FIRST(a A) ∩ FIRST(ε) = ∅ ✅
- Since ε ∈ FIRST(A), check FIRST(a A) ∩ FOLLOW(A) = {a} ∩ {b, c} = ∅ ✅
Step 3: Check LL(1) Conditions for B → b B | c
- FIRST(b B) = {b}
- FIRST(c) = {c}
- ⇒ FIRST(b B) ∩ FIRST(c) = ∅ ✅
✅ All conditions are satisfied. Therefore, the grammar is LL(1).
6.3 When Grammar is Not LL(1)
Example of an ambiguous grammar that fails LL(1) condition:
E → E + T | T
FIRST(E + T) = FIRST(E), which causes recursion
We remove left recursion and left factor to make it LL(1)-compliant (as done in earlier sections):
E → T E'
E' → + T E' | ε
Check LL(1) for E' → + T E' | ε
- FIRST(+ T E') = {+}
- FIRST(ε) = {ε}
- FOLLOW(E') = {$, )}
- ⇒ {+} ∩ FOLLOW(E') = ∅ ✅
✅ LL(1) satisfied after transformation.
6.4 Practical Compiler Insight
Most parser generators like ANTLR
or Bison
show a warning when LL(1) conflicts exist. For real-time applications like IDE autocompletion, LL(1) parsing ensures instant and correct parsing without delay or backtracking.
7. Build the Predictive Parsing Table
The Predictive Parsing Table is a two-dimensional table used by LL(1) parsers to decide which production rule to apply, based on:
- Current non-terminal (row)
- Next input symbol (column)
This table enables parsing with one lookahead symbol — hence "LL(1)". It is built using the FIRST
and FOLLOW
sets.
7.1 Construction Steps
- For each production
A → α
:- For each terminal
a
∈ FIRST(α), addA → α
toTABLE[A, a]
- If
ε ∈ FIRST(α)
, then for each terminalb
∈ FOLLOW(A), addA → α
toTABLE[A, b]
- For each terminal
- Mark each invalid entry as error or blank
This ensures that only one entry per cell exists — the defining condition of LL(1).
7.2 Example Grammar
S → A B
A → a A | ε
B → b B | c
From earlier computations:
Non-terminal | FIRST | FOLLOW |
---|---|---|
S | {a, b, c} | {$} |
A | {a, ε} | {b, c} |
B | {b, c} | {$} |
Productions:
S → A B
A → a A | ε
B → b B | c
7.3 FIRST + FOLLOW Table Fill
Non-terminal | a | b | c | $ |
---|---|---|---|---|
S | S → A B | S → A B | S → A B | error |
A | A → a A | A → ε | A → ε | error |
B | error | B → b B | B → c | error |
✅ One production per cell → this grammar is LL(1) and parsing table is ready.
7.4 Real-world Analogy
Think of the parsing table like a GPS system:
- You are at a "non-terminal" (current location)
- You see a "lookahead symbol" (road sign)
- GPS (table) tells you the exact route (production) to follow
Because only one route exists for each sign at each location, you never get lost — that’s the power of LL(1).
7.5 Use in Predictive Parser
The parser uses the table as follows:
stack ← [S, $]
input ← [tokens, $]
while stack not empty:
X = top of stack
a = current input symbol
if X is terminal or $:
if X == a: pop both
else: error
else:
if TABLE[X, a] = X → α:
replace X with α on stack
else:
error
8. Validate Determinism
Validation of determinism ensures the grammar and parsing table conform to the LL(1) requirements — one unique parsing decision for each input symbol at every step. This validation confirms that:
- Each cell in the parsing table contains at most one production rule
- No conflicts exist (no multiple entries or ambiguity)
- FIRST and FOLLOW sets do not overlap improperly
This is the final step to ensure the CFG is now deterministic and parser-ready.
8.1 What Makes a Grammar Non-Deterministic?
A grammar fails determinism if:
FIRST(α) ∩ FIRST(β) ≠ ∅
for two productionsA → α | β
- ε ∈ FIRST(α) and
FIRST(β) ∩ FOLLOW(A) ≠ ∅
- Parsing table has multiple entries for any [Non-terminal, terminal] pair
Example of Conflict
A → a B | a C
BOTH start with a
⇒ FIRST sets overlap ⇒ Not LL(1)
8.2 How to Validate Determinism
- Construct the FIRST and FOLLOW sets
- Build the predictive parsing table
- Scan each table cell:
- Check that each [Non-terminal, Terminal] has ≤ 1 rule
- If any cell has more than one rule, grammar is NOT LL(1)
Automated Validator Logic:
for non_terminal in grammar:
for terminal in terminals:
if len(table[non_terminal][terminal]) > 1:
print("Grammar is not LL(1)")
8.3 Validated Example (From Previous Sections)
Recall:
S → A B
A → a A | ε
B → b B | c
The predictive parsing table:
Non-terminal | a | b | c | $ |
---|---|---|---|---|
S | S → A B | S → A B | S → A B | error |
A | A → a A | A → ε | A → ε | error |
B | error | B → b B | B → c | error |
✅ Each cell contains ≤ 1 rule. Grammar is deterministic.
8.4 Real-World Usage
- Parser generators like
ANTLR
orBison
will flag LL(1) violations - IDE autocompletion, code linters, and syntax highlighters depend on deterministic grammars to work without lag
Without validation, hidden non-determinism causes parsing bugs, miscompilation, or infinite loops.
9. Detailed Step By Step Example From Book
Consider the following grammar:
stmt → if expr then stmt
| if expr then stmt else stmt
| other
The grammar is ambiguous because the else
can be associated with either of the two if
statements. This is known as the "dangling else" problem.
We will go through the steps to make this grammar deterministic.
- Step 1: Identify Ambiguity
- Step 2: Rewrite the Grammar
- Step 3: Remove Left Recursion
- Step 4: Apply Left Factoring
- Step 5: Construct FIRST and FOLLOW Sets
- Step 6: Check LL(1) Conditions
- Step 7: Build the Predictive Parsing Table
- Step 8: Validate Determinism
After following these steps, we can conclude that the grammar is now deterministic and suitable for LL(1) parsing.
In practice, this process is often automated by parser generators, but understanding the underlying principles is crucial for effective grammar design.
9.1 Eliminate Ambiguity — Step-by-Step (Dangling-Else Problem)
This classic ambiguity occurs in conditional statements where an else
can be associated with more than one if
. The example from the book illustrates the problem and the transformation to eliminate it.
9.1.1 Original Ambiguous Grammar
stmt → if expr then stmt
| if expr then stmt else stmt
| other
This grammar is ambiguous. For instance, the sentence:
if E1 then if E2 then S1 else S2
can be parsed in two ways — either associating the else
with the first if
or the second. But programming languages like C, Java, etc., match each else with the closest unmatched then — this preference must be encoded into an unambiguous grammar.
9.1.2 Disambiguation Strategy
Split statements into two categories:
- Matched: Fully enclosed if-then-else (no ambiguity)
- Open: May have unmatched then
This transformation ensures else
is always associated with the nearest unmatched then
.
9.1.3 Unambiguous Grammar
stmt → matched_stmt | open_stmt
matched_stmt → if expr then matched_stmt else matched_stmt
| other
open_stmt → if expr then stmt
| if expr then matched_stmt else open_stmt
Now, for:
if E1 then if E2 then S1 else S2
It parses as:
stmt
└── open_stmt
└── if expr then matched_stmt else open_stmt
└── matched_stmt → if expr then matched_stmt else matched_stmt
This guarantees else
always binds to the nearest then
.
9.1.4 Real-World Example
In Java/C:
if (x > 0)
if (y > 0)
System.out.println("A");
else
System.out.println("B");
The else
binds to the inner if
. The rewritten unambiguous grammar matches this behavior exactly, which is why it is preferred in compiler frontend design.
9.1.5 Summary Table
Form | Meaning | Handled by |
---|---|---|
if E1 then if E2 then S1 else S2 |
Associates else with inner if | Unambiguous Grammar |
if E1 then if E2 then S1 |
No ambiguity | Both grammars |
if E1 then S1 else if E2 then S2 else S3 |
Right association of elses | Unambiguous Grammar |
9.1.6 Why This Matters
- Ambiguity causes confusion for parser generators (LL/LR)
- Ambiguous grammars can't be parsed without extra logic
- Unambiguous grammar directly maps source to single parse tree
9.2 Remove Left Recursion — Step-by-Step on Disambiguated If-Else Grammar
In the unambiguous grammar constructed to eliminate the dangling-else ambiguity, we now check and eliminate left recursion to make it suitable for top-down (LL(1)) parsing.
Let’s recall the grammar from 9.1:
stmt → matched_stmt | open_stmt
matched_stmt → if expr then matched_stmt else matched_stmt | other
open_stmt → if expr then stmt | if expr then matched_stmt else open_stmt
None of these productions are directly left-recursive, but they do contain potential indirect left recursion. Let’s analyze step-by-step.
9.2.1 Step 1: List All Non-Terminals
- S = stmt
- M = matched_stmt
- O = open_stmt
We’ll check for left recursion in each non-terminal: stmt, matched_stmt, open_stmt.
9.2.2 Step 2: Substitution for Indirect Recursion
First substitute stmt → matched_stmt | open_stmt
into the productions where stmt
appears, namely in open_stmt
:
open_stmt → if expr then matched_stmt
| if expr then open_stmt
| if expr then matched_stmt else open_stmt
The rule if expr then open_stmt
has the non-terminal open_stmt
as the first symbol after expansion ⇒ this is direct left recursion.
9.2.3 Step 3: Eliminate Direct Left Recursion in open_stmt
We now isolate the recursive and non-recursive parts of open_stmt
.
Recursive form:
open_stmt → if expr then open_stmt
Non-recursive forms:
open_stmt → if expr then matched_stmt
open_stmt → if expr then matched_stmt else open_stmt
Apply standard transformation:
open_stmt → if expr then matched_stmt open_stmt'
open_stmt' → else open_stmt open_stmt' | ε
This eliminates the left recursion while preserving the structure.
9.2.4 Final Refactored Grammar (Left Recursion Removed)
stmt → matched_stmt | open_stmt
matched_stmt → if expr then matched_stmt else matched_stmt
| other
open_stmt → if expr then matched_stmt open_stmt'
open_stmt' → else open_stmt open_stmt' | ε
Explanation:
- open_stmt now begins with a non-recursive form only
- open_stmt' handles the repeated chaining of
else open_stmt
recursively (right-recursion only)
9.2.5 Parse Tree Example (No Left Recursion)
For the string:
if E1 then matched_stmt else open_stmt else open_stmt
Derivation:
open_stmt
→ if expr then matched_stmt open_stmt'
→ if expr then matched_stmt else open_stmt open_stmt'
→ if expr then matched_stmt else open_stmt else open_stmt open_stmt'
At each stage, we apply only one rule — determinism is preserved and no left recursion occurs.
9.2.6 Why This Matters
- Removes infinite loops in recursive-descent parsers
- Enables FIRST and FOLLOW computation
- Makes grammar suitable for predictive parsing (LL(1))
9.3 Apply Left Factoring — Step-by-Step on Disambiguated & Recursion-Free Grammar
Now that we have eliminated ambiguity and left recursion, the final step before constructing the parsing table is to apply left factoring. This transformation ensures that the parser can always make a deterministic choice based on a single lookahead symbol.
9.3.1 Refactored Grammar from Previous Step
stmt → matched_stmt | open_stmt
matched_stmt → if expr then matched_stmt else matched_stmt
| other
open_stmt → if expr then matched_stmt open_stmt'
open_stmt' → else open_stmt open_stmt' | ε
We will inspect each non-terminal and factor out any common prefixes.
9.3.2 Step 1: Left Factoring matched_stmt
matched_stmt → if expr then matched_stmt else matched_stmt
| other
No common prefix between the two alternatives: if expr then ...
and other
. ✅ No factoring needed here.
9.3.3 Step 2: Left Factoring open_stmt
open_stmt → if expr then matched_stmt open_stmt'
Only one production exists, so ✅ nothing to factor.
9.3.4 Step 3: Left Factoring open_stmt'
open_stmt' → else open_stmt open_stmt' | ε
Here too, the productions are else ...
and ε
, with no overlapping prefix. ✅ No factoring needed.
9.3.5 Is Left Factoring Required at All?
No, because:
- Each non-terminal has unique prefixes
- No two productions of the same non-terminal begin with the same token (terminal)
- The grammar is already in LL(1)-ready form after ambiguity removal and left recursion removal
9.3.6 Final LL(1)-Ready Grammar (No Left Factoring Needed)
stmt → matched_stmt | open_stmt
matched_stmt → if expr then matched_stmt else matched_stmt
| other
open_stmt → if expr then matched_stmt open_stmt'
open_stmt' → else open_stmt open_stmt' | ε
✅ Already suitable for predictive parsing table construction.
9.3.7 Real-World Impact
- Parser never needs backtracking or multi-symbol lookahead
- Code completion engines in IDEs like VSCode rely on such deterministic grammars
- This structure reflects real language design choices (e.g., C/C++ associating else with nearest if)
9.4 Construct FIRST and FOLLOW Sets — Step-by-Step on Final Grammar
To build a predictive parser, we must compute FIRST and FOLLOW sets for each non-terminal in the grammar. These sets tell us which production to choose when parsing a symbol and help handle ε
-productions properly.
9.4.1 Final Grammar
stmt → matched_stmt | open_stmt
matched_stmt → if expr then matched_stmt else matched_stmt
| other
open_stmt → if expr then matched_stmt open_stmt'
open_stmt' → else open_stmt open_stmt' | ε
Non-terminals: stmt, matched_stmt, open_stmt, open_stmt'
Terminals: if
, expr
, then
, else
, other
Note: expr
is treated as a terminal (assumed lexical unit).
9.4.2 Compute FIRST Sets
FIRST(matched_stmt)
matched_stmt → if expr then matched_stmt else matched_stmt
→ other
Starts with either if
or other
FIRST(matched_stmt) = {if, other}
FIRST(open_stmt)
open_stmt → if expr then matched_stmt open_stmt'
Starts with if
FIRST(open_stmt) = {if}
FIRST(open_stmt')
open_stmt' → else open_stmt open_stmt' | ε
FIRST(open_stmt') = {else, ε}
FIRST(stmt)
stmt → matched_stmt | open_stmt
FIRST(matched_stmt) = {if, other}, FIRST(open_stmt) = {if}
⇒ Combined: FIRST(stmt) = {if, other}
9.4.3 Compute FOLLOW Sets
Rules:
- Add
$
to FOLLOW(start symbol) - If A → αBβ, then FIRST(β) \ {ε} ⊆ FOLLOW(B)
- If A → αB or A → αBβ and ε ∈ FIRST(β), then FOLLOW(A) ⊆ FOLLOW(B)
FOLLOW(stmt)
- stmt is start symbol ⇒ FOLLOW(stmt) = {$}
FOLLOW(matched_stmt)
Occurs in:
- stmt → matched_stmt → FOLLOW(stmt) = {$}
- open_stmt → if expr then matched_stmt open_stmt'
- ⇒ FIRST(open_stmt') = {else, ε}
- matched_stmt → if expr then matched_stmt else matched_stmt
- matched_stmt before and after 'else'
So we add:
else
from FIRST(open_stmt')$
from FOLLOW(stmt)
FOLLOW(matched_stmt) = {else, $}
FOLLOW(open_stmt)
- stmt → open_stmt ⇒ FOLLOW(stmt) = {$}
- open_stmt' → else open_stmt open_stmt'
- ⇒ FOLLOW(open_stmt') ⊆ FOLLOW(open_stmt)
- ⇒ include FOLLOW(open_stmt') (to be determined)
Assume temporary FOLLOW(open_stmt') = X
So far: FOLLOW(open_stmt) = {else, $}
FOLLOW(open_stmt')
- From open_stmt → if expr then matched_stmt open_stmt'
- ⇒ FOLLOW(open_stmt) ⊆ FOLLOW(open_stmt')
- From open_stmt' → else open_stmt open_stmt'
- ⇒ FOLLOW(open_stmt') ⊆ FOLLOW(open_stmt')
- Recursive, but no new terminals
Eventually: FOLLOW(open_stmt') = FOLLOW(open_stmt) = {else, $}
9.4.4 Final FIRST and FOLLOW Table
Non-terminal | FIRST | FOLLOW |
---|---|---|
stmt | {if, other} | {$} |
matched_stmt | {if, other} | {else, $} |
open_stmt | {if} | {else, $} |
open_stmt' | {else, ε} | {else, $} |
9.4.5 Why This Step is Crucial
- Used directly to build the LL(1) parsing table
- Validates correctness and determinism of the grammar
- Reveals any hidden ambiguity or conflict via FIRST/FOLLOW overlap
9.5 Check LL(1) Conditions — Step-by-Step
To validate that the grammar is LL(1), we must confirm that at every choice point for a non-terminal, the parser can make a unique decision using a single lookahead symbol.
This is done using FIRST and FOLLOW sets from step 9.4.
9.5.1 LL(1) Conditions to Check
For each non-terminal A
with multiple productions A → α | β | ...
:
- Condition 1: FIRST(α) ∩ FIRST(β) = ∅
- Condition 2: If ε ∈ FIRST(α), then FIRST(β) ∩ FOLLOW(A) = ∅
If these conditions hold for all production choices, the grammar is LL(1).
9.5.2 Grammar
stmt → matched_stmt | open_stmt
matched_stmt → if expr then matched_stmt else matched_stmt
| other
open_stmt → if expr then matched_stmt open_stmt'
open_stmt' → else open_stmt open_stmt' | ε
9.5.3 Step-by-Step LL(1) Check
Check stmt → matched_stmt | open_stmt
- FIRST(matched_stmt) = {if, other}
- FIRST(open_stmt) = {if}
- ⇒ {if} ∩ {if, other} ≠ ∅ ❌
- Conflict: Overlap on 'if' → violates Condition 1
However, this is a false conflict because matched_stmt and open_stmt differ structurally, and parser can be rewritten using a single entry for 'if'. But in strict LL(1), this overlap is not allowed unless handled via factoring or lookahead.
Check matched_stmt → if ... | other
- FIRST(if ...) = {if}
- FIRST(other) = {other}
- ⇒ {if} ∩ {other} = ∅ ✅
Check open_stmt' → else open_stmt open_stmt' | ε
- FIRST(else open_stmt open_stmt') = {else}
- FIRST(ε) = {ε}
- FOLLOW(open_stmt') = {else, $}
- ⇒ {else} ∩ FOLLOW(open_stmt') = {else} ❌
- Violates Condition 2 because
ε
is in FIRST and FIRST(other alt) ∩ FOLLOW ≠ ∅
This means the parser may not know whether to apply ε or else-rule upon seeing 'else' — ambiguity for predictive parser.
9.5.4 Final Verdict
Non-terminal | Conflict? | Reason |
---|---|---|
stmt | ❌ | FIRST(matched_stmt) ∩ FIRST(open_stmt) = {if} |
matched_stmt | ✅ | FIRSTs disjoint → {if} ∩ {other} = ∅ |
open_stmt' | ❌ | FIRST(else) ∩ FOLLOW(open_stmt') = {else} |
So while the grammar was made unambiguous and left-recursion free, it is not strictly LL(1) due to the FIRST/FOLLOW conflicts.
Possible Fix?
- Merge stmt → open_stmt only and handle matched_stmt structurally inside
- Use parser with lookahead or semantic check
- Or redesign
stmt
productions usingif_stmt
to split concern
9.5.5 Real Compiler Insight
- Many real compilers use LR parsers or GLR parsers to handle such patterns
- Some use disambiguating rules like "else binds to nearest if"
- Still, LL(1) is useful in hand-written or simple parser cases
9.6 Build the Predictive Parsing Table — Step-by-Step
To construct the LL(1) Predictive Parsing Table, we use the FIRST and FOLLOW sets calculated in step 9.4 and validate their entries using step 9.5. Each cell in the table is filled with a production rule according to these rules:
Table Construction Rules:
- For each production
A → α
, and for each terminala ∈ FIRST(α)
, addA → α
toTABLE[A, a]
- If ε ∈ FIRST(α), then for each terminal
b ∈ FOLLOW(A)
, addA → α
toTABLE[A, b]
Note: If a cell already has an entry, it indicates a conflict (non-LL(1)).
We will now fill the parsing table for our grammar step-by-step.
9.6.1 Grammar Recap
stmt → matched_stmt | open_stmt
matched_stmt → if expr then matched_stmt else matched_stmt
| other
open_stmt → if expr then matched_stmt open_stmt'
open_stmt' → else open_stmt open_stmt' | ε
Start Symbol: stmt
9.6.2 FIRST and FOLLOW (from step 9.4)
Non-terminal | FIRST | FOLLOW |
---|---|---|
stmt | {if, other} | {$} |
matched_stmt | {if, other} | {else, $} |
open_stmt | {if} | {else, $} |
open_stmt' | {else, ε} | {else, $} |
9.6.3 Fill the Predictive Parsing Table
Non-terminal | if | other | else | $ |
---|---|---|---|---|
stmt | stmt → matched_stmt stmt → open_stmt ⚠️ |
stmt → matched_stmt | error | error |
matched_stmt | matched_stmt → if expr then matched_stmt else matched_stmt | matched_stmt → other | error | error |
open_stmt | open_stmt → if expr then matched_stmt open_stmt' | error | error | error |
open_stmt' | error | error | open_stmt' → else open_stmt open_stmt' | open_stmt' → ε |
⚠️ LL(1) Conflict at [stmt, if]
Due to overlapping FIRST sets of matched_stmt
and open_stmt
. This is what makes the grammar non-LL(1).
9.6.4 Final Notes
- Except for
stmt
, all non-terminals are LL(1) compliant - Conflicting entry in [stmt, if] → more than one production applicable
- Resolved using:
- Grammar transformation (refactor
stmt
) - Or switch to a more powerful parser (e.g. LL(k), LR)
- Grammar transformation (refactor
9.6.5 Real-World Compiler Relevance
- Parsing table is used in recursive-descent parser to drive parsing logic
- Conflicts indicate a need to redesign grammar or increase lookahead
- In tools like ANTLR, this table is built behind the scenes but essential to parser generation
9.7 Simulate Parsing with Stack — Step-by-Step
We simulate how a predictive parser uses the parsing table and a stack to parse an input string. The parser maintains:
- A stack initialized with the start symbol and
$
(end-marker) - An input buffer containing tokens followed by
$
- A parsing table to decide which production to apply based on [top of stack, lookahead]
We simulate this using our grammar from step 9.6.
9.7.1 Grammar Recap (Final Form)
stmt → matched_stmt | open_stmt
matched_stmt → if expr then matched_stmt else matched_stmt
| other
open_stmt → if expr then matched_stmt open_stmt'
open_stmt' → else open_stmt open_stmt' | ε
9.7.2 Input String to Parse
if expr then other else other $
Token stream: if, expr, then, other, else, other, $
9.7.3 Stack Parsing Simulation Table
Start stack: [stmt, $]
Step | Stack | Input | Action |
---|---|---|---|
1 | stmt, $ | if expr then other else other $ | stmt → matched_stmt (from table) |
2 | matched_stmt, $ | if expr then other else other $ | matched_stmt → if expr then matched_stmt else matched_stmt |
3 | if, expr, then, matched_stmt, else, matched_stmt, $ | if expr then other else other $ | Match if |
4 | expr, then, matched_stmt, else, matched_stmt, $ | expr then other else other $ | Match expr |
5 | then, matched_stmt, else, matched_stmt, $ | then other else other $ | Match then |
6 | matched_stmt, else, matched_stmt, $ | other else other $ | matched_stmt → other |
7 | other, else, matched_stmt, $ | other else other $ | Match other |
8 | else, matched_stmt, $ | else other $ | Match else |
9 | matched_stmt, $ | other $ | matched_stmt → other |
10 | other, $ | other $ | Match other |
11 | $ | $ | Parsing successful ✅ |
9.7.4 Key Observations
- Each step used a single production from the parsing table
- Stack mirrors the expected derivation
- No backtracking or ambiguity occurred
- Parser terminated successfully when both input and stack reached
$
9.7.5 Real-World Use
- Recursive-descent parsers simulate this logic with real function calls instead of a stack
- Compiler frontend uses such simulation to parse programs token-by-token
- Errors are caught by mismatches in top-of-stack vs input lookahead