Deterministic CFG - CSU086

Deterministic Context-Free Grammar

1. Deterministic Context-Free Grammar (DCFG)

A Deterministic Context-Free Grammar (DCFG) is a special kind of Context-Free Grammar (CFG) whose strings can be parsed by a Deterministic Pushdown Automaton (DPDA). This means that at any step in parsing, there is at most one valid move based on the current state, stack top, and input symbol—no guessing or backtracking required.

1.1 What is DCFG?

DCFG is a subset of CFG. Just like every square is a rectangle but not every rectangle is a square, every DCFG is a CFG, but not every CFG is a DCFG.

DCFG: Can be parsed using a deterministic pushdown automaton.
CFG: May require non-deterministic pushdown automata to parse.

Think of DCFG as a traffic system where only one turn is allowed at each signal. In contrast, CFG may have multiple valid roads at each signal, requiring a map or trial-and-error.

1.2 Why is DCFG Needed?

DCFGs are the foundation of efficient parsers like LL(1), LR(1), and LALR(1), which are used in real-world programming languages such as C, Java, and Python.

They provide:

Determinism: Removes ambiguity—only one valid way to parse.
Efficiency: Linear-time parsing using stack-based methods.
Implementability: Suitable for compiler design and syntax analyzers.

Example from the real world: Imagine you're building a compiler. DCFG ensures you can write a fast, predictable parser without backtracking. This is crucial when compiling millions of lines of code.

1.3 Difference Between CFG and DCFG

Aspect	CFG	DCFG
Parsing Machine	NPDA (Non-deterministic PDA)	DPDA (Deterministic PDA)
Backtracking	Allowed	Not Allowed
Ambiguity	Can be ambiguous	Must be unambiguous
Efficiency	Less efficient for large grammars	Highly efficient (used in real-world parsers)
Examples	Natural language, ambiguous constructs	Programming languages like C, Java

1.4 Real-World Example

CFG Example: Consider a grammar that generates expressions like if-then-else.


S → if E then S else S | if E then S | a

This CFG is ambiguous. A string like if E then if E then a else a has two different parse trees. So it is not deterministic.

DCFG Example: Modify the grammar to enforce determinism:


S → M else S | a
M → if E then S

This grammar ensures there's only one parse tree by delaying the parsing of else until a full if-then is complete.

1.5 Visualizing the Automaton

stateDiagram-v2
    [*] --> q0
    q0 --> q1 : read 'if'
    q1 --> q2 : read 'then'
    q2 --> q3 : read 'else'
    q3 --> [*]

This shows a DPDA handling a deterministic grammar with no state conflict.

2. LL(1), LR(1), LALR(1) Grammars

These are types of Deterministic Context-Free Grammars (DCFGs) used in real-world compilers. The names reflect how they parse and how much lookahead they use:

LL(1): Left-to-right scan, Leftmost derivation, 1 token lookahead.
LR(1): Left-to-right scan, Rightmost derivation in reverse, 1 token lookahead.
LALR(1): LookAhead LR – merges states of LR(1) to reduce table size.

2.1 LL(1) Grammar

LL(1) parsers are top-down and predictive. They require the grammar to be:

Left-factored: No common prefixes in rules.
No left recursion: Cannot call itself first in the production rule.

Real-world analogy: Reading a document line by line, making a decision without going back.

2.1.1 Example: LL(1) Grammar


E  → T E'
E' → + T E' | ε
T  → F T'
T' → * F T' | ε
F  → ( E ) | id

This grammar has no left recursion and no common prefixes, so it's LL(1).

2.1.2 LL(1) Parsing Table Construction


Follow(E)  = { $, ) }
First(E)   = { (, id }

Table:
+-------+--------+---------+-------+------+
| NonT  |   id   |   (     |   )   |  $   |
+-------+--------+---------+-------+------+
|   E   | E → T E'| E → T E'|       |      |
|  E'   |        |         | E'→ε  |E'→ε  |
|   T   | T → F T'|T → F T'|       |      |
|  T'   |        |         | T'→ε |T'→ε |
|   F   |F→id    |F→( E )  |       |      |
+-------+--------+---------+-------+------+

2.2 LR(1) Grammar

LR(1) parsers are bottom-up and shift-reduce based. They are powerful and can parse all DCFGs, including those with left recursion.

Real-world analogy: Reading a sentence backwards and reconstructing the grammar rules used to create it.

2.2.1 Example: LR(1) Grammar


S → L = R | R
L → * R | id
R → L

This grammar is not LL(1) due to ambiguity, but it is LR(1).

2.2.2 LR(1) Item and State


Item: [S → • L = R, $]

Closure and GOTO algorithms build a DFA from such items.

2.3 LALR(1) Grammar

LALR(1) is a practical compromise: same power as LR(1) for most programming languages but with fewer states. It's the default in most parser generators (YACC, Bison).

2.3.1 Difference: LR(1) vs LALR(1)

LR(1): Full lookahead and item distinction → large table
LALR(1): Merges states with same core items → smaller table, risk of conflicts

2.3.2 Merging States in LALR


LR(1) Item Set 1: [A → α •, a]
LR(1) Item Set 2: [A → α •, b]

LALR merges them:
[A → α •, a/b]

If merging introduces shift-reduce or reduce-reduce conflicts, grammar must be refactored.

2.4 Comparison Summary

Parser	Direction	Derivation	Lookahead	Strength	Speed	Used In
LL(1)	Left-to-right	Leftmost	1	Simple, fast	Very fast	Handwritten parsers, ANTLR
LR(1)	Left-to-right	Rightmost (reverse)	1	Powerful	Moderate	Bison, YACC (full mode)
LALR(1)	Left-to-right	Rightmost (reverse)	1	Balanced	Fast	Most practical compilers

2.5 Real World Scenarios

C/C++ compilers use LALR(1) parsers for speed and size optimization.
Interpreter tools like ANTLR use LL(*) variations for user-friendliness and custom grammars.
High-performance parsers for systems programming languages prefer LR(1) for accuracy and grammar depth.

3. LL(1) Grammar and DCFG

LL(1) is a subclass of Deterministic Context-Free Grammar (DCFG). It is the simplest and most efficient form of parsing grammar, ideal for top-down parsers known as Predictive Parsers. The ‘1’ indicates one-symbol lookahead is enough to make parsing decisions without backtracking.

3.1 Definition of LL(1)

An LL(1) grammar is a type of context-free grammar that can be parsed by a top-down parser:

L: Left-to-right scanning of the input.
L: Leftmost derivation of the production.
1: One lookahead symbol is used to decide the parsing action.

Think of it like reading instructions step-by-step without needing to look ahead more than one word. This makes the parser very fast and memory-efficient.

3.2 Conditions for LL(1) Grammar

To qualify as LL(1), a grammar must satisfy the following:

No Left Recursion: A non-terminal should not call itself as the first symbol on the right-hand side.
: If two productions for the same non-terminal begin with the same symbol, they must be rewritten to eliminate the ambiguity.
Disjoint FIRST and FOLLOW Sets: For any non-terminal A with productions A → α | β, the following must hold:
- FIRST(α) ∩ FIRST(β) = ∅
- If ε ∈ FIRST(α), then FIRST(β) ∩ FOLLOW(A) = ∅

This ensures that the parser can always choose the correct production with only one token of lookahead.

3.3 LL(1) Parsing Table Example

Given the grammar:


E  → T E'
E' → + T E' | ε
T  → F T'
T' → * F T' | ε
F  → ( E ) | id

Step 1: Compute FIRST and FOLLOW


FIRST(E)  = FIRST(T) = FIRST(F) = { (, id }
FIRST(E') = { +, ε }
FIRST(T') = { *, ε }
FOLLOW(E)  = { ), $ }
FOLLOW(E') = FOLLOW(E) = { ), $ }
FOLLOW(T)  = FIRST(E') ∪ FOLLOW(E) = { +, ), $ }
FOLLOW(T') = FOLLOW(T) = { +, ), $ }
FOLLOW(F)  = FIRST(T') ∪ FOLLOW(T) = { *, +, ), $ }

Step 2: LL(1) Table

Non-Terminal	id	(	+	*	)	$
E	E → T E'	E → T E'
E'			E' → + T E'		E' → ε	E' → ε
T	T → F T'	T → F T'
T'			T' → ε	T' → * F T'	T' → ε	T' → ε
F	F → id	F → ( E )

3.4 Predictive Parsing

Predictive Parser is a non-backtracking recursive descent parser that uses the LL(1) table to parse the input.

Steps:

Start with a stack containing $ and the start symbol.
Read the input symbol.
Use the parsing table to replace the top of the stack.
Repeat until the stack matches the input or an error occurs.

Algorithm:


initialize stack with [$, StartSymbol]
repeat:
    X ← top of stack
    a ← current input token
    if X is terminal or $:
        if X == a:
            pop X and advance input
        else:
            error
    else:
        if table[X, a] = X → Y1 Y2 ... Yn:
            pop X
            push Yn ... Y1
        else:
            error
until stack is empty

Example:

Parsing: id + id * id


Stack              Input             Action
$ E                id + id * id $    Use E → T E'
$ E' T             id + id * id $    Use T → F T'
$ E' T' F          id + id * id $    Use F → id
$ E' T' id         id + id * id $    Match id
$ E' T'            + id * id $       T' → ε
$ E'               + id * id $       E' → + T E'
... (continues)

4. LR(1) Grammar and DCFG

LR(1) grammars belong to the family of Deterministic Context-Free Grammars (DCFGs) and are used in bottom-up parsing. Unlike LL(1) (top-down), LR(1) works by reconstructing the rightmost derivation in reverse. These grammars are powerful enough to parse complex programming languages like C/C++ with precision.

4.1 Bottom-Up Parsing

Bottom-Up Parsing starts from the input string and tries to build the parse tree up to the start symbol. It works using a Shift-Reduce mechanism:

Shift: Push the next input symbol onto the stack.
Reduce: Replace the handle (a right-hand side of a production) with its corresponding non-terminal.

Example: Given grammar:


S → a A d
A → b c

Parsing a b c d proceeds as:


Stack       Input        Action
$           a b c d $    Shift
$ a         b c d $      Shift
$ a b       c d $        Shift
$ a b c     d $          Reduce b c → A
$ a A       d $          Shift
$ a A d     $            Reduce a A d → S
$ S         $            Accept

4.2 Canonical LR(1) vs SLR(1) vs LALR(1)

4.2.1 Canonical LR(1)

Most powerful LR parser.
Uses item sets with full 1-symbol lookahead.
Large parsing tables (can reach thousands of states).


Item: [A → α • β, a]

Means: Currently in the middle of rule A → αβ, we’ve parsed α, expect β next, and ‘a’ is a lookahead terminal to help resolve reduce conflicts.

4.2.2 SLR(1): Simple LR

Based on LR(0) items + FOLLOW sets.
Compact but less powerful.
Fails for some grammars where Canonical LR(1) works.

4.2.3 LALR(1): LookAhead LR

Merges states with same core (LR(0) items), even if lookahead differs.
Same number of states as SLR(1), but more precise due to 1-symbol lookahead.
Used in real-world tools like YACC, Bison.

Parser	Item Type	Lookahead	State Count	Power
Canonical LR(1)	LR(1)	✓	High	Highest
SLR(1)	LR(0)	✗ (uses FOLLOW)	Low	Low
LALR(1)	LR(1) merged	✓	Medium	Medium (Practical)

4.3 Parsing Table Conflicts

Conflicts occur when the parser can't decide between Shift and Reduce, or between multiple Reduces.

4.3.1 Shift-Reduce Conflict

Occurs when parser can either shift or reduce at a particular state and input symbol.

Example:


if E then S else S

Ambiguity: does else belong to inner or outer if?

4.3.2 Reduce-Reduce Conflict

Occurs when multiple productions are valid reductions at same point.

Example: Grammar:


S → A | B
A → a
B → a

On input a, the parser cannot choose whether to reduce using A or B.

4.4 Real-World Relevance

Why use LR(1) or LALR(1)?

Can handle more complex grammars than LL(1), including left recursion.
Real-world compilers for C, C++, Java, Rust use LALR(1).
Bison/YACC: Tools to auto-generate LALR(1) parsers from grammar specs.
Allows efficient, deterministic parsing with no backtracking.

Example: In C language parsing:


// Grammar Snippet:
declaration → type id ;
type → int | float | char

Using LALR(1), we can efficiently and correctly parse these using parsing tables even with ambiguous symbols like * in declarations.

5. Applications and Limitations of DCFGs

Deterministic Context-Free Grammars (DCFGs) are the foundation of real-world compiler syntax analysis. They represent the class of languages that can be parsed deterministically using a single pass and a stack — critical for performance in programming languages. However, they have boundaries and trade-offs in grammar expressiveness.

5.1 Use in Real-World Compilers

DCFGs are central to efficient parsing in modern programming languages. Their deterministic nature allows for predictable and fast parser behavior.

Programming Language Parsers: Most languages use DCFGs for syntax parsing — C, C++, Java, Python, Go, Rust.
Parser Generators: Tools like YACC, Bison, ANTLR, JavaCC depend on DCFGs (LL(1), LALR(1), etc.).
IDE Syntax Highlighting & Auto-completion: Use predictive parsing based on LL(1) variants.
Syntax-Directed Translation: Intermediate code generation relies on deterministic structure from DCFGs.
Code Formatting Tools: Use LL-based parsers for checking indentation and blocks.

Real-world Example: A C compiler uses a LALR(1) parser (a DCFG parser) to read code like int x = 10; and translate it to machine-level intermediate representation using only one token of lookahead.

5.2 Limitations of DCFGs

DCFGs are not powerful enough for every aspect of a language. Some limitations include:

Cannot handle ambiguous grammars directly (like natural language).
Cannot parse context-sensitive constructs like:
- Variable declarations vs usage
- Type checking
- Indentation-based syntax (e.g., Python)
- Matching nested typedef names or symbol table lookups
Cannot detect semantic errors: e.g., using an undeclared variable.
Some grammars are deterministic but require more than one lookahead symbol (LL(k) or LR(k)) → increased complexity.

Example: The famous if-then-else ambiguity cannot be resolved using simple LL(1) without modifying the grammar.

5.3 Trade-offs: Simplicity vs Power

Feature	LL(1)	LALR(1)	Full CFG
Speed	Fastest	Fast	Slow (may require backtracking)
Memory Use	Low	Moderate	High
Grammar Complexity	Low	Moderate	High
Ambiguity Handling	✗	✗	✓ (if allowed)
Use in Compilers	Simple languages, expression evaluators	Most real-world compilers	Rarely used in compilers directly

Designers choose DCFGs (LL/LR variants) because:

They are efficient to parse.
They strike a balance between expressiveness and determinism.
They integrate well with semantic analyzers and code generators.

In practice, complex features not expressible in DCFG are handled later during semantic analysis or by refactoring the grammar.