1. Introduction to Parsers
A parser is the second phase of a compiler that receives tokens from the lexical analyzer and builds a structured representation of the source code, typically a parse tree.
1.1 What is a Parser?
A parser is a system component that checks whether the sequence of tokens from the lexical analyzer form a syntactically valid program according to the rules of a programming language defined by a grammar (typically a context-free grammar).
Think of the parser as a grammar-checking assistant for a programmer — just like Microsoft Word underlines syntactic mistakes in sentences, a parser flags invalid code constructs.
There are two main types of parsers:
- Top-Down Parsers: Start from the root (start symbol) and try to rewrite it to match the input string.
- Bottom-Up Parsers: Start from the input string and try to reduce it to the start symbol.
graph TD A[Parsers] --> B[Top-Down Parsers] A --> C[Bottom-Up Parsers] %% Top-Down Parsers Breakdown B --> B1[Recursive Descent Parser] B1 --> B1a[Backtracking Parser] B1 --> B1b[Non-Backtracking Parser] B --> B2[Predictive Parser] B2 --> B2a["LL(1) Parser"] B2 --> B2b["LL(k) Parser"] B2 --> B2c[Table-Driven Predictive Parser] %% Bottom-Up Parsers Breakdown C --> C1[Shift-Reduce Parser] C1 --> C2[LR Parsers] C2 --> C2a["SLR (Simple LR) Parser"] C2 --> C2b["Canonical LR(1) Parser"] C2 --> C2c[LALR Parser] C1 --> C3[Operator Precedence Parser] C1 --> C4[Handle Pruning Parser] %% Annotations for Conceptual Understanding B1a:::note --> Z1[Allows backtracking to try alternatives] B2a:::note --> Z2[Uses single-token lookahead and parsing table] C2b:::note --> Z3[Most powerful LR parser with full lookahead] C2c:::note --> Z4[Compromise between power and efficiency; used in YACC] C3:::note --> Z5[Works only for operator grammars with no ambiguities]
1.2 Why is a Parser Used?
A parser is used to:
- Verify syntax: Ensure that the input source code conforms to the grammatical rules of the language.
- Generate parse tree: Provide a structured representation that guides later phases like semantic analysis, optimization, and code generation.
- Handle syntax errors: Detect and recover from invalid code structures.
Real-world scenario: Imagine an online tax form. You can only submit it if all fields follow a specific format. The parser in a compiler ensures "code submission" is in the correct format so it doesn't crash the system during execution.
1.3 Advantages and Disadvantages of Parsers
1.3.1 Advantages
- Syntax checking: Detects errors early in the compilation process.
- Structure extraction: Helps convert raw tokens into a tree-like structure for further analysis.
- Automation via tools: Parsers can be generated automatically (e.g., using Yacc or Bison).
- Supports modular compiler design: Separates syntax verification from other compiler phases.
1.3.2 Disadvantages
- Grammar restrictions: Not all grammars can be parsed by all parsers (e.g., left-recursion is not allowed in LL parsers).
- Error recovery complexity: Writing robust parsers that recover well from syntax errors is difficult.
- Ambiguity handling: Requires techniques like precedence rules or grammar refactoring to resolve ambiguous constructs.
Aspect | Top-Down Parsing | Bottom-Up Parsing |
---|---|---|
Direction | From root to leaves | From leaves to root |
Common Parsers | Recursive-descent, LL(1) | LR(0), SLR, LALR |
Use in Industry | Hand-written parsers | Tool-generated parsers (e.g., Yacc) |
Example (Expression Grammar):
E → E + T | T
T → T * F | F
F → ( E ) | id
This grammar defines how arithmetic expressions are parsed into meaningful operations.
1.4 Architecture of a Parser
The parser is a core component of the compiler, sitting between the lexical analyzer and the semantic analyzer. Its architecture ensures structured validation and interpretation of the input tokens according to the grammar of the language.
1.4.1 Key Components
- Input: Stream of tokens from the lexical analyzer.
- Parsing Engine: Uses grammar rules to build the structure (parse tree).
- Syntax Tree / Parse Tree: Represents the hierarchical structure of the source code.
- Error Handler: Detects and reports syntax errors with possible recovery strategies.
- Output: Parse tree for semantic analysis.
1.4.2 Functional Flow
As tokens flow into the parser, they are matched against grammar rules. If valid, a node is added to the parse tree. Otherwise, the error handler manages the syntax error based on recovery techniques like panic mode or phrase-level correction.
1.4.3 Mermaid Diagram: Parser Architecture
graph TD A[Source Code] --> B[Lexical Analyzer] B --> C[Tokens Stream] C --> D[Parser] D --> E1[Parse Tree / Syntax Tree] D --> E2[Error Handler] E1 --> F[Semantic Analyzer] %% Feedback Loop E2 -->|Syntax Errors| D %% Extra Notes classDef block fill:#f9f,stroke:#333,stroke-width:2px; class A,B,C,D,E1,E2,F block;
2. Top-Down Parsers
Top-down parsers begin constructing the parse tree from the start symbol (root) and proceed by expanding the leftmost nonterminal until they match the entire input string.
They simulate a leftmost derivation and are used in tools like syntax checkers, compilers, and interpreters where early error detection and simple parsing logic are desirable.
2.1 Recursive Descent Parsing
This is a top-down parser built from a set of mutually recursive procedures, where each procedure implements one of the nonterminals of the grammar.
It is conceptually similar to reading a storybook where each chapter calls another until the story ends or a contradiction (error) arises.
2.1.1 Characteristics
- Manual implementation: Easy to write and debug.
- Backtracking: May be required unless grammar is suitable for predictive parsing.
- Grammar requirement: Must eliminate left recursion and be left-factored.
2.1.2 Example
Consider the grammar:
S → c A d
A → a b | a
Recursive descent approach for this:
void S() {
if (lookahead == 'c') {
match('c'); A(); match('d');
} else error();
}
void A() {
if (lookahead == 'a') {
match('a');
if (lookahead == 'b') match('b'); // for A → a b
// else match only a
} else error();
}
2.2 Predictive Parsing (LL(1))
Predictive parsers are a special case of recursive-descent parsers that do not require backtracking.
They use lookahead (typically 1 symbol) and a parsing table to choose the correct production rule based on the current nonterminal and lookahead symbol.
2.2.1 Table-Driven Algorithm
Uses an explicit stack and parsing table:
Input: string w and parsing table M
Output: leftmost derivation or error
initialize stack with [$, S]
while (stack not empty):
X = top of stack
a = current input symbol
if X == a:
pop stack, advance input
else if X is terminal:
error()
else if M[X, a] == X → Y1Y2…Yk:
pop stack
push Yk…Y1 onto stack
else:
error()
2.2.2 Example Grammar
E → T E'
E' → + T E' | ε
T → F T'
T' → * F T' | ε
F → (E) | id
Parsing id + id * id
results in:
Stack Input Action
E$ id+id*id$ E → T E'
T E'$ id+id*id$ T → F T'
F T' E'$ id+id*id$ F → id
id T' E'$ id+id*id$ match id
T' E'$ +id*id$ T' → ε
E'$ +id*id$ E' → + T E'
+ T E'$ +id*id$ match +
T E'$ id*id$ ...
2.3 Advantages and Disadvantages
2.3.1 Advantages
- Simplicity: Easy to implement by hand, especially for small grammars.
- Error detection: Early syntax error reporting due to top-down nature.
- Direct correlation: Code closely follows grammar rules.
2.3.2 Disadvantages
- Grammar restrictions: Cannot handle left-recursive grammars.
- Lookahead limitation: LL(1) requires disjoint FIRST sets; otherwise, predictive parsing table fails.
- Backtracking overhead: Without lookahead tables, recursive descent can become inefficient.
2.4 Real-World Scenario
Consider voice-command interfaces like "Turn on the fan and light." The parser needs to understand command structure using grammar like:
Cmd → Action Object
Action → turn on | switch on
Object → fan | light | fan and light
A top-down parser allows us to quickly construct this interpretation with clear rules and minimal ambiguity.
3. Bottom-Up Parsers
Bottom-up parsers build the parse tree from leaves (tokens) toward the root (start symbol). Instead of predicting what to do next, they reduce recognized patterns to non-terminals — this is called Shift-Reduce Parsing.
They simulate a rightmost derivation in reverse and are widely used in parser generators like Yacc, Bison, etc.
3.1 Shift-Reduce Parsing
Shift-reduce parsing is the foundation of all bottom-up parsers. It operates using a stack and two basic operations:
- Shift: Push the next input symbol onto the stack.
- Reduce: Replace a sequence of stack symbols matching the right-hand side of a grammar rule with its left-hand side non-terminal.
3.1.1 Handle
A handle is a substring matching the right-hand side of a production that can be reduced. Think of it as the most immediate piece of a sentence that can be "rewritten" in reverse using grammar rules.
3.1.2 Example
Grammar:
E → E + T | T
T → T * F | F
F → (E) | id
Input: id * id
Reductions:
id * id
→ F * id (F → id)
→ T * id (T → F)
→ T * F (F → id)
→ T (T → T * F)
→ E (E → T)
This reverse derivation builds the parse tree bottom-up.
3.2 LR Parsers (Left-to-right, Rightmost derivation)
LR parsers are powerful bottom-up parsers. They decide when to shift or reduce based on states and lookahead tokens.
3.2.1 Types of LR Parsers
- SLR (Simple LR): Uses FOLLOW sets. Easy but less powerful.
- LR(1): Full power with 1-token lookahead.
- LALR: Merges similar states from LR(1) to reduce table size. Used by most parser generators (e.g., Yacc).
3.2.2 Components
- Stack: Holds states and grammar symbols.
- Input Buffer: Stores the string to be parsed.
- Parsing Table: Consists of ACTION and GOTO tables.
3.2.3 Parsing Table Actions
- Shift s: Push state
s
and input symbol. - Reduce A → α: Pop |α| symbols and states, push
A
. - Accept: Parsing successful.
- Error: Input does not match grammar.
Stack Input Action
0 id*id$ shift 5
0 5 *id$ reduce F → id
0 3 *id$ reduce T → F
0 2 *id$ shift 7
0 2 7 id$ shift 5
0 2 7 5 $ reduce F → id
0 2 7 3 $ reduce T → T * F
0 2 $ reduce T → T * F
0 $ reduce E → T
0 1 $ accept
This is the working of a shift-reduce LR parser.
3.3 Advantages and Disadvantages
3.3.1 Advantages
- Handles a larger class of grammars: Including left-recursive and ambiguous-looking ones.
- Tool-friendly: Automatic tools like Yacc can generate LR parsers.
- Error detection is robust: Errors are caught closer to the actual point of error in input.
3.3.2 Disadvantages
- Complexity: LR tables are hard to construct by hand.
- Size: Canonical LR tables can be very large (many states).
- Opaque logic: Harder to debug and trace than recursive-descent parsers.
3.4 Real-World Scenario
Imagine a barcode scanner in a supermarket. Every scanned symbol adds to the queue (shift). Once the pattern matches a known product (handle), it reduces to that product. Bottom-up parsing is like recognizing objects from pieces as they are assembled.