Lexical Analysis - CSU086 | Shoolini University

Lexical Analysis (Scanner) Phase of Compiler

What is Lexical Analysis?

Lexical Analysis is the first phase of a compiler where raw source code is scanned and broken into meaningful units called tokens. It reads the program character by character, grouping them into lexemes (words) and assigning them token classifications. Think of it like breaking a paragraph into words and punctuation before analyzing its meaning.

✅ Key Role: Converts messy human-written code into structured components for easier processing by later compiler stages.

🔎 Example:
Consider the source code:

int age = 25;

After lexical analysis, it becomes a stream of tokens:

int  -> Keyword
age  -> Identifier
=    -> Operator
25   -> Number
;    -> Delimiter

graph TD;
    A[Source Code] -->|Character Stream| B["Lexical Analysis (Scanner)"]
    B -->|Token Stream| C["Syntax Analysis (Parser)"]
    %% Error Handling
    B -.->|Lexical Errors: Unrecognized Symbols| Error1[Error Reporting]
    C -.->|Syntax Errors: Incorrect Grammar| Error2[Error Reporting]
    %% Supporting Components
    B --> ST["Symbol Table (Stores Identifiers, Keywords)"]
    B --> FA["Finite Automata (Token Recognition)"]
    C --> ST
Figure: Where the scanner sits in the compiler pipeline: it turns the character stream into a token stream for the parser, reports lexical errors, and records identifiers in the symbol table.

Why is Lexical Analysis Needed?

Lexical analysis is necessary because a computer cannot work directly with raw source text the way humans read it.
Without a separate lexical phase, the rest of the compiler would have to:

  1. Filter out irrelevant elements like spaces, tabs, and comments itself.
  2. Re-scan the input character by character instead of working with already-recognized tokens.
  3. Discover simple character-level errors late in compilation instead of catching them early.

🔎 Analogy:
Imagine a grocery list written with extra spaces, comments, and unnecessary characters:

# Buy these items
    Milk     1ltr     // Get fresh milk
    Eggs     12 // Buy eggs with extra space
            

Lexical analysis cleans this up to just:

Milk 1ltr
Eggs 12

This helps in further processing the data efficiently.

Components of Lexical Analysis

Lexical analysis has three core components:

1️⃣ Lexeme: A sequence of characters that matches a pattern and forms a meaningful unit.

2️⃣ Token: A classification assigned to a lexeme (like a label).

3️⃣ Pattern: A rule or regular expression that defines how lexemes are recognized.

🔎 Example Breakdown:
For this code:

float price = 99.99;

| Lexeme | Token Type | Pattern |
| --- | --- | --- |
| float | Keyword | float |
| price | Identifier | [a-zA-Z_][a-zA-Z0-9_]* |
| = | Operator | = |
| 99.99 | Number | [0-9]+(\.[0-9]+)? |
| ; | Delimiter | ; |

Steps in Lexical Analysis

Lexical analysis works as an automated scanner that processes source code step-by-step.

📌 Step 1: Read Characters from Input

📌 Step 2: Recognize Lexemes and Match Patterns

📌 Step 3: Convert Lexemes into Tokens

📌 Step 4: Ignore Whitespaces and Comments

📌 Step 5: Handle Errors

🔎 Example Walkthrough (Processing int x = 10;)

| Step | Character Read | Recognized Lexeme | Token Output |
| --- | --- | --- | --- |
| 1 | i | int | Keyword |
| 2 | x | x | Identifier |
| 3 | = | = | Operator |
| 4 | 1 | 10 | Number |
| 5 | ; | ; | Delimiter |

Handling Whitespaces, Newlines, and Comments

Lexical analyzers automatically remove unnecessary characters before further processing.

✅ Whitespaces: Skipped unless they appear inside string literals ("Hello World").
✅ Newlines: Ignored unless they are syntactically significant (e.g., Python's indentation and line breaks).
✅ Comments: Removed entirely before tokenization.

🔎 Example (Before & After Lexical Analysis)

Source Code (Before Processing)
// This is a comment
int a = 5;    // Variable declaration
float b = 3.14;
After Lexical Analysis (Cleaned Up)
int a = 5;
float b = 3.14;

Example: Implementing a Simple Lexical Analyzer in Python

This Python script simulates a basic lexer using regular expressions.

import re

# Define token patterns
token_patterns = [
    (r'(int|float|char)\b', 'KEYWORD'),  # \b stops 'int' from matching the start of identifiers like 'integer'
    (r'[a-zA-Z_][a-zA-Z0-9_]*', 'IDENTIFIER'),
    (r'[0-9]+(\.[0-9]+)?', 'NUMBER'),
    (r'=', 'OPERATOR'),
    (r';', 'DELIMITER'),
    (r'\s+', None)  # Ignore whitespaces
]

def lexical_analyzer(code):
    tokens = []
    while code:
        matched = False
        for pattern, token_type in token_patterns:
            regex = re.compile(pattern)
            match = regex.match(code)
            if match:
                lexeme = match.group(0)
                if token_type:  # Ignore whitespaces
                    tokens.append((lexeme, token_type))
                code = code[len(lexeme):]  # Move to next lexeme
                matched = True
                break
        if not matched:
            raise ValueError(f"Unexpected character: {code[0]}")
    return tokens

# Test lexer
code_sample = "int age = 25;"
tokens = lexical_analyzer(code_sample)
for lexeme, token in tokens:
    print(f"{lexeme} -> {token}")

✅ Output:


int -> KEYWORD
age -> IDENTIFIER
= -> OPERATOR
25 -> NUMBER
; -> DELIMITER

Tokenization

Tokenization is the process of breaking down a source program into a sequence of meaningful tokens. Each token is a fundamental unit of the language, representing a logical component like a keyword, identifier, or operator.

Token Structure

Each token consists of:

  1. A token name: the category of the lexeme (e.g., KEYWORD, IDENTIFIER, OPERATOR, LITERAL).
  2. An optional attribute value: extra information such as a literal's value or a pointer to the identifier's symbol-table entry.

A token is therefore often written as the pair <token-name, attribute-value>.

Types of Tokens

Tokens are classified into different categories:

Keywords

Keywords are reserved words in a programming language that have special meanings and cannot be used as identifiers.

Identifiers

Identifiers are user-defined names for variables, functions, and objects.

Literals

Literals are fixed values assigned to variables, such as numbers, characters, and strings.

Operators

Operators perform computations and comparisons. They are categorized as:

  1. Arithmetic operators (+, -, *, /, %)
  2. Relational operators (<, <=, >, >=, ==, !=)
  3. Logical operators (&&, ||, !)
  4. Assignment operators (=, +=, -=)
  5. Bitwise operators (&, |, ^, <<, >>)

Special Symbols

Special symbols are punctuation marks or delimiters that structure the code.

Tokenization Process

The lexical analyzer (scanner) reads the source program character by character and groups them into tokens using:

  1. Regular expressions that define the pattern for each token class.
  2. Finite automata (DFAs/NFAs) that implement those patterns efficiently.
  3. The longest-match (maximal munch) rule to decide where one lexeme ends and the next begins.
  4. Symbol-table lookups to distinguish keywords from identifiers and to record identifier attributes.

Example Tokenization

For the input code:

int sum = 42 + 8;

The lexer produces tokens:

| Lexeme | Token Name | Attribute |
| --- | --- | --- |
| int | KEYWORD | int |
| sum | IDENTIFIER | Pointer to symbol table |
| = | OPERATOR | = |
| 42 | LITERAL | Integer |
| + | OPERATOR | + |
| 8 | LITERAL | Integer |
| ; | SPECIAL_SYMBOL | Statement Terminator |

Challenges in Tokenization

Typical difficulties include distinguishing keywords from identifiers, applying the longest match to operators that share a prefix (such as < and <=), handling string literals and comments that may span several characters or lines, and recovering sensibly from characters that match no pattern. These issues are treated in detail in the sections on lexical errors and common pitfalls below.

Regular Expressions & Finite Automata

Lexical analyzers use Regular Expressions (RE) and Finite Automata (FA) to define and recognize token patterns in programming languages. These mathematical models efficiently describe how characters are grouped into meaningful tokens, ensuring correct identification of keywords, literals, identifiers, and operators.

Regular Expressions (RE)

A Regular Expression (RE) is a sequence of characters that defines a search pattern. It is a formal way to describe patterns in strings and is widely used in lexical analysis.

Components of Regular Expressions

  1. Alphabet symbols: literal characters such as a, 0, or _.
  2. Concatenation: ab matches a followed by b.
  3. Alternation (|): a|b matches either a or b.
  4. Kleene star (*): a* matches zero or more repetitions of a.
  5. Positive closure (+) and optionality (?): a+ matches one or more, a? matches zero or one.
  6. Character classes: [a-z] matches any lowercase letter.

Example Regular Expressions

| Pattern | Description | Example Matches |
| --- | --- | --- |
| `[a-zA-Z_][a-zA-Z0-9_]*` | Valid Identifiers | var_name, _temp, count1 |
| `[0-9]+` | Integer Numbers | 42, 123, 9999 |
| `[0-9]*\.[0-9]+` | Floating-point Numbers | 3.14, 0.5, .75 |
| `"([^"\\]\|\\.)*"` | String Literals | "hello", "C Compiler" |
| `(if\|while\|for\|return)` | Keywords | if, while, return |
| `\+\+\|\-\|\*\|\/` | Operators | ++, -, *, / |
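
As a quick sanity check, the short sketch below (using Python's built-in re module; the pattern strings are taken from the table above, everything else is illustrative) classifies a few sample lexemes. The keyword pattern is tried before the identifier pattern so that while is not mis-tagged as an identifier.

import re

# Patterns from the table above; keywords are tried before identifiers on purpose.
patterns = [
    ('KEYWORD',    r'(if|while|for|return)'),
    ('IDENTIFIER', r'[a-zA-Z_][a-zA-Z0-9_]*'),
    ('FLOAT',      r'[0-9]*\.[0-9]+'),
    ('INTEGER',    r'[0-9]+'),
]

for lexeme in ["var_name", "42", "3.14", "while"]:
    for name, pattern in patterns:
        if re.fullmatch(pattern, lexeme):   # the whole lexeme must match the pattern
            print(f"{lexeme!r} -> {name}")
            break

# var_name -> IDENTIFIER, 42 -> INTEGER, 3.14 -> FLOAT, while -> KEYWORD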

Finite Automata (FA)

Finite Automata are abstract machines used to implement lexical analyzers that recognize patterns defined by regular expressions. There are two main types:

Deterministic Finite Automata (DFA)

A Deterministic Finite Automaton (DFA) is a state machine where each state has exactly one transition per input symbol.


stateDiagram
    [*] --> S0
    S0 --> S1 : [a-zA-Z_]
    S1 --> S1 : [a-zA-Z0-9_]
    S1 --> [*] : Accept (Identifier)

Explanation:

  1. The automaton starts in state S0.
  2. It moves to S1 only if the first character is a letter or an underscore.
  3. In S1 it loops on further letters, digits, and underscores.
  4. When no more characters fit, it stops in S1 (the accepting state) and reports the lexeme as an identifier.
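
A minimal sketch of how this identifier DFA can be simulated in code (the state names follow the diagram above; the rest is an illustrative assumption):

def is_identifier(lexeme):
    """Simulate the DFA above: S0 --letter/_--> S1, S1 --letter/digit/_--> S1."""
    state = "S0"
    for ch in lexeme:
        if state == "S0" and (ch.isalpha() or ch == "_"):
            state = "S1"
        elif state == "S1" and (ch.isalnum() or ch == "_"):
            state = "S1"
        else:
            return False          # no transition defined for this character: reject
    return state == "S1"          # S1 is the only accepting state

print(is_identifier("_temp"))     # True
print(is_identifier("1value"))    # False: cannot start with a digit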

Nondeterministic Finite Automata (NFA)

A Nondeterministic Finite Automaton (NFA) allows multiple transitions for the same input, meaning multiple states can be active at once.


graph TD
    A[Start] -->|a-z, A-Z, _| B
    B -->|a-z, A-Z, 0-9, _| B
    B -->|ε| Accept

Converting NFA to DFA

An NFA can be converted into an equivalent DFA using the Subset Construction Algorithm. This process removes ambiguity by constructing deterministic states from multiple NFA states.
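
A compact sketch of subset construction under simplifying assumptions (no ε-transitions; the NFA is given as a dict mapping (state, symbol) pairs to sets of states):

def nfa_to_dfa(nfa, start, alphabet):
    """Subset construction: every DFA state is a frozenset of NFA states."""
    start_set = frozenset([start])
    dfa, worklist = {}, [start_set]
    while worklist:
        current = worklist.pop()
        if current in dfa:
            continue
        dfa[current] = {}
        for symbol in alphabet:
            # Union of all NFA states reachable from 'current' on 'symbol'
            target = frozenset(s for state in current
                                 for s in nfa.get((state, symbol), set()))
            dfa[current][symbol] = target
            if target and target not in dfa:
                worklist.append(target)
    return dfa

# NFA for identifiers: A --letter--> B, B --letter/digit--> B
nfa = {("A", "letter"): {"B"}, ("B", "letter"): {"B"}, ("B", "digit"): {"B"}}
for state, moves in nfa_to_dfa(nfa, "A", ["letter", "digit"]).items():
    print(set(state), moves)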

NFA vs. DFA: Comparison

| Feature | NFA | DFA |
| --- | --- | --- |
| Number of transitions per state | Multiple allowed | One per input |
| Backtracking | Possible | Not required |
| Speed | Slower | Faster |
| Memory Usage | Lower | Higher |
| Conversion Complexity | Simple to construct | Requires subset construction |

Using FA in Lexical Analysis

In practice, the regular expression for each token class is converted to an NFA (Thompson's construction), the NFAs are combined and converted to a single DFA by subset construction, the DFA is minimized, and the resulting transition table drives the scanner. Scanner generators such as Lex/Flex automate this pipeline.

Industry Applications

Lexical Errors

Lexical errors occur when an input sequence of characters does not match any valid token pattern defined by the programming language's lexical rules. These errors are detected at the earliest stage of compilation (lexical analysis) and can prevent further processing.

Importance of Detecting Lexical Errors

Catching these errors in the scanner gives the programmer early, precise feedback (the line and column of the offending character) and stops the parser from having to work on a corrupted token stream.

Causes of Lexical Errors

Lexical errors arise due to:

  1. Typing mistakes such as stray characters or misspelled tokens.
  2. Characters that are not part of the language's character set.
  3. Violations of token-formation rules (identifiers starting with a digit, malformed numbers, invalid escape sequences).
  4. Missing closing delimiters for strings or comments.

Common Types of Lexical Errors

Illegal Characters

Characters that are not part of the programming language's character set result in lexical errors.

Examples

// Incorrect: '@' is not part of the language's character set
int x = 10 @ 5;  // Error: Illegal character '@'

Invalid Identifiers

Identifiers must follow specific naming conventions:

  1. They must begin with a letter or an underscore.
  2. They may contain only letters, digits, and underscores after the first character.
  3. They must not be reserved keywords.

Examples

// Incorrect
int 1value = 10;  // Error: Identifier cannot start with a digit

// Incorrect
int if = 5;  // Error: 'if' is a reserved keyword
Unclosed String Literals

String literals must start and end with the correct delimiter, such as " or '. If the closing delimiter is missing, a lexical error occurs.

Examples

// Incorrect: Missing closing quote
printf("Hello world);  // Error: Unterminated string literal

// Incorrect: Unmatched single quote
print('Welcome)  // Error: Unclosed string literal
Invalid Numeric Literals

Numbers must follow proper formatting rules.

Examples

// Incorrect: two decimal points in one literal
float x = 3..14;  // Error: Invalid numeric literal

Invalid Escape Sequences

Escape sequences are used inside string literals to represent special characters, but an incorrect escape sequence can cause a lexical error.

Examples

// Incorrect: \q is not a valid escape sequence
System.out.println("Hello \q World");  // Error

# Incorrect: \U starts a Unicode escape but is incomplete (Python)
print("C:\Users\John")  # Error: Incomplete Unicode escape sequence
Unterminated Comments

Multi-line comments must be properly closed; otherwise, the lexer will reach the end of the file without finding the closure.

Examples

/* This is a comment 
printf("Hello, World!");  // Error: Missing '*/'

Detecting and Handling Lexical Errors

The lexical analyzer (lexer) detects errors using Finite Automata and follows these strategies:

  1. Report the error with its line and column number, then continue scanning.
  2. Skip characters until a valid token can start again (panic mode).
  3. Insert a placeholder error token so later phases can keep working.
  4. Suggest likely corrections for near-miss lexemes such as misspelled keywords.

These strategies are described in detail under "Lexical Error Recovery Techniques" below.

Example: Lexical Error Messages

int 1number = 10;
System.out.println("Hello World);

The compiler output may be:


Error: Invalid identifier '1number' (line 1, column 5)
Error: Unterminated string literal (line 2, column 27)

Real-World Applications

Summary of Lexical Errors

| Lexical Error Type | Description | Example |
| --- | --- | --- |
| Illegal Characters | Characters not in the language. | int x = 10 @ 5; (Error: '@') |
| Invalid Identifiers | Incorrect variable names. | int 1var = 5; (Error: Cannot start with a digit) |
| Unclosed Strings | Missing string delimiters. | print("Hello; (Error: Unterminated string) |
| Invalid Numbers | Incorrect numeric formats. | float x = 3..14; (Error: Double dots) |
| Unterminated Comments | Comment block not closed. | /* This is a comment (Error: Missing '*/') |

Lexical Error Recovery Techniques

Lexical error recovery ensures that minor errors in the source code do not halt the compilation process completely. Instead of stopping execution on encountering an invalid token, modern compilers attempt to recover and continue processing the remaining code. This improves robustness and provides useful feedback to the programmer.

Types of Lexical Errors

Lexical errors occur when a sequence of characters does not match any valid token pattern defined by the language. Common lexical errors include illegal characters, invalid identifiers, unterminated string literals and comments, malformed numeric literals, and invalid escape sequences, as described in the previous section.

Panic Mode Recovery

The simplest and most commonly used method where the lexer discards characters until a valid token is found.

Example

int x = 5 @ 10;

In this case, `@` is an illegal character in C.
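
The scanner can report the stray character, skip it, and resume with the next valid character, so the rest of the statement is still tokenized. A minimal sketch of this behaviour, written as a variant of the earlier regex-based lexer (the pattern set is illustrative):

import re

token_patterns = [
    (r'(int|float|char)\b', 'KEYWORD'),
    (r'[a-zA-Z_][a-zA-Z0-9_]*', 'IDENTIFIER'),
    (r'[0-9]+(\.[0-9]+)?', 'NUMBER'),
    (r'=', 'OPERATOR'),
    (r';', 'DELIMITER'),
    (r'\s+', None),
]

def lexer_with_panic_mode(code):
    tokens, errors = [], []
    while code:
        for pattern, token_type in token_patterns:
            match = re.match(pattern, code)
            if match:
                if token_type:
                    tokens.append((match.group(0), token_type))
                code = code[match.end():]
                break
        else:
            # Panic mode: record the offending character, skip it, keep scanning
            errors.append(f"Illegal character {code[0]!r}")
            code = code[1:]
    return tokens, errors

tokens, errors = lexer_with_panic_mode("int x = 5 @ 10;")
print(tokens)   # '@' is skipped; 5 and 10 are still tokenized
print(errors)   # ["Illegal character '@'"]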

Use Case

Error Token Insertion

The lexer replaces an invalid token with a placeholder (e.g., `<TOKEN_ERROR>`) and continues scanning.

Example
1var = 10

In Python, identifiers cannot start with a digit.

Use Case

Symbol Table Correction

For minor lexical mistakes, the lexer may suggest corrections based on the symbol table.

Example

whil (x < 10) {
    System.out.println(x);
}

The keyword `whil` is misspelled, but the lexer detects it as a probable `while`.
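
One way to sketch this "did you mean" behaviour is with difflib.get_close_matches from Python's standard library, compared against the keyword list (the keyword list here is illustrative):

import difflib

KEYWORDS = ["if", "else", "while", "for", "return", "int", "float"]

def suggest_keyword(lexeme):
    """Suggest the closest reserved word for a probable misspelling."""
    matches = difflib.get_close_matches(lexeme, KEYWORDS, n=1, cutoff=0.8)
    return matches[0] if matches else None

print(suggest_keyword("whil"))   # 'while'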

Use Case

Lookahead Correction

The lexer looks ahead to check if an error can be automatically corrected.

Example

int x = 5.5.3;

This is invalid because `5.5.3` is not a valid floating-point number.
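
With the longest-match rule, the number pattern from the earlier lexer consumes 5.5 and leaves .3 behind, which the scanner can then report or re-scan; this is the split shown in the summary table below. A tiny illustration with Python's re:

import re

number = re.compile(r'[0-9]+(\.[0-9]+)?')   # number pattern from the earlier lexer
text = "5.5.3"
first = number.match(text)
print(first.group(0))       # '5.5'  (longest valid match starting at position 0)
print(text[first.end():])   # '.3'   is left over for the lexer to report or re-scan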

Use Case

Logging & Reporting

Modern compilers maintain error logs instead of immediately stopping execution on lexical errors.

Example

In VS Code or IntelliJ, an incomplete statement like:


console.log("Hello;

Shows a warning: `Unclosed string literal` but allows further typing.

Use Case

Real-World Applications

Summary

| Recovery Technique | Action Taken | Example |
| --- | --- | --- |
| Panic Mode Recovery | Skip invalid tokens until a valid token is found. | `int x = 5 @ 10;` → Skips `@`. |
| Error Token Insertion | Replace invalid tokens with `<TOKEN_ERROR>`. | `1var = 10` → `<TOKEN_ERROR> = 10`. |
| Symbol Table Correction | Suggests possible corrections for mistyped keywords. | `whil (x < 10)` → Suggests `while`. |
| Lookahead Correction | Splits ambiguous tokens correctly. | `int x = 5.5.3;` → Splits into `5.5` and `.3`. |
| Logging & Reporting | Logs lexical errors without stopping execution. | VS Code highlights `console.log("Hello;` as an error. |

Real-World Applications

Lexical Analysis is widely used beyond compiler design. Many real-world systems and software applications depend on efficient tokenization, pattern recognition, and lexical scanning to process and understand textual data. Below are some key areas where lexical analysis plays a crucial role.

Code Syntax Highlighting

Lexical analysis is used in Integrated Development Environments (IDEs) and text editors to recognize different programming constructs and apply appropriate syntax highlighting.

How it works: the editor runs a lightweight lexer over the buffer as you type, classifies each lexeme (keyword, identifier, literal, comment, operator), and maps each token class to a colour or font style. Editors such as VS Code and IntelliJ IDEA apply this continuously.

Example:

Consider this Python snippet in an IDE:


def greet(name):
    print("Hello, " + name + "!")

The lexer categorizes and highlights:

  1. def as a keyword.
  2. greet and name as identifiers.
  3. "Hello, " and "!" as string literals.
  4. +, the parentheses, and the colon as operators and punctuation.

Keyword Extraction in Search Engines

Search engines such as Google and Bing use lexical analysis to extract meaningful words from user queries and match them against indexed web pages.

How it works: the query string is tokenized into words, stop words are filtered out, and the remaining keywords are normalized (for example, lower-cased and stemmed) before being matched against the index.

Example:

Consider a user searching for:


"Best places to visit in Paris during winter"

The lexer processes it as:

  1. Keywords kept: best, places, visit, Paris, winter.
  2. Stop words removed: to, in, during.

Command-Line Interpreters

Lexical analyzers are essential in command-line interpreters like Bash, PowerShell, and Zsh to process user commands efficiently.

How it works: the shell splits the command line into words while respecting quoting rules, then classifies each word as a command name, argument, option/flag, or operator (such as the pipe |) before executing it.

Example:

Consider this Linux command:


grep "error" logfile.txt | sort | uniq -c

The lexer tokenizes the input as:

  1. grep, sort, uniq: command names.
  2. "error": a quoted string argument.
  3. logfile.txt: a file argument.
  4. -c: an option flag.
  5. |: the pipe operator that connects the commands.
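
Real shells do considerably more than this, but Python's standard shlex module gives a feel for whitespace- and quote-aware command tokenization (a rough sketch, not how Bash itself is implemented):

import shlex

command = 'grep "error" logfile.txt | sort | uniq -c'
print(shlex.split(command))
# ['grep', 'error', 'logfile.txt', '|', 'sort', '|', 'uniq', '-c']
# The quotes around "error" are stripped, and '|' survives as its own token
# because it is separated by whitespace.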

Additional Real-World Applications

Common Pitfalls & Debugging

Lexical analysis, while crucial for efficient compilation, has several challenges that can cause incorrect tokenization, performance issues, or incompatibility with modern programming needs. Below are common pitfalls in lexical analysis and their debugging strategies.

Handling Whitespace & Comments Efficiently

Whitespace and comments do not contribute to program execution but must be handled carefully during lexical analysis to avoid inefficiencies.

Problems Caused by Whitespace & Comments

If they are not skipped in the scanner, every later phase must filter them out; a comment rule that is too greedy can swallow real code, one that is not greedy enough lets comment text leak into the token stream, and an unterminated block comment can silently consume the rest of the file.

Strategies for Efficient Handling

Give whitespace and comments their own patterns that emit no token, make line-comment rules stop at the newline, match block comments non-greedily, and report an unterminated block comment as a lexical error.

Example Issue & Debugging

Consider this JavaScript code:


let x =  10;  // This is a comment
console.log("Value: " + x);

If the lexer fails to remove comments properly, the comment may interfere with parsing. A debugging strategy involves:

  1. Printing the token stream and checking that no comment text appears in it.
  2. Verifying that the line-comment rule consumes everything up to, but not past, the end of the line.
  3. Confirming that whitespace and comment rules emit no tokens at all.
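
In a regex-driven lexer like the one sketched earlier, the usual fix is a comment rule that matches to the end of the line and produces no token, for example (r'//[^\n]*', None) placed ahead of the other rules. A quick way to confirm the rule's extent is to apply it directly (a sketch using Python's re):

import re

code = 'let x =  10;  // This is a comment\nconsole.log("Value: " + x);'

# The same pattern used as a discard rule in the lexer: consume '//' up to,
# but not including, the end of the line.
stripped = re.sub(r'//[^\n]*', '', code)
print(stripped)
# let x =  10;
# console.log("Value: " + x);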

Resolving Ambiguous Tokens

Lexers must correctly distinguish between tokens that share a prefix (e.g., < versus <=).

Why Ambiguity Occurs

Many operators share a common prefix (for example < and <=, or = and ==), so after reading < the lexer cannot classify the token until it has looked at the next character. The standard rule is maximal munch: take the longest lexeme that still matches some pattern.

Example of Token Ambiguity

Consider this C++ expression:


if (x <= y)
if (x < = y)

The first line is recognized as the single token <=. In the second line, the space forces the lexer to emit two separate tokens, < and =, which most languages then reject at the syntax level; the lexer's real decision point is whether to merge the two characters when they are adjacent.

Debugging Strategy

Check that the scanner uses one character of lookahead for operators and applies the longest-match rule; a quick test is to tokenize both x <= y and x < = y and compare the resulting token streams.

Solution in a DFA

To resolve `<` vs. `<=`, the DFA transitions should be structured as:


stateDiagram
    [*] --> LESS : '<'
    LESS --> LESSEQUAL : '='
    LESS --> [*] : Accept '<'
    LESSEQUAL --> [*] : Accept '<='
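
The same longest-match idea can be expressed without an explicit DFA by trying longer operator spellings first (a sketch with a small illustrative operator set):

OPERATORS = ["<=", ">=", "==", "!=", "<", ">", "="]   # longer spellings listed first

def match_operator(text, pos):
    """Return the longest operator starting at pos, or None (maximal munch)."""
    for op in OPERATORS:
        if text.startswith(op, pos):
            return op
    return None

print(match_operator("x <= y", 2))   # '<=' (not '<' followed by '=')
print(match_operator("x < = y", 2))  # '<'  (the space prevents merging)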

Dealing with Unicode & Multi-Language Lexing

Modern compilers need to support Unicode characters, which introduces complexity in tokenization.

Challenges in Unicode Handling

The input alphabet is no longer a small ASCII set, so character classes like [a-zA-Z] are not enough; identifiers may contain letters from any script, the same character can have several encoded forms (normalization), and multi-byte characters must not be split during scanning.
Examples of Unicode in Identifiers

Valid variable names in Python:


ε˜ι‡ = 100  # Chinese variable name
Ο€ = 3.14    # Greek letter as identifier
Debugging Strategies

Normalize the source text (for example, to NFC) before lexing, use Unicode-aware character classes or the language's own identifier rules instead of hard-coded ASCII ranges, and add test cases that contain non-ASCII identifiers.
Example Regex for Unicode Identifiers

[\p{L}_][\p{L}\p{N}_]*

This allows letters from any language in identifiers.
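
Note that \p{L} needs a Unicode-aware regex engine; Python's built-in re module does not support it (the third-party regex package does). Python itself exposes an equivalent Unicode-aware rule through str.isidentifier(), which can serve as a quick check:

# str.isidentifier() applies Python's own Unicode-aware identifier rules
print("变量".isidentifier())      # True  (Chinese variable name)
print("π".isidentifier())         # True  (Greek letter)
print("1value".isidentifier())    # False (cannot start with a digit)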

Real-World Applications

Advanced Optimization for Industry Use

Optimizing lexical analysis is essential in industry applications where performance, memory efficiency, and speed are critical. Techniques such as DFA minimization and Just-In-Time (JIT) lexing help reduce overhead and improve real-time execution in modern compilers and interpreters.

DFA Minimization for Fast Scanning

Deterministic Finite Automaton (DFA) minimization is an optimization technique that reduces the number of states in a lexical analyzer, making token recognition faster and more efficient.

Why DFA Minimization Matters

The combined DFA for all of a language's token patterns can contain a large number of states; fewer states mean smaller transition tables, better cache behaviour, and faster scanning.
Steps in DFA Minimization

Hopcroft's algorithm is widely used for minimizing DFAs:

  1. Start with two groups of states: accepting and non-accepting.
  2. Repeatedly split a group whenever its states disagree, for some input symbol, about which group they move to.
  3. Stop when no group can be split further; each remaining group becomes a single state of the minimized DFA.

Example: Minimizing a DFA

Consider a DFA that recognizes the keywords "if" and "int".

Before Minimization:

stateDiagram
    [*] --> S0
    S0 --> S1 : 'i'
    S1 --> S2 : 'f'
    S1 --> S3 : 'n'
    S3 --> S4 : 't'
    S2 --> [*] : Accept 'if'
    S4 --> [*] : Accept 'int'
After Minimization:

stateDiagram
    [*] --> S0
    S0 --> S1 : 'i'
    S1 --> S2 : 'f'
    S1 --> S3 : 'n'
    S3 --> S2 : 't'
    S2 --> [*] : Accept ('if' / 'int')

The two accepting states of the original DFA (S2 and S4) are equivalent, so they can be merged into a single accepting state. This reduces the number of states and transitions, improving performance.
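
A minimal sketch of the state-merging idea using partition refinement (Moore's algorithm, a simpler relative of Hopcroft's; the DFA encoding below, including the explicit dead state D, is an illustrative assumption):

def minimize_dfa(states, alphabet, delta, accepting):
    """Group states into equivalence classes; states in the same class can be merged."""
    block = {s: int(s in accepting) for s in states}   # start: accepting vs. non-accepting
    while True:
        # Two states stay together only if they sit in the same block and
        # move to the same blocks on every input symbol.
        sig = {s: (block[s],) + tuple(block[delta[s][a]] for a in alphabet) for s in states}
        ids = {v: i for i, v in enumerate(sorted(set(sig.values())))}
        new_block = {s: ids[sig[s]] for s in states}
        if len(set(new_block.values())) == len(set(block.values())):
            return new_block
        block = new_block

# DFA for the keywords "if" and "int" from the diagrams above (D is a dead state)
alphabet = ["i", "f", "n", "t"]
delta = {
    "S0": {"i": "S1", "f": "D", "n": "D", "t": "D"},
    "S1": {"i": "D", "f": "S2", "n": "S3", "t": "D"},
    "S2": {c: "D" for c in alphabet},
    "S3": {"i": "D", "f": "D", "n": "D", "t": "S4"},
    "S4": {c: "D" for c in alphabet},
    "D":  {c: "D" for c in alphabet},
}
blocks = minimize_dfa(["S0", "S1", "S2", "S3", "S4", "D"], alphabet, delta, {"S2", "S4"})
print(blocks)   # S2 and S4 end up in the same block, i.e. they can be merged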

Real-World Applications

Implementing Just-In-Time (JIT) Lexing for Performance

Just-In-Time (JIT) lexing is an on-demand tokenization technique where lexing occurs only when required, rather than preprocessing the entire source code in advance.

Why JIT Lexing is Useful

Interactive tools such as REPLs, interpreters, and IDEs often touch only a small part of the source at a time; tokenizing everything up front wastes time and memory on code that may never execute or that will change before it is needed.

How JIT Lexing Works

The lexer keeps a cursor into the source and produces the next token only when the parser or interpreter asks for one, typically caching results so that already-scanned regions are not tokenized again.
Example: Traditional vs. JIT Lexing
Traditional Lexing (Static Compilation):

int main() {
    int x = 10;
    if (x > 5) {
        printf("Hello");
    }
}

The traditional lexer scans all tokens before parsing begins.

JIT Lexing (Dynamic Interpretation):

if condition():
    execute()

With JIT lexing, tokens are produced on demand: the call to `execute()` is not tokenized until `condition()` has been evaluated and the branch is actually taken.
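
One way to approximate on-demand tokenization in Python is a generator-based lexer: tokens are produced lazily, one at a time, as the consumer asks for them (a sketch with an illustrative pattern set, not how CPython's tokenizer works):

import re

TOKEN_RE = re.compile(r'\s*(?:(?P<NAME>[A-Za-z_]\w*)|(?P<NUMBER>\d+)|(?P<PUNCT>[():=<>+-]))')

def lazy_tokens(source):
    """Yield one token at a time; nothing beyond the consumer's demand is scanned."""
    pos = 0
    while pos < len(source):
        m = TOKEN_RE.match(source, pos)
        if not m:
            break                      # unrecognised character or end of useful input
        pos = m.end()
        kind = m.lastgroup
        yield (kind, m.group(kind))

stream = lazy_tokens("if condition(): execute()")
print(next(stream))   # ('NAME', 'if')        -- only this much has been scanned so far
print(next(stream))   # ('NAME', 'condition')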

Real-World Implementations