Code Optimization Code Generation

Introduction to Code Optimization and Code Generation

Code generation is the compiler phase that translates an intermediate representation (e.g. three-address code) into target machine code. Code optimization is the preceding phase that transforms the IR to run faster or use less space without changing its meaning. Example: Starting from

t1 = b * c  
t2 = b * c 
x = t1 + t2

an optimizer would merge the repeated multiplication, yielding

t1 = b * c  
x = t1 + t1

so the code generator emits fewer instructions.

Basic Blocks & Flow Graphs

A basic block is a maximal sequence of consecutive statements or instructions in which control enters at the beginning and leaves at the end without halting or branching except at the end. To identify basic blocks, you mark:

Entry leaders: the first instruction, any target of a jump, and any instruction following a jump.
Boundaries: each leader begins a new block, which ends just before the next leader.

A control-flow graph (CFG) represents these blocks as nodes and adds directed edges for possible transfers of control (e.g., fall-through, conditional and unconditional jumps). CFGs are the foundation for data‐flow and optimization analyses.

Example:

1: if (x < 0) goto L2  
2: y = x + 1  
3: goto L3  
L2: y = 0  
L3: print(y)

Blocks:
- B1 = {1}
- B2 = {2, 3}
- B3 = {L2}
- B4 = {L3}
Edges: B1→B2 (when x≥0), B1→B3 (when x<0), B2→B4, B3→B4.

This CFG enables optimizations like dead‐code elimination and live‐variable analysis by tracking how data flows between blocks.

Directed Acyclic Graphs (DAGs)

A DAG for a basic block is a graph with no cycles that captures computations and shares common sub-expressions. Nodes represent either operators (e.g., +, *) or leaf operands (variables/constants). Edges point from operator nodes to their operand nodes. By reusing existing nodes for identical operations, a DAG uncovers and removes redundant computations within the block.

Building a DAG:

For each operand, create or reuse its node.
For each operation, check if a node with the same operator and operand pointers exists—if yes, reuse it; otherwise, create a new node.
Record the mapping from variables to nodes.

Example:

t1 = b + c  
t2 = b + c  
t3 = t1 * t2

The DAG has:

One ‘+’ node with children b, c.
One ‘*’ node whose two children both point to that same ‘+’ node. Code generated from this DAG evaluates b+c only once, eliminating the duplicate addition.

Principle Sources of Optimization: Loop Optimization

Loops often dominate execution time, so compilers focus on minimizing work inside loop bodies and reducing control overhead. Principal loop‐focused optimizations include:

Loop-Invariant Code Motion: Move calculations whose results don’t change across iterations to just before the loop.
Induction-Variable Elimination: Replace arithmetic on loop-index‐derived variables with simpler updates.
Common Sub-expression Elimination: Compute repeated expressions once and reuse the result.
Loop Unrolling: Duplicate the loop body several times to reduce the number of branch and test instructions.
Loop Fusion (Jamming): Merge adjacent loops with the same bounds to improve data locality and reduce overhead.

Applying these reduces instruction counts, cut down on pipeline stalls, and often enhances cache performance, yielding significant speedups for tight loops.

Eliminating Induction Variable

An induction variable changes by a fixed amount each loop iteration, typically used in address computations or simple counters. Multiplications or complex calculations on it can be replaced by simpler increments to reduce expensive operations.

Technique:

Introduce a new variable initialized outside the loop.
Update it by a constant addition each iteration instead of recomputing via multiplication.

Example:

// Original
for (i = 0; i < n; i++) {
  addr = base + 4*i;
  A[i] = M[addr];
}

// After eliminating induction variable
p = base;
for (i = 0; i < n; i++) {
  A[i] = M[p];
  p += 4;
}

Here, computing 4*i inside the loop (a multiplication) is replaced by a simple p += 4 addition, reducing the per‐iteration cost and improving performance.

Eliminating Common Sub-expression

A common sub-expression is an expression repeated more than once with invariant operands. Computing it only once and reusing the result avoids redundant work.

Technique:

Introduce a temporary to hold the repeated expression.
Replace subsequent occurrences with the temporary.

Example:

// Original
x = a + b + c;
y = a + b - d;

// After elimination
t = a + b;      // common sub-expression
x = t + c;
y = t - d;

The addition a + b is calculated only once instead of twice. This both reduces instruction count and can enable further optimizations like code motion if placed outside a loop.

Loop Unrolling

Loop unrolling replicates the loop body multiple times to decrease the number of branch tests and jumps, trading code size for speed. It exposes more straight-line code for instruction-level parallelism.

Technique:

Choose an unroll factor k based on loop trip count and register availability.
Replace the original loop with one that increments by k and contains k copies of the body.
Handle any leftover iterations with a separate loop or conditional tail.

Example:

// Original
sum = 0;
for (i = 0; i < 8; i++)
  sum += A[i];

// Unrolled by 4
sum = 0;
for (i = 0; i < 8; i += 4) {
  sum += A[i];
  sum += A[i+1];
  sum += A[i+2];
  sum += A[i+3];
}

This cuts the number of loop‐control branches from eight to two, reducing branch overhead and allowing better pipelining.

Loop Jamming (Loop Fusion)

Loop jamming, also known as fusion, merges multiple loops iterating over the same index range into one loop. This reduces loop overhead, improves cache utilization, and may unlock cross‐iteration optimizations.

Technique:

Verify that loops have the same bounds and that fusing does not change dependences.
Combine their bodies in sequence under one loop header.

Example:

// Before fusion
for (i = 0; i < n; i++)
  X[i] = A[i] + B[i];
for (i = 0; i < n; i++)
  Y[i] = C[i] * D[i];

// After fusion
for (i = 0; i < n; i++) {
  X[i] = A[i] + B[i];
  Y[i] = C[i] * D[i];
}

This halves the loop‐entry and branch instructions, and by touching arrays in one pass, it leverages temporal locality and can reduce cache misses.

Peephole Optimization

A peephole optimizer examines short sequences (“windows”) of generated machine instructions and applies local transformations to simplify them. It uses pattern‐matching rules to eliminate inefficiencies without global analysis.

Common Transformations:

Remove instructions with no effect (ADD R1, #0).
Delete redundant loads/stores (LOAD R1, X; LOAD R1, X).
Eliminate canceling moves (MOV R1, R2; MOV R2, R1).
Fold small sequences into single complex instructions if the architecture supports it (ADD [R1], #4 instead of separate ADD and STORE).

Example:

MOV R1, R2
MOV R2, R1

Both moves undo one another and can be deleted entirely. Peephole optimization shrinks code size and removes needless operations, improving runtime and memory usage.

Issues in the Design of Code Generator

A code generator must translate the intermediate representation into efficient machine code while respecting target‐machine constraints. Key challenges include:

Instruction Selection: Mapping IR operations to target instructions, choosing between multiple addressing modes and complex CISC vs simple RISC forms.
Register Allocation: Assigning a limited set of physical registers to many temporaries, deciding when to spill to memory.
Instruction Scheduling: Ordering instructions to avoid pipeline hazards and exploit parallel execution units.
Calling Conventions: Managing parameter passing, return‐value placement, and callee/caller‐saved registers.
Resource Constraints: Handling fixed instruction sizes, alignment, delay slots, and special machine idioms. Balancing these factors affects both code speed and size.

A Simple Code Generator

A simple (tree-walker) code generator performs a recursive traversal of the abstract syntax tree to emit target code directly. Steps for each AST node:

Recursively generate code for left and right subtrees, yielding registers or temporaries holding their results.
Emit the machine instruction corresponding to the node’s operator, using those registers.
Return the register containing the result.

Example for a + b * c:

// Visit '*' node
LOAD R1, b       ; R1 ← b
LOAD R2, c       ; R2 ← c
MUL  R3, R1, R2  ; R3 ← R1 * R2
// Visit '+' node
LOAD R4, a       ; R4 ← a
ADD  R5, R4, R3  ; R5 ← R4 + R3

While straightforward to implement, this approach may produce suboptimal code without additional passes for register allocation and instruction scheduling.

Register Allocation & Assignment

Register allocation assigns program values (variables or temporaries) to a limited set of machine registers to minimize slow memory accesses. Assignment maps each live value to a specific register or spills it to memory if registers are exhausted.

Graph-Coloring Method:

Build an interference graph where nodes are temporaries and an edge means two are live at the same time.
Color the graph with k colors, where k is the number of registers—adjacent nodes cannot share a color.
Spill (store to memory) nodes that cannot be colored, then repeat.

Linear-Scan Method:

Sort live intervals by start point.
Allocate registers in one pass, freeing them when intervals end; spill when none are available.

Example: Temporaries t1, t2, t3 are all live simultaneously but only two registers R1, R2 exist. The interference graph is a triangle. A coloring algorithm assigns R1 to t1, R2 to t2, and spills t3 to memory (inserting load/store around its uses). Effective register allocation reduces load/store penalties and significantly speeds up execution.