Low Level Design: Compiler Design Basics

A compiler translates source code in a high-level language to machine code (or bytecode) through a series of structured phases. Understanding compiler design explains how programming languages work under the hood — how code is parsed, type-checked, optimized, and executed. Compiler concepts appear in system design interviews for database query planners, template engines, configuration languages, and domain-specific languages. LLVM, JVM, V8, and database query compilers (Spark Catalyst, ClickHouse LLVM JIT) all implement these phases.

Lexing and Parsing: From Text to AST

Lexer (tokenizer): scans source text character by character and groups characters into tokens: keywords (if, while, return), identifiers (variableName), literals (42, "hello"), operators (+, ==), punctuation ({, }). Regular expressions define token patterns. Output: a flat stream of tokens. Parser: takes the token stream and builds a tree structure (Abstract Syntax Tree, AST) that represents the grammatical structure of the program. Parsing algorithms: recursive descent (hand-written, top-down, easy to understand — used by Go, Clang), LL(k) (table-driven, top-down), LALR(1) (table-driven, bottom-up, used by yacc/bison). The AST represents the program as nested nodes: FunctionDecl → BlockStmt → IfStmt → BinaryExpr → etc. Error recovery: parsers attempt to continue after a syntax error so they can report multiple errors in one pass.

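The scanning step described above can be sketched as a small tokenizer for the arithmetic grammar used in the parser example below. This is a minimal illustrative sketch, not a production lexer; the Token and TokenKind names are assumptions made for this example:

```go
package main

import (
	"fmt"
	"unicode"
)

type TokenKind int

const (
	NUMBER TokenKind = iota
	PLUS
	MINUS
	STAR
	SLASH
	LPAREN
	RPAREN
	EOF
)

type Token struct {
	Kind TokenKind
	Text string
}

// lex scans the input character by character, skipping whitespace,
// grouping digit runs into NUMBER tokens, and mapping each
// single-character operator to its own token kind.
func lex(src string) ([]Token, error) {
	var toks []Token
	runes := []rune(src)
	for i := 0; i < len(runes); {
		r := runes[i]
		switch {
		case unicode.IsSpace(r):
			i++
		case unicode.IsDigit(r):
			start := i
			for i < len(runes) && unicode.IsDigit(runes[i]) {
				i++
			}
			toks = append(toks, Token{NUMBER, string(runes[start:i])})
		default:
			kinds := map[rune]TokenKind{
				'+': PLUS, '-': MINUS, '*': STAR,
				'/': SLASH, '(': LPAREN, ')': RPAREN,
			}
			kind, ok := kinds[r]
			if !ok {
				return nil, fmt.Errorf("unexpected character %q at offset %d", r, i)
			}
			toks = append(toks, Token{kind, string(r)})
			i++
		}
	}
	// A trailing EOF token lets the parser peek past the last real token.
	return append(toks, Token{EOF, ""}), nil
}

func main() {
	toks, err := lex("2 + 3 * 4")
	if err != nil {
		panic(err)
	}
	fmt.Println(len(toks), "tokens") // 5 value/operator tokens plus EOF
}
```

The explicit EOF sentinel is a common convenience: the parser can always call peek without bounds checks.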
// Recursive descent parser example (simplified)
// Grammar: expr := term ((+ | -) term)*
//          term  := factor ((* | /) factor)*
//          factor := NUMBER | ( expr )

type Parser struct {
    tokens []Token
    pos    int
}

func (p *Parser) peek() TokenKind { return p.tokens[p.pos].Kind }
func (p *Parser) consume() Token  { t := p.tokens[p.pos]; p.pos++; return t }

func (p *Parser) parseExpr() (ASTNode, error) {
    left, err := p.parseTerm()
    if err != nil { return nil, err }

    // Left-associative: fold each (+ | -) term into a new BinaryExpr.
    for p.peek() == PLUS || p.peek() == MINUS {
        op := p.consume()  // consume + or -
        right, err := p.parseTerm()
        if err != nil { return nil, err }
        left = BinaryExpr{Op: op, Left: left, Right: right}
    }
    return left, nil
}

// AST for "2 + 3 * 4":
//   BinaryExpr(+)
//     Literal(2)
//     BinaryExpr(*)
//       Literal(3)
//       Literal(4)

// SQL query compilation (similar pipeline):
// SELECT name FROM users WHERE age > 25
// Tokenize -> Parse -> AST -> Semantic analysis (type check, resolve columns)
// -> Logical plan -> Query optimizer -> Physical plan -> Execute

Semantic Analysis and Type Checking

Semantic analysis validates meaning beyond syntax. The symbol table maps identifiers to their types and scopes. Type checking ensures operations are applied to compatible types (cannot add a string and an integer unless implicit conversion is defined). Scope resolution: when a variable is used, walk the scope chain outward to find its declaration. Type inference (Go, Rust, Haskell): deduce types from context without explicit annotations. Hindley-Milner type inference (Haskell) infers types for the entire program with minimal annotations. Error examples: use of undefined variable, type mismatch (expected int, got string), null pointer dereference (Rust prevents this at compile time by replacing null with the Option type), out-of-bounds access (caught by bounds checking at runtime or statically by some analyses).
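The symbol table and scope-chain walk described above can be sketched in a few lines. This is an illustrative sketch (the Scope and Resolve names are assumptions for this example, not from any particular compiler):

```go
package main

import "fmt"

// Scope is one level of the symbol table; Parent links form the scope chain.
type Scope struct {
	Parent  *Scope
	Symbols map[string]string // identifier -> type name
}

func NewScope(parent *Scope) *Scope {
	return &Scope{Parent: parent, Symbols: map[string]string{}}
}

// Resolve walks the scope chain outward until the identifier is found;
// falling off the end of the chain means "use of undefined variable".
func (s *Scope) Resolve(name string) (string, bool) {
	for cur := s; cur != nil; cur = cur.Parent {
		if typ, ok := cur.Symbols[name]; ok {
			return typ, true
		}
	}
	return "", false
}

func main() {
	global := NewScope(nil)
	global.Symbols["count"] = "int"

	block := NewScope(global) // inner block scope shadows nothing here
	block.Symbols["name"] = "string"

	typ, ok := block.Resolve("count") // found in the enclosing scope
	fmt.Println(typ, ok)              // int true

	_, ok = block.Resolve("missing") // semantic error: undefined variable
	fmt.Println(ok)                  // false
}
```

Shadowing falls out for free: an inner scope's entry is found before the walk ever reaches the outer declaration.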

Optimization and Code Generation

Optimizers transform the intermediate representation (IR) to improve performance without changing semantics. Common optimizations: Constant folding: evaluate constant expressions at compile time (2 + 3 → 5). Dead code elimination: remove code that cannot be reached or whose results are never used. Inlining: replace function calls with the function body (eliminates call overhead for small functions). Loop unrolling: replicate the loop body multiple times to reduce branch overhead. Vectorization: transform scalar loops into SIMD instructions. LLVM IR: the intermediate representation used by Clang (C/C++), Rust, Swift, and Julia. LLVM performs its target-independent optimizations on IR, then generates machine code for any supported target architecture. JIT compilation (JVM HotSpot, V8, LuaJIT): interpret initially, detect hot code paths, compile to native code at runtime — adapts optimizations to actual runtime behavior.
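Constant folding, the first optimization listed above, can be sketched as a recursive rewrite over a tiny expression IR. This is an illustrative sketch under assumed names (Expr, Lit, Bin), not LLVM's actual pass:

```go
package main

import "fmt"

// Minimal expression IR for the sketch: integer literals and binary ops.
type Expr interface{}

type Lit struct{ Val int }
type Bin struct {
	Op          byte // '+', '-', '*'
	Left, Right Expr
}

// fold performs constant folding: fold children first (bottom-up), then
// if both operands reduced to literals, evaluate the op at compile time.
func fold(e Expr) Expr {
	b, ok := e.(Bin)
	if !ok {
		return e // literals fold to themselves
	}
	fl, fr := fold(b.Left), fold(b.Right)
	l, lok := fl.(Lit)
	r, rok := fr.(Lit)
	if lok && rok {
		switch b.Op {
		case '+':
			return Lit{l.Val + r.Val}
		case '-':
			return Lit{l.Val - r.Val}
		case '*':
			return Lit{l.Val * r.Val}
		}
	}
	// Otherwise keep the node but with folded children.
	return Bin{b.Op, fl, fr}
}

func main() {
	// (2 + 3) * 4 folds to the single literal 20.
	e := Bin{'*', Bin{'+', Lit{2}, Lit{3}}, Lit{4}}
	fmt.Println(fold(e)) // {20}
}
```

The bottom-up order matters: folding children before the parent lets constants propagate upward through the tree in a single pass.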

Key Interview Discussion Points

  • Database query compilation: SQL query planners implement all compiler phases — parsing (SQL text to AST), semantic analysis (resolve table/column names, type check), logical optimization (predicate pushdown, join reordering), physical planning (choose index scan vs full scan), code generation (ClickHouse JIT-compiles query execution to LLVM IR for 2-5x speedup)
  • Interpreter vs compiler: interpreters execute AST nodes directly (Python CPython), compilers generate machine code; JIT compilation combines both — start interpreting, compile hot paths; AOT (ahead-of-time) compilation (Rust, Go) generates machine code before execution
  • SSA (Static Single Assignment): an IR form where each variable is assigned exactly once, making data flow analysis and optimization easier; LLVM IR and GCC GIMPLE use SSA, and the JVM HotSpot JIT compiler uses an SSA-based IR internally (JVM bytecode itself is stack-based, not SSA)
  • Tree-sitter: incremental parser designed for code editors (syntax highlighting, code navigation) — parses only changed portions of the file on each keystroke, producing a CST (Concrete Syntax Tree) in milliseconds; used by Neovim, GitHub, and Zed
  • Template engines as mini-compilers: Jinja2, Mustache, and Go templates parse template syntax into ASTs, apply data binding, and render output — all three compiler phases in miniature; understanding this helps design custom DSLs