The Transformer architecture (Vaswani et al., 2017, “Attention Is All You Need”) is the foundation of modern AI — powering GPT-4, Claude, BERT, and virtually every large language model. Understanding Transformers is essential for ML interviews at any company building or using LLMs. This guide covers the architecture from self-attention to pre-training, with the depth expected at ML engineering interviews.
Self-Attention Mechanism
Self-attention allows each token to attend to every other token in the sequence, computing a weighted sum of all token representations. For a sequence of N tokens, each token is projected into three vectors using learned weight matrices: Query (Q), Key (K), and Value (V).

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) * V

The QK^T dot product computes a similarity score between every pair of tokens. The softmax normalizes these scores into attention weights (probabilities summing to 1). The weighted sum of V produces the output representation for each token.

Why sqrt(d_k): without scaling, the dot products grow large with dimension d_k, pushing the softmax into regions with tiny gradients (the vanishing gradient problem). Dividing by sqrt(d_k) keeps the variance of the scores stable.

What self-attention captures: long-range dependencies. In “The cat sat on the mat because it was tired,” self-attention allows “it” to attend strongly to “cat” regardless of distance. RNNs struggle with this because information must flow sequentially through each intermediate token.
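The formula above can be sketched in a few lines of NumPy. This is a minimal single-head illustration with toy shapes (N = 4 tokens, d_k = 8), not a production implementation; the Q, K, V inputs are assumed to already be the projected matrices.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # (N, N) pairwise similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax: each row sums to 1
    return weights @ V, weights                          # weighted sum of values

rng = np.random.default_rng(0)
N, d_k = 4, 8
Q, K, V = rng.normal(size=(3, N, d_k))                   # toy projected inputs
out, w = scaled_dot_product_attention(Q, K, V)
```

Note the max-subtraction inside the softmax: it is the standard numerical-stability trick and does not change the result, since softmax is invariant to adding a constant per row.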
Multi-Head Attention
A single attention head captures one type of relationship. Multi-head attention runs H parallel attention heads, each with its own learned Q, K, V projections. Each head can learn a different relationship type: one head might attend to syntactic dependencies (subject-verb), another to semantic relationships (pronouns to their referents), and another to positional patterns (adjacent tokens).

The H head outputs are concatenated and projected through a linear layer:

MultiHead(Q, K, V) = Concat(head_1, …, head_H) * W_O, where head_i = Attention(Q * W_Q_i, K * W_K_i, V * W_V_i)

Typical values: d_model = 768 (BERT-base), H = 12 heads, d_k = d_model / H = 64 per head. The total computational cost matches that of single-head attention over the full d_model, because each head operates on only d_k = d_model / H dimensions. Multi-head attention is a key reason Transformers are expressive enough to model language.
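The split-attend-concatenate flow can be sketched as follows. This is an illustrative NumPy sketch with small toy dimensions (N = 5, d_model = 16, H = 4); real implementations use a framework and fuse the per-head projections into single matrix multiplies, as done here via reshape.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, H):
    N, d_model = X.shape
    d_k = d_model // H
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V                  # (N, d_model) each
    # Split each projection into H heads of width d_k: (H, N, d_k)
    split = lambda M: M.reshape(N, H, d_k).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_k)   # (H, N, N)
    heads = softmax(scores) @ Vh                          # (H, N, d_k)
    concat = heads.transpose(1, 0, 2).reshape(N, d_model) # Concat(head_1..head_H)
    return concat @ W_O                                   # final output projection

rng = np.random.default_rng(0)
N, d_model, H = 5, 16, 4
X = rng.normal(size=(N, d_model))
W_Q, W_K, W_V, W_O = rng.normal(size=(4, d_model, d_model)) * 0.1
out = multi_head_attention(X, W_Q, W_K, W_V, W_O, H)
```

The reshape trick is why the cost matches single-head attention: one (N, d_model) x (d_model, d_model) multiply produces all H per-head projections at once.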
Positional Encoding
Self-attention is permutation-invariant: it has no notion of token order. “Dog bites man” and “Man bites dog” produce the same attention scores without positional information. Positional encoding adds position information to token embeddings.

Sinusoidal positional encoding (original Transformer):

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

Each position gets a unique pattern of sine and cosine values, and the relative position between two tokens can be recovered from their encodings (a linear transformation relates PE(pos+k) to PE(pos)).

Learned positional embeddings (BERT, GPT): a learned embedding table with one embedding per position, up to the maximum sequence length (e.g., 512 for BERT). Simpler, and equally effective for fixed-length sequences.

RoPE (Rotary Position Embedding, used in LLaMA, Mistral): encodes position by rotating the Q and K vectors in 2D subspaces, which preserves relative position information after the dot product. Enables better length generalization than absolute positional embeddings.

ALiBi (Attention with Linear Biases): adds a position-dependent bias to the attention scores. Simpler than RoPE, and generalizes well to sequences longer than those seen during training.
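The sinusoidal formulas translate directly into a small table builder. A minimal sketch, assuming an even d_model; even columns hold the sines, odd columns the cosines:

```python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)); PE(pos, 2i+1) = cos(same angle)."""
    pos = np.arange(max_len)[:, None]                     # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]                  # (1, d_model/2)
    angle = pos / np.power(10000.0, 2 * i / d_model)      # one frequency per pair
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)                           # even dimensions
    pe[:, 1::2] = np.cos(angle)                           # odd dimensions
    return pe

pe = sinusoidal_pe(128, 64)   # one row per position, added to token embeddings
```

At position 0 every sine is 0 and every cosine is 1; as pos grows, low dimensions oscillate quickly and high dimensions slowly, giving each position its unique fingerprint.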
Transformer Block
A Transformer block (layer) consists of: (1) multi-head self-attention; (2) add and normalize (residual connection + layer normalization); (3) a feed-forward network, two linear layers with a non-linearity: FFN(x) = GELU(x * W_1 + b_1) * W_2 + b_2, with a hidden dimension typically 4x the model dimension (d_ff = 4 * d_model); (4) add and normalize.

The residual connections (x + sublayer(x)) are critical: they let gradients flow directly through the network, enabling training of very deep models (96 layers in GPT-3). Without residuals, training diverges. Layer normalization stabilizes training by normalizing the activations within each layer. Pre-norm (normalize before the sublayer) is preferred over post-norm (normalize after) in modern architectures because it produces more stable gradients during training.

A complete Transformer model stacks N blocks: BERT-base has 12 layers, GPT-2 has 12-48 layers depending on size (48 in GPT-2 XL), and GPT-3 has 96 layers; GPT-4 is rumored to be a mixture of experts. Each additional layer lets the model capture more complex patterns and abstractions.
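The pre-norm block structure can be sketched as below. This is an illustrative NumPy skeleton, not a trainable implementation: the attention sublayer is stubbed out with an identity function so the residual + pre-norm wiring stands out, and the GELU uses the common tanh approximation.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def ffn(x, W1, b1, W2, b2):
    # FFN(x) = GELU(x W1 + b1) W2 + b2, with hidden width d_ff = 4 * d_model
    return gelu(x @ W1 + b1) @ W2 + b2

def transformer_block(x, attn_fn, W1, b1, W2, b2):
    # Pre-norm: normalize, apply sublayer, then add the residual.
    x = x + attn_fn(layer_norm(x))            # (1)-(2) attention sublayer
    x = x + ffn(layer_norm(x), W1, b1, W2, b2)  # (3)-(4) feed-forward sublayer
    return x

rng = np.random.default_rng(0)
N, d_model = 6, 8
d_ff = 4 * d_model
x = rng.normal(size=(N, d_model))
W1 = rng.normal(size=(d_model, d_ff)) * 0.1
b1 = np.zeros(d_ff)
W2 = rng.normal(size=(d_ff, d_model)) * 0.1
b2 = np.zeros(d_model)
identity_attn = lambda h: h   # stand-in for multi-head self-attention
out = transformer_block(x, identity_attn, W1, b1, W2, b2)
```

A full model is just this block applied N times in sequence, each with its own weights.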
Encoder vs Decoder: BERT vs GPT
The original Transformer has both an encoder (processes input) and a decoder (generates output). Modern models use one or the other.

Encoder-only (BERT): bidirectional attention; each token attends to all tokens (left and right). Pre-trained with Masked Language Modeling (randomly mask 15% of tokens, predict the masked tokens from context). Excellent for understanding tasks: classification, NER, question answering, semantic similarity. BERT sees the full context when representing each token.

Decoder-only (GPT): causal (autoregressive) attention; each token attends only to previous tokens (left context). Pre-trained with next-token prediction (given tokens 1..N, predict token N+1). Excellent for generation tasks: text completion, conversation, code generation. GPT generates one token at a time, left to right.

Encoder-decoder (T5, BART): the encoder processes the input bidirectionally; the decoder generates the output autoregressively, with cross-attention to the encoder output. Used for translation, summarization, and any sequence-to-sequence task.

Modern LLMs (GPT-4, Claude, LLaMA, Mistral) are all decoder-only. The decoder-only architecture dominates because it scales more efficiently, generation is the primary use case, and understanding tasks can be reformulated as generation (few-shot prompting).
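The mechanical difference between bidirectional and causal attention is a single mask. A minimal sketch: with all scores equal, an encoder-style row spreads attention uniformly over every token, while a decoder-style row sees only itself and earlier tokens.

```python
import numpy as np

def attention_weights(scores, causal=False):
    """Softmax over attention scores, optionally with a causal (autoregressive) mask."""
    if causal:
        N = scores.shape[-1]
        future = np.triu(np.ones((N, N), dtype=bool), k=1)  # strictly above diagonal
        scores = np.where(future, -1e9, scores)             # block attention to the future
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))                 # identical scores for a 4-token toy sequence
bi = attention_weights(scores)            # encoder-style: every row uniform (0.25 each)
causal = attention_weights(scores, causal=True)  # decoder-style: lower-triangular weights
```

Row i of the causal output spreads its weight only over positions 0..i, which is exactly what makes next-token prediction well-posed: token i cannot peek at the answer.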
Tokenization
Transformers operate on tokens, not characters or words. Tokenization breaks text into subword units.

BPE (Byte Pair Encoding, used by GPT): start with individual characters and iteratively merge the most frequent adjacent pair into a single token. “unhappiness” might tokenize as [“un”, “happiness”] or [“un”, “happ”, “iness”] depending on the vocabulary. The merge rules are learned from a training corpus. Typical vocabulary size: 50,000-100,000 tokens.

WordPiece (BERT): similar to BPE but uses a likelihood-based criterion for merges instead of raw frequency.

SentencePiece (T5, LLaMA): a language-agnostic tokenizer that operates on raw text (including whitespace) and handles multilingual text naturally.

Why subword tokenization: (1) a fixed vocabulary handles any input (no out-of-vocabulary problem); (2) common words are single tokens (“the”, “is”) while rare words are split into subwords (“un” + “happiness”); (3) the vocabulary stays compact (50K tokens vs millions of unique words).

Tokenization affects model behavior. GPT-4's tokenizer has a vocabulary of roughly 100K tokens. A larger vocabulary means more of the text is covered by single tokens (fewer inference steps, better comprehension); a smaller vocabulary means more splitting and more inference steps per unit of text. For code, specialized tokenizers handle programming language syntax (indentation, operators, identifiers).
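The BPE training loop (count pairs, merge the most frequent, repeat) can be shown on a toy corpus. This is a didactic sketch of the merge procedure only, with a made-up three-word corpus; it is not any real tokenizer's implementation.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across a corpus of {symbol-tuple: frequency}."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of the chosen pair with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])   # merge into one token
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: word -> frequency, each word starts as a tuple of characters.
corpus = {tuple("low"): 5, tuple("lower"): 2, tuple("lowest"): 3}
for _ in range(2):  # two merge steps: "l"+"o" and then "lo"+"w" (or the reverse order)
    corpus = merge_pair(corpus, most_frequent_pair(corpus))
# After two merges, "low" has become a single token in every word.
```

Real BPE training runs thousands of such merges, and the resulting ordered merge list is what the tokenizer replays at inference time to segment new text.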