Autocorrect differs from a spell checker in intent: rather than flagging errors for the user to fix, it silently replaces input with the most probable correction. The core model is the noisy channel: given observed input w, find the correction c that maximizes P(c|w). By Bayes' theorem this is proportional to P(w|c) × P(c), where P(w|c) is the channel model (how likely is this typo given the intended word?) and P(c) is the language model (how likely is this word in context?).
Keyboard Distance Model (Channel Model)
P(w|c) is estimated from a keyboard layout distance matrix. For a QWERTY layout, precompute the Euclidean distance between every pair of keys on the physical keyboard. Adjacent keys (e.g., ‘s’ and ‘d’) get a high confusion probability; distant keys get a low one. The channel model for a one-character substitution is proportional to exp(−distance(observed, intended)). For insertions and deletions, use a fixed penalty. This gives a continuous-valued P(w|c) rather than the uniform unit edit cost used in basic spell checkers.
Build the confusion matrix from empirical typo data when available (keyboard distance is a prior; observed typo logs are a better signal). Store as a 26×26 float matrix in memory — negligible space.
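A minimal sketch of the substitution channel model, assuming a simplified staggered QWERTY coordinate table (the key positions and stagger factor here are illustrative, not a production layout):

```python
import math

# Approximate QWERTY key coordinates (row, column). A production table
# would cover the full keyboard and use measured key positions.
QWERTY_POS = {}
for row, keys in enumerate(["qwertyuiop", "asdfghjkl", "zxcvbnm"]):
    for col, ch in enumerate(keys):
        QWERTY_POS[ch] = (row, col + row * 0.5)  # stagger lower rows slightly

def key_distance(a: str, b: str) -> float:
    """Euclidean distance between two keys on the layout."""
    (r1, c1), (r2, c2) = QWERTY_POS[a], QWERTY_POS[b]
    return math.hypot(r1 - r2, c1 - c2)

def substitution_prob(observed: str, intended: str) -> float:
    """Channel probability for a one-character substitution,
    proportional to exp(-distance(observed, intended))."""
    return math.exp(-key_distance(observed, intended))
```

Adjacent keys score far higher than distant ones: `substitution_prob('s', 'd')` dominates `substitution_prob('s', 'p')`, matching the fat-finger intuition above.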
Language Model (Prior)
P(c) is estimated by an n-gram language model trained on a large text corpus. For autocorrect, a trigram model (P(c | w_{i-2}, w_{i-1})) captures enough context for most corrections. Store the model as a trie over n-gram prefixes with Kneser-Ney smoothing to handle unseen combinations. A larger model with 5-gram back-off fits in 1–2 GB of RAM for a vocabulary of ~1M words.
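The contextual prior can be sketched as a toy trigram model. Note this substitutes simple stupid backoff for the Kneser-Ney smoothing described above, to keep the example short:

```python
from collections import defaultdict

class TrigramLM:
    """Toy trigram model with stupid backoff (a simplification of
    Kneser-Ney: back off with a fixed penalty alpha instead of
    discounted continuation counts)."""

    def __init__(self, alpha=0.4):
        self.counts = [defaultdict(int) for _ in range(3)]  # uni, bi, tri
        self.total = 0
        self.alpha = alpha

    def train(self, tokens):
        for i, w in enumerate(tokens):
            self.total += 1
            self.counts[0][(w,)] += 1
            if i >= 1:
                self.counts[1][(tokens[i-1], w)] += 1
            if i >= 2:
                self.counts[2][(tokens[i-2], tokens[i-1], w)] += 1

    def prob(self, w, context):
        """P(w | context): try the trigram, back off to bigram, then unigram."""
        tri = tuple(context[-2:]) + (w,)
        if len(tri) == 3 and self.counts[2][tri]:
            hist = self.counts[1][tri[:2]]
            if hist:
                return self.counts[2][tri] / hist
        bi = tuple(context[-1:]) + (w,)
        if len(bi) == 2 and self.counts[1][bi]:
            hist = self.counts[0][bi[:1]]
            if hist:
                return self.alpha * self.counts[1][bi] / hist
        return self.alpha ** 2 * self.counts[0][(w,)] / max(self.total, 1)
```

A production model would store these counts in the trie described above rather than hash tables.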
Candidate Generation
Generate candidates at edit distance 1 and 2 from the input token. Edit distance 1 covers ~80% of real typos; edit distance 2 handles the rest at the cost of more candidates. Use the BK-tree structure (described in the spell checker post) for efficient lookup. Cap candidates at 50 before scoring to bound latency.
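Candidate generation at edit distance 1 and 2 can be sketched by enumeration, after Norvig's classic spelling-corrector approach (a brute-force stand-in for the BK-tree lookup, workable for short tokens):

```python
import string

def edits1(word: str) -> set[str]:
    """All strings at edit distance 1: deletions, transpositions,
    substitutions, and insertions."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {L + R[1:] for L, R in splits if R}
    transposes = {L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1}
    substitutes = {L + c + R[1:] for L, R in splits if R for c in letters}
    inserts = {L + c + R for L, R in splits for c in letters}
    return deletes | transposes | substitutes | inserts

def candidates(word, vocab, cap=50):
    """Candidates at distance 1, then distance 2 only if needed,
    filtered to the vocabulary and capped before scoring."""
    cands = [w for w in edits1(word) if w in vocab]
    if not cands:
        e2 = {w2 for w1 in edits1(word) for w2 in edits1(w1)}
        cands = [w for w in e2 if w in vocab]
    return cands[:cap]
```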
Reranking
Score each candidate c as:
score(c) = α × log P(w|c) + β × log P(c | context)
α and β are tuned on a held-out labeled dataset. Select the top-scoring candidate. When the top candidate score is below a confidence threshold, do not autocorrect — surface it as a suggestion instead. This prevents autocorrect from changing words the model is uncertain about.
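A sketch of the scoring and confidence gate, with hypothetical `channel_prob` and `lm_prob` callables standing in for the two models above (the weights and threshold are placeholders to be tuned on held-out data, as described):

```python
import math

def rerank(observed, cands, channel_prob, lm_prob, context,
           alpha=1.0, beta=1.0, threshold=-10.0):
    """Score each candidate c as alpha*log P(w|c) + beta*log P(c|context).
    Returns (best, autocorrect) where autocorrect is False when the top
    score falls below the confidence threshold -- surface a suggestion
    instead of silently replacing."""
    scored = []
    for c in cands:
        score = (alpha * math.log(channel_prob(observed, c) + 1e-12)
                 + beta * math.log(lm_prob(c, context) + 1e-12))
        scored.append((score, c))
    best_score, best = max(scored)
    return best, best_score >= threshold
```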
Personalization
Learn from user behavior: when a user accepts or rejects a correction, update a per-user frequency table. When a user types a word not in the global vocabulary multiple times without correction, add it to their personal dictionary. These signals adjust the prior P(c) for that user: personal words get a boosted frequency, and corrections the user repeatedly rejects get suppressed.
Store personalization data as a compact delta over the global model (user-specific unigram adjustments and a blocklist). Load at session start from a fast key-value store. Keep the delta small (cap at 10k entries per user) to bound memory.
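A sketch of the per-user delta, assuming illustrative names (`PersonalDelta`, `adjust_prior`) and the 10k-entry cap from above:

```python
class PersonalDelta:
    """Compact per-user adjustment over the global prior: boosted
    personal words plus a blocklist of rejected corrections."""

    MAX_ENTRIES = 10_000  # cap to bound per-user memory

    def __init__(self):
        self.boost = {}         # word -> extra frequency
        self.blocklist = set()  # corrections the user keeps rejecting

    def _has_room(self):
        return len(self.boost) + len(self.blocklist) < self.MAX_ENTRIES

    def record_rejection(self, correction):
        if self._has_room():
            self.blocklist.add(correction)

    def record_personal_word(self, word):
        if self._has_room() or word in self.boost:
            self.boost[word] = self.boost.get(word, 0) + 1

    def adjust_prior(self, word, global_prob):
        """Suppress blocklisted corrections; boost personal words."""
        if word in self.blocklist:
            return 0.0
        return global_prob * (1 + self.boost.get(word, 0))
```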
Real-Time Inference Pipeline
The pipeline runs on every keystroke after word boundary detection (space or punctuation). Target latency is under 50ms end-to-end. Breakdown: candidate generation via BK-tree (~5ms), scoring 50 candidates with n-gram lookup (~10ms), personalization adjustment (~2ms), network (~5ms). The total (~22ms) leaves comfortable headroom under the 50ms budget.
Cache correction results for common inputs in an LRU cache keyed by (token, preceding context bigram). Cache invalidation on personalization updates.
Neural Approach
State-of-the-art autocorrect uses a transformer-based sequence-to-sequence model (e.g., fine-tuned T5 or a purpose-built encoder-decoder). The model takes the full sentence as input and outputs the corrected sentence, capturing long-range context that n-grams miss. The trade-off is latency: a small transformer (60M parameters) runs in ~30ms on CPU per sentence, which is acceptable for sentence-level correction on submit but too slow for per-keystroke correction. In practice, production systems use n-gram models for real-time and neural re-scoring on sentence completion.
Frequently Asked Questions
How does autocorrect work using the noisy channel model?
The noisy-channel model treats the user’s typed input w as a “noisy” version of the intended word. The corrector finds the candidate c that maximizes P(c) × P(w | c), where P(c) is the prior probability of the word from a language model and P(w | c) is the likelihood of the observed keystrokes given that c was intended. This probabilistic framing naturally balances word frequency against typographic plausibility, producing corrections that feel intuitive.
How does keyboard layout distance improve autocorrect accuracy?
Keyboard layout distance quantifies how physically close two keys are on a QWERTY (or other) layout. It is incorporated into the error model P(typed | candidate): adjacent-key substitutions receive a high probability because they are common fat-finger mistakes, while distant-key substitutions receive a low probability. This prevents the system from suggesting words that differ by letters far apart on the keyboard, dramatically reducing false corrections and improving perceived accuracy.
How do you use an n-gram language model to rank correction candidates?
After generating a candidate set via edit-distance search, each candidate is scored using an n-gram language model that considers surrounding context. For a trigram model, the score of candidate w_i is based on P(w_i | w_{i-2}, w_{i-1}), the probability of that word given the two preceding words. Candidates are re-ranked by this contextual score, so the model prefers words that fit the sentence naturally over words that are merely close in edit distance. Kneser-Ney smoothing handles unseen n-grams.
How do you personalize autocorrect for individual users?
Personalization is achieved by maintaining a per-user n-gram model or word-frequency table that is updated as the user types and accepts or rejects corrections. Accepted corrections reinforce the probability of that candidate in the user’s personal model; rejected corrections suppress it. The personal model is blended with the global model using interpolation (e.g., 70% global, 30% personal), with the blend weight tunable per user. The model is stored compactly in the user’s profile and loaded on-device or fetched server-side at session start.
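The global/personal interpolation can be written directly; `lam=0.7` mirrors the 70/30 blend mentioned above (the function name and callable signatures are illustrative):

```python
def blended_prob(word, context, p_global, p_personal, lam=0.7):
    """Linear interpolation of a global and a personal language model.
    lam is the global weight, tunable per user."""
    return lam * p_global(word, context) + (1 - lam) * p_personal(word, context)
```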