AI/ML Interview: ML System Design Framework — Problem Formulation, Data, Model, Serving, Monitoring

ML system design interviews evaluate your ability to design end-to-end machine learning systems — from problem formulation to production monitoring. Unlike coding interviews (single algorithm) or traditional system design (infrastructure), ML system design tests: problem framing, data strategy, model selection, training pipeline, serving architecture, and feedback loops. This guide provides the framework used at Google, Meta, Netflix, and other ML-heavy companies.

The ML System Design Framework

Follow this structure for any ML system design question:

(1) Problem Formulation (5 min) — clarify the business objective, define the ML task (classification, ranking, generation), choose the optimization metric, and identify constraints (latency, freshness, fairness).
(2) Data (5 min) — identify data sources, define features, discuss labeling strategy, and address data quality.
(3) Model (10 min) — select the model architecture, define the training pipeline, discuss offline evaluation.
(4) Serving (10 min) — design the inference architecture, discuss online evaluation (A/B testing), and address latency and scaling.
(5) Monitoring and Iteration (5 min) — define monitoring metrics, discuss feedback loops, and plan for model updates.

Unlike traditional system design, where you draw infrastructure boxes, ML system design requires discussing: what the model learns (features, labels), how it learns (training data, architecture, loss function), and how you know it works (offline metrics, online metrics, business impact). The interviewer evaluates: can you frame a business problem as an ML problem? Can you design the data pipeline that feeds the model? Can you make practical tradeoffs (model complexity vs latency, accuracy vs fairness)?

Step 1: Problem Formulation

The most important step. A wrong formulation means everything else is wasted.

For “design a spam filter”: Business objective: reduce spam reaching user inboxes while minimizing false positives (legitimate emails marked as spam). ML task: binary classification (spam / not spam) on each incoming email. Optimization metric: maximize recall (catch most spam) subject to precision > 99.5% (fewer than 0.5% of the emails flagged as spam are actually legitimate). Why this constraint: users tolerate seeing occasional spam (low recall is annoying but not catastrophic) but do not tolerate missing important emails (low precision means emails from clients or family go to the spam folder, which is unacceptable).

For “design a news feed ranking system”: Business objective: maximize user engagement (time spent, sessions per day) while maintaining content quality. ML task: ranking (score each candidate post for the target user, sort by score). Optimization metric: NDCG@10 offline; engagement metrics (click-through rate, time spent, sessions) measured via A/B testing online. Constraints: latency < 200ms per feed request (ranking must be fast), fairness (do not systematically demote content from minority creators), and safety (filter harmful content before ranking).

Always ask the interviewer: “What is the primary business metric?” “What are the latency requirements?” “Are there fairness or safety constraints?”
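The spam-filter formulation above — maximize recall subject to a precision floor — translates directly into threshold selection on a validation set. A minimal NumPy sketch (the score and label arrays are hypothetical; in practice they come from your validation split):

```python
import numpy as np

def pick_threshold(scores, labels, min_precision=0.995):
    """Return (threshold, recall): the score cutoff that maximizes recall
    while keeping precision at or above `min_precision`.

    Works by sorting examples by score and sweeping the cutoff: as the
    threshold drops, recall only grows, so the lowest qualifying
    threshold gives the best recall."""
    order = np.argsort(-scores)
    s, y = scores[order], labels[order]
    tp = np.cumsum(y)              # true positives if we cut here
    fp = np.cumsum(1 - y)          # false positives if we cut here
    precision = tp / (tp + fp)
    recall = tp / y.sum()
    ok = np.flatnonzero(precision >= min_precision)
    if ok.size == 0:
        return None, 0.0           # no threshold meets the constraint
    best = ok.max()                # lowest qualifying cutoff = max recall
    return float(s[best]), float(recall[best])
```

In an interview, mentioning this explicitly — “I tune the decision threshold on validation data to meet the precision constraint, rather than using 0.5” — signals that you understand the metric is a product requirement, not a model default.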

Step 2: Data Strategy

ML models are only as good as their training data. For each feature, ask: where does it come from? How fresh is it? How is it computed?

Feature categories:
(1) User features — demographics (age, location, language), behavioral (past clicks, purchases, time on platform), and derived (user embedding from interaction history).
(2) Item features — content (text, image, category), metadata (publish date, author, source), and quality signals (engagement rate, report rate).
(3) Context features — time of day, day of week, device type, location. These change per request.
(4) User-item interaction features — has the user seen this item before? How many times has the user interacted with this author? These are the most predictive but also the most expensive to compute.

Labeling strategy:
(1) Explicit labels — the user clicks the “spam” button or rates a movie 4 stars. High quality but sparse.
(2) Implicit labels — the user clicks on a search result (positive) or does not click (weak negative); the user watches 90% of a video (positive engagement). Abundant but noisy.
(3) Human annotation — hire labelers to annotate data. Expensive but necessary for tasks without natural user feedback (content moderation, medical classification).

Data freshness: how often is the training data updated? For a spam filter, new spam patterns emerge daily, so the model must be retrained frequently (daily or weekly) on recent data. For a recommendation system, user preferences shift over time, and stale models lose relevance.
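The implicit-labeling strategy above can be sketched as a small mapping from raw interaction events to (label, sample weight) pairs. The event schema here is hypothetical; a real pipeline would read these from an event log:

```python
def implicit_label(event):
    """Map a raw interaction event to a (label, sample_weight) pair.

    Clicks and near-complete watches are treated as positives. An
    impression with no click is only a weak negative (the user may
    simply not have noticed the item), so it gets a low weight.
    Returns None for events that should not become training examples."""
    if event["type"] == "click":
        return 1.0, 1.0
    if event["type"] == "watch" and event.get("watch_pct", 0.0) >= 0.9:
        return 1.0, 1.0
    if event["type"] == "impression":
        return 0.0, 0.2            # weak negative: noisy, down-weighted
    return None                    # e.g. a partial watch: ambiguous, skip
```

The exact weights and cutoffs (0.9 watch fraction, 0.2 negative weight) are illustrative assumptions; in practice they are tuned empirically against offline and online metrics.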

Step 3: Model Architecture

Choose the simplest model that meets the requirements, then iterate.

Model selection hierarchy:
(1) Start with a baseline — logistic regression or gradient boosted trees (XGBoost, LightGBM). Fast to train, easy to interpret, and often surprisingly competitive.
(2) If the baseline is insufficient — deep learning. For text: a fine-tuned BERT or another transformer. For images: ResNet or ViT. For ranking: a two-tower model (candidate generation) plus a cross-network (ranking).
(3) For production at scale — the two-stage retrieval + ranking pipeline. Retrieval is fast and approximate (two-tower, embedding similarity); ranking is slow and accurate (a deep model with cross-features).

Training pipeline:
(1) Data split: time-based split for temporal data (train on the last 30 days, validate on day 31, test on day 32). Use a random split only for i.i.d. data.
(2) Feature engineering: compute features, handle missing values, encode categoricals.
(3) Training: define the model, loss function, and optimizer. Train with early stopping (stop when the validation metric stops improving).
(4) Hyperparameter tuning: grid search or Bayesian optimization on the validation set.
(5) Offline evaluation: compute metrics (AUC-PR, NDCG, F1) on the test set and compare with the current production model. If better, promote to A/B testing.

In the interview, explain your model choice with reasoning: “I start with LightGBM because it handles mixed feature types well, trains in minutes, and achieves strong baselines. If this is insufficient, I would explore a two-tower neural model for better embedding-based retrieval.”
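The time-based split from step (1) of the training pipeline can be sketched in NumPy. Here `days` is a hypothetical integer day index per example; the point is that training data strictly precedes validation and test data:

```python
import numpy as np

def time_based_split(days, train_last_day=29, valid_day=30, test_day=31):
    """Return (train, valid, test) index arrays for a temporal split:
    train on days 0..train_last_day, validate on valid_day, test on
    test_day. A random split here would leak future information into
    training and inflate offline metrics."""
    days = np.asarray(days)
    train = np.flatnonzero(days <= train_last_day)
    valid = np.flatnonzero(days == valid_day)
    test = np.flatnonzero(days == test_day)
    return train, valid, test
```

Pointing out the leakage risk of a random split on temporal data is one of the most common ways candidates distinguish themselves in this section.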

Step 4: Serving and Online Evaluation

Serving architecture depends on latency requirements:
(1) Real-time serving (< 100ms) — the model runs as an API service. Feature computation happens online (query the feature store), then the model executes a forward pass and returns the prediction. For ranking: retrieve ~1000 candidates from a pre-computed index (fast), then rank them with the deep model (slower, but on a small set).
(2) Batch serving — pre-compute predictions for all users, store them in a cache or database, and serve from the cache on request. Use for daily email recommendations or nightly fraud scoring. Latency is near zero (a cache lookup), but predictions are hours stale.
(3) Hybrid — batch pre-compute candidate embeddings; at request time, compute the user embedding and run an approximate nearest-neighbor (ANN) search. This combines batch efficiency with real-time personalization.

Online evaluation: deploy the new model via an A/B test. Control: the current production model (90% of traffic). Treatment: the new model (10% of traffic). Compare business metrics (CTR, conversion, revenue, engagement) over 1-2 weeks, requiring statistical significance (p < 0.05) on the primary metric. Guardrail metrics: monitor for regressions (latency, error rate, user complaints); if guardrails degrade, automatically revert the treatment. Promote to 100% only after the primary metric improves AND the guardrails hold.

In the interview, say something like: “I would serve the model via a REST API with P99 < 50ms. Features are fetched from the feature store’s online store (e.g., Redis). I would A/B test against the current model with CTR as the primary metric and latency as a guardrail.”
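The significance check on the primary metric is, for CTR, a standard two-proportion z-test. A minimal standard-library sketch (click and impression counts are hypothetical):

```python
from math import erf, sqrt

def ctr_z_test(clicks_ctrl, views_ctrl, clicks_trt, views_trt):
    """Two-proportion z-test for a CTR difference between control and
    treatment arms. Returns (z, two_sided_p_value); typically reject
    the null hypothesis of equal CTR at p < 0.05."""
    p_c = clicks_ctrl / views_ctrl
    p_t = clicks_trt / views_trt
    p = (clicks_ctrl + clicks_trt) / (views_ctrl + views_trt)  # pooled rate
    se = sqrt(p * (1 - p) * (1 / views_ctrl + 1 / views_trt))
    z = (p_t - p_c) / se
    # Two-sided p-value from the standard normal tail via erf.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value
```

This only covers the final significance check; a real experiment design also pre-computes the required sample size (power analysis) and guards against repeatedly peeking at interim results.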

Step 5: Monitoring and Feedback Loops

The model is not done after deployment. Monitor:
(1) Data drift — input feature distributions change over time. Detect with the Kolmogorov-Smirnov (KS) test or the Population Stability Index (PSI), and trigger retraining when drift exceeds a threshold.
(2) Prediction drift — the distribution of model outputs changes. If the model suddenly predicts 80% spam (up from 5%), the data pipeline may be broken.
(3) Online metric degradation — if CTR drops after deployment, the model may be worse than expected despite good offline metrics. Revert and investigate.

Feedback loops: the model’s predictions influence the data it trains on. If the spam filter blocks all spam, future training data contains no spam examples and the model forgets how to detect it. Mitigation: periodically sample and manually label unfiltered traffic.

Retraining schedule:
(1) Scheduled — retrain daily or weekly on recent data. The simplest approach.
(2) Triggered — retrain when data drift exceeds a threshold or online metrics degrade. More efficient, but requires monitoring infrastructure.
(3) Continuous — update the model incrementally with each new data batch (online learning). Most responsive, but complex to implement correctly.

In the interview: “I would monitor data drift daily using PSI, retrain weekly on the last 30 days of data, and if online CTR drops by more than 5% relative, automatically roll back to the previous model and alert the ML team.”
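PSI-based drift detection can be sketched in NumPy: bin a baseline sample by its quantiles, then compare bin frequencies of recent data. The thresholds in the docstring are a widely used rule of thumb, not a formal standard:

```python
import numpy as np

def psi(baseline, recent, bins=10):
    """Population Stability Index between a baseline feature sample and
    a recent one. Rule of thumb: < 0.1 stable, 0.1-0.25 moderate drift,
    > 0.25 major drift (investigate and consider retraining)."""
    # Bin edges from baseline quantiles; outer bins are open-ended so
    # recent values outside the baseline range still land in a bin.
    edges = np.quantile(baseline, np.linspace(0, 1, bins + 1))[1:-1]
    b_counts = np.bincount(np.searchsorted(edges, baseline), minlength=bins)
    r_counts = np.bincount(np.searchsorted(edges, recent), minlength=bins)
    # Clip to avoid log(0) for empty bins.
    b_pct = np.clip(b_counts / b_counts.sum(), 1e-6, None)
    r_pct = np.clip(r_counts / r_counts.sum(), 1e-6, None)
    return float(np.sum((r_pct - b_pct) * np.log(r_pct / b_pct)))
```

In production this would run per feature on a schedule, with drifted features surfaced in an alerting dashboard rather than triggering retraining blindly.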
