System design interviews terrify most engineers because there is no single correct answer, and interviewers rarely give much direction. This guide teaches a battle-tested framework that works at Google, Meta, Amazon, and other FAANG-tier companies.
The 45-Minute System Design Framework
| Phase | Time | What to do |
|---|---|---|
| 1. Requirements | 5 min | Clarify functional + non-functional requirements |
| 2. Scale estimation | 3 min | Back-of-envelope: QPS, storage, bandwidth |
| 3. High-level design | 10 min | Draw boxes + arrows: clients, API, services, DB |
| 4. Deep dive | 20 min | Pick 2-3 critical components, go deep |
| 5. Trade-offs | 5 min | Discuss alternatives you chose not to use and why |
| 6. Wrap-up | 2 min | Summarize design, mention what you would do next |
Phase 1: Requirements Clarification (Never Skip This)
"""
Ask these for EVERY system design problem:
Functional requirements (what it does):
- "What are the core features you want me to design?"
- "Should I include search/notifications/analytics or just core flow?"
- "Read-heavy or write-heavy?"
- "Mobile, web, or both?"
Non-functional requirements (how well it does it):
- "What scale are we designing for? DAU / MAU?"
- "What latency do we need? (p50? p99?)"
- "Consistency vs availability — any specific requirements?"
- "Any SLA requirements?"
Clarifications that change your design entirely:
- Twitter: "Are there celebrities with 100M followers?"
→ Yes = hybrid fan-out; No = simple fan-out on write
- Ride-sharing: "Is pricing fixed or surge?"
→ Surge = need real-time supply/demand signals
- Notification: "Can notifications be delayed 30s?"
→ Yes = can batch; No = need real-time push
Red flag: engineers who immediately start drawing boxes without
asking any questions. Always spend 5 minutes on requirements.
"""
Phase 2: Back-of-Envelope Estimation
"""
Numbers every engineer should know:
Latency:
L1 cache: 0.5ns
L2 cache: 7ns
RAM: 100ns
SSD random read: 100 microseconds
HDD random read: 10ms
Network round trip: 50-150ms (cross-continent)
Storage:
char/byte: 1 byte
int: 4 bytes
long/double: 8 bytes
UUID: 16 bytes
Tweet text (280 chars): ~500 bytes with metadata
Photo (compressed): ~300KB
Video (1 min, 720p): ~50MB
Throughput:
    DB (single node): ~10,000 reads/sec, ~2,000 writes/sec
    Redis: ~100,000 ops/sec per node
    Kafka: ~1,000,000 small messages/sec per cluster (varies widely with message size)
Estimation example for Twitter (100M DAU):
Writes: 100M DAU * 1 tweet/day / 86,400 sec = ~1,200 tweets/sec
Reads: 100M DAU * 50 timeline reads / 86,400 = ~58,000 reads/sec
Read/write ratio: ~50:1
Storage per tweet: 500 bytes
Daily tweet storage: 1,200/s * 86,400s * 500 bytes = ~52GB/day
3-year total: 52GB * 365 * 3 = ~57TB (manageable, single region)
"""
Phase 3: High-Level Design — Components to Always Mention
"""
Standard starting architecture (modify as needed):
Clients (Mobile/Web)
|
CDN ←-- static assets, images, cached API responses
|
    Load Balancer (Layer 7, HTTP-aware)
|
API Gateway ←-- auth, rate limiting, routing
|
Service Layer ←-- one or more microservices / monolith
| ←-- async: Message Queue (Kafka/SQS)
|
Primary DB → Read Replicas (for read scaling)
|
Cache (Redis) ←-- hot data, session, computed results
Supporting services:
- Object Storage (S3) for files/media
- Search (Elasticsearch) if needed
- CDN (CloudFront) for media delivery
Always explain WHY each component exists, not just draw it.
Bad: "Here is a cache."
Good: "I am adding Redis here because the user profile is read
on every API call, and reading from the DB every time
would add 10ms and hurt our p99 latency target."
"""
Phase 4: Deep Dive — Choose Wisely
"""
Your interviewer will say: "Let us go deeper on one area."
You have a choice. Pick the area where you are strongest OR
ask them: "Which part interests you most?"
High-value areas to deep dive:
1. Data model / schema design
- Show you understand normalization, indices, partitioning
- Draw entity relationships, explain key design decisions
2. Scalability bottleneck
- Identify the single component that will break first
- Explain how to scale it (read replicas, sharding, caching)
3. The hard technical problem
- Fan-out for social feed
- Consistent hashing for distributed KV store
- Lag compensation for games
- Exactly-once in message queues
4. API design
- REST endpoint signatures with HTTP verbs
- Request/response schemas
- Pagination strategy
What NOT to deep dive (unless asked):
- Infrastructure / deployment details
- Monitoring / logging (mention, but do not dwell)
- Security (mention authentication, not full threat model)
"""
Common Mistakes and How to Avoid Them
| Mistake | Why it fails | Fix |
|---|---|---|
| Jumping straight to architecture | Design the wrong thing; miss scope | Spend 5 minutes on requirements first |
| Over-engineering the scale | Designing for 1B users when asked for an MVP | Confirm scale explicitly; start simple, scale up |
| Monologue without pauses | Interviewer cannot redirect; you miss signals | Pause every 2-3 minutes: "Does this direction make sense?" |
| Vague component names | "Database" is not an answer | Name a specific technology (PostgreSQL, Cassandra, Redis) with a reason |
| Ignoring trade-offs | Shows lack of engineering judgment | For every choice, say "I chose X over Y because…" |
| Not handling failure cases | Production systems fail constantly | Mention retries, circuit breakers, DLQs, fallback behavior |
| Drawing boxes without explaining | Interviewer cannot assess understanding | Narrate your thinking as you draw |
| Giving up when stuck | Shows inability to handle ambiguity | Think out loud, eliminate options, ask for hints if needed |
System Design Study Roadmap
- Week 1: Master foundations — Consistent Hashing, CAP Theorem, Sharding, Caching, Message Queues
- Week 2: Classic problems — URL Shortener, Twitter Feed, YouTube, WhatsApp, Uber
- Week 3: Advanced problems — Google Search, Distributed Key-Value Store, Payment System
- Week 4: Mock interviews — practice explaining out loud, not just thinking silently
Numbers to Memorize for Estimation
| Quantity | Value |
|---|---|
| 1 million seconds | ~11.5 days |
| 1 billion seconds | ~31.7 years |
| Requests/sec (1M users, 10 req/user/day) | 10M / 86,400 ≈ 115 req/sec |
| 1 year of 10KB writes at 1,000/sec | ~315TB |
| Read replicas | 5-10x the primary's read throughput |
| Single Kafka broker | ~100MB/s write, ~500MB/s read |
| S3 throughput per prefix | 5,500 GET/sec, 3,500 PUT/sec |
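The time and storage rows in this table are plain arithmetic, so they are easy to rebuild under pressure rather than memorize blindly. A quick sanity check (taking 10KB as 10,000 bytes, as the ~315TB figure implies):

```python
# Rebuilding the memorizable estimation numbers from first principles.
SECONDS_PER_DAY = 86_400
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY

million_sec_days = 1_000_000 / SECONDS_PER_DAY       # ~11.5 days
billion_sec_years = 1_000_000_000 / SECONDS_PER_YEAR # ~31.7 years
req_per_sec = 10_000_000 / SECONDS_PER_DAY           # ~115 req/sec

# 10KB per write * 1,000 writes/sec, sustained for a year:
yearly_bytes = 10_000 * 1_000 * SECONDS_PER_YEAR
yearly_tb = yearly_bytes / 1e12                      # ~315 TB

print(f"1M seconds  = {million_sec_days:.1f} days")
print(f"1B seconds  = {billion_sec_years:.1f} years")
print(f"requests/sec = {req_per_sec:.0f}")
print(f"1 year of 10KB @ 1k/sec = {yearly_tb:.0f} TB")
```

The throughput rows (replicas, Kafka, S3), by contrast, are empirical rules of thumb and are the ones worth actually committing to memory.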