System design interviews terrify most engineers because there is no single correct answer and interviewers rarely give much direction. This guide teaches a battle-tested framework that applies at Google, Meta, Amazon, and other FAANG-tier companies.
Phase 1: Clarify Requirements
"""
Ask these for EVERY system design problem:
Functional requirements (what it does):
- "What are the core features you want me to design?"
- "Should I include search/notifications/analytics or just core flow?"
- "Read-heavy or write-heavy?"
- "Mobile, web, or both?"
Non-functional requirements (how well it does it):
- "What scale are we designing for? DAU / MAU?"
- "What latency do we need? (p50? p99?)"
- "Consistency vs availability — any specific requirements?"
- "Any SLA requirements?"
Clarifications that change your design entirely:
- Twitter: "Are there celebrities with 100M followers?"
→ Yes = hybrid fan-out; No = simple fan-out on write
- Ride-sharing: "Is pricing fixed or surge?"
→ Surge = need real-time supply/demand signals
- Notification: "Can notifications be delayed 30s?"
→ Yes = can batch; No = need real-time push
Red flag: engineers who immediately start drawing boxes without
asking any questions. Always spend 5 minutes on requirements.
"""
Phase 2: Back-of-Envelope Estimation
"""
Numbers every engineer should know:
Latency:
L1 cache: 0.5ns
L2 cache: 7ns
RAM: 100ns
SSD random read: 100 microseconds
HDD random read: 10ms
Network round trip: 50-150ms (cross-continent)
Storage:
char/byte: 1 byte
int: 4 bytes
long/double: 8 bytes
UUID: 16 bytes
Tweet text (280 chars): ~500 bytes with metadata
Photo (compressed): ~300KB
Video (1 min, 720p): ~50MB
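Throughput (rough orders of magnitude; real numbers vary with hardware and workload):
DB: ~10,000 reads/sec per replica | ~2,000 writes/sec on the primary
Redis: ~100,000 ops/sec per node
Kafka: up to ~1,000,000 small messages/sec per partition (depends heavily on message size)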
Estimation example for Twitter (100M DAU):
Writes: 100M DAU * 1 tweet/day / 86,400 sec = ~1,200 tweets/sec
Reads: 100M DAU * 50 timeline reads / 86,400 = ~58,000 reads/sec
Read/write ratio: ~50:1
Storage per tweet: 500 bytes
Daily tweet storage: 1,200/s * 86,400s * 500 bytes = ~52GB/day
3-year total: 52GB * 365 * 3 = ~57TB (manageable, single region)
"""
Phase 3: High-Level Design — Components to Always Mention
"""
Standard starting architecture (modify as needed):
Clients (Mobile/Web)
|
CDN ←-- static assets, images, cached API responses
|
Load Balancer (Layer 7, HTTP-aware)
|
API Gateway ←-- auth, rate limiting, routing
|
Service Layer ←-- one or more microservices / monolith
| ←-- async: Message Queue (Kafka/SQS)
|
Cache (Redis) ←-- hot data, sessions, computed results; checked before the DB
|
Primary DB → Read Replicas (for read scaling)
Supporting services:
- Object Storage (S3) for files/media
- Search (Elasticsearch) if needed
- CDN (CloudFront) for media delivery
Always explain WHY each component exists, not just draw it.
Bad: "Here is a cache."
Good: "I am adding Redis here because the user profile is read
on every API call, and reading from the DB every time
would add 10ms and hurt our p99 latency target."
"""
Phase 4: Deep Dive — Choose Wisely
"""
Your interviewer will say: "Let us go deeper on one area."
You have a choice. Pick the area where you are strongest OR
ask them: "Which part interests you most?"
High-value areas to deep dive:
1. Data model / schema design
- Show you understand normalization, indices, partitioning
- Draw entity relationships, explain key design decisions
2. Scalability bottleneck
- Identify the single component that will break first
- Explain how to scale it (read replicas, sharding, caching)
3. The hard technical problem
- Fan-out for social feed
- Consistent hashing for distributed KV store
- Lag compensation for games
- Exactly-once delivery in message queues
4. API design
- REST endpoint signatures with HTTP verbs
- Request/response schemas
- Pagination strategy
What NOT to deep dive (unless asked):
- Infrastructure / deployment details
- Monitoring / logging (mention, but do not dwell)
- Security (mention authentication, not full threat model)
"""
Common Mistakes and How to Avoid Them
| Mistake | Why it fails | Fix |
|---|---|---|
| Jumping straight to architecture | Design the wrong thing; miss scope | Spend 5 minutes on requirements first |
| Over-engineering the scale | Designing for 1B users when asked for MVP | Confirm scale explicitly; start simple, scale up |
| Monologue without pauses | Interviewer cannot redirect; you miss signals | Pause every 2-3 minutes: “Does this direction make sense?” |
| Vague component names | “Database” is not an answer | Name specific technology (PostgreSQL, Cassandra, Redis) with reason |
Week 3: Advanced problems — Google Search, Distributed Key-Value Store, Payment System
Week 4: Mock interviews — practice explaining out loud, not just thinking silently
Numbers to Memorize for Estimation
| Unit | Value |
|---|---|
| 1 million seconds | ~11.6 days |
| 1 billion seconds | ~31.7 years |
| Requests per day (1M users, 10 req/user) | 10M / 86,400 ≈ 116 req/sec |
| 1 year of 10KB writes at 1k/sec | ~315TB |
| Read replicas | Up to 5-10x the primary's read throughput |
| Single Kafka partition | 100MB/s write, 500MB/s read |
| S3 throughput per prefix | 5,500 GET/sec, 3,500 PUT/sec |
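The time and storage conversions in this table are easy to re-derive rather than memorize. The snippet below is a small sketch that recomputes them, assuming a 365-day year and decimal units (1KB = 1,000 bytes); the helper names are invented for this example.

```python
SECONDS_PER_DAY = 86_400
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY  # 31,536,000


def seconds_to_days(seconds):
    return seconds / SECONDS_PER_DAY


def seconds_to_years(seconds):
    return seconds / SECONDS_PER_YEAR


def requests_per_second(users, requests_per_user_per_day):
    """Average rate only; peak traffic is typically a multiple of this."""
    return users * requests_per_user_per_day / SECONDS_PER_DAY


def yearly_storage_tb(bytes_per_write, writes_per_sec):
    return bytes_per_write * writes_per_sec * SECONDS_PER_YEAR / 1e12


if __name__ == "__main__":
    print(f"1M seconds  = {seconds_to_days(1_000_000):.2f} days")
    print(f"1B seconds  = {seconds_to_years(1_000_000_000):.2f} years")
    print(f"1M users x 10 req/day = {requests_per_second(1_000_000, 10):.1f} req/sec")
    print(f"10KB at 1k writes/sec = {yearly_storage_tb(10_000, 1_000):.1f} TB/year")
```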