System Design Interview: Design an Email System (like Gmail)
Designing an email system covers distributed storage, message queuing, full-text search, spam filtering, and protocol handling (SMTP, IMAP, POP3). Variants of this question are asked at Google, Microsoft, and Yahoo.
Requirements Clarification
Functional Requirements
- Send and receive emails (with attachments up to 25MB)
- Organize emails: inbox, sent, drafts, spam, labels/folders
- Search emails by sender, subject, body content, date
- Threading: group related emails into conversations
- Spam and phishing detection
- Notifications for new emails
Non-Functional Requirements
- Scale: 2B users (Gmail scale), 300B emails/day
- Storage: 15GB free per user x 2B users = 30EB upper bound (if every user filled their quota)
- Latency: email delivery <1s within same domain, <5s cross-domain
- Availability: 99.9% (email is not real-time; brief delay acceptable)
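The scale numbers above can be sanity-checked with a quick back-of-envelope calculation. This is a minimal sketch that only restates the figures from the requirements; the function names are illustrative, and decimal units (1GB = 10^9 bytes) are an assumption.

```python
# Back-of-envelope capacity estimate using the figures from the requirements.
USERS = 2_000_000_000               # 2B users
EMAILS_PER_DAY = 300_000_000_000    # 300B emails/day
QUOTA_BYTES = 15 * 10**9            # 15GB per user (assumes decimal GB)
SECONDS_PER_DAY = 86_400


def avg_emails_per_second() -> float:
    """Average ingest rate; peak traffic will be higher than this."""
    return EMAILS_PER_DAY / SECONDS_PER_DAY


def max_storage_exabytes() -> float:
    """Upper bound on storage if every user filled their quota."""
    return USERS * QUOTA_BYTES / 10**18


def emails_per_user_per_day() -> float:
    return EMAILS_PER_DAY / USERS


if __name__ == "__main__":
    print(f"avg emails/sec:  {avg_emails_per_second():,.0f}")
    print(f"max storage:     {max_storage_exabytes():.0f} EB")
    print(f"emails/user/day: {emails_per_user_per_day():.0f}")
```

The average works out to roughly 3.5 million emails per second, which is why the ingest path is queue-based rather than synchronous.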
Email Protocols
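- SMTP (Simple Mail Transfer Protocol): transfers email from client to server and between servers. Port 25 (server-to-server relay), 587 (client submission with STARTTLS), 465 (submission over implicit TLS). Used for outbound delivery.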
- IMAP (Internet Message Access Protocol): client accesses mailbox on server. Email stays on server, synced across devices. Modern standard.
- POP3 (Post Office Protocol): downloads email to the local client and, by default, deletes it from the server. Legacy, mostly replaced by IMAP.
- SPF/DKIM/DMARC: authentication standards to prevent email spoofing and enable spam filtering.
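To make the DMARC part concrete: a domain publishes its policy as a DNS TXT record made of semicolon-separated tag=value pairs. The sketch below parses such a record and extracts the policy. It assumes the record text has already been fetched from DNS; `parse_dmarc` and `dmarc_policy` are hypothetical helper names, and treating an invalid record as "none" is a simplification.

```python
def parse_dmarc(record: str) -> dict[str, str]:
    """Split a DMARC TXT record into a tag -> value mapping."""
    tags: dict[str, str] = {}
    for part in record.split(";"):
        part = part.strip()
        if not part or "=" not in part:
            continue  # skip empty or malformed segments
        key, value = part.split("=", 1)
        tags[key.strip().lower()] = value.strip()
    return tags


def dmarc_policy(record: str) -> str:
    """Return the requested policy: 'none', 'quarantine', or 'reject'.

    Simplification: a record without v=DMARC1 is treated as no policy.
    """
    tags = parse_dmarc(record)
    if tags.get("v") != "DMARC1":
        return "none"
    return tags.get("p", "none").lower()


if __name__ == "__main__":
    record = "v=DMARC1; p=reject; rua=mailto:dmarc-reports@example.com"
    print(dmarc_policy(record))
```

An inbound gateway would combine this policy with the SPF and DKIM results to decide whether to accept, quarantine, or reject a message.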
High-Level Architecture
Sender (Gmail/Outlook/etc)
|
SMTP Relay (outbound MTA)
- SPF/DKIM signing
- DNS MX record lookup for recipient domain
|
SMTP Gateway (inbound MTA for techinterview.org)
- SPF/DKIM/DMARC validation
- Rate limiting, IP reputation check
|
Spam Filter (ML classifier)
|
Message Queue (Kafka)
|
Email Storage Service
- Object Store (S3): raw emails
- Metadata DB (Cassandra): headers, labels, read status
- Search Index (Elasticsearch)
|
IMAP Server -> User clients (Gmail app, Outlook, Apple Mail)
Email Storage Design
Raw Email (MIME)
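Store the raw MIME message in object storage (S3), compressed with gzip. Attachments are stored separately; the email body references each attachment by key. Deduplication: if 1M users receive the same mass email, store one copy and reference it from every inbox. Note that a {user_id}/{email_id} key would produce one object per recipient, so deduplicated blobs need a content-derived key (for example, a hash of the content), with each user's metadata row pointing at the shared blob.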
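The store-once, reference-many idea can be sketched as a content-addressed store with reference counts. This is an in-memory stand-in for the object store, not an S3 client; the class name and the SHA-256 key scheme are assumptions. In practice per-recipient headers differ, so deduplication is usually applied to bodies and attachments rather than to the whole message.

```python
import gzip
import hashlib


class DedupBlobStore:
    """In-memory stand-in for an object store keyed by content hash."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}   # content hash -> gzip-compressed bytes
        self._refs: dict[str, int] = {}      # content hash -> number of references

    def put(self, content: bytes) -> str:
        """Store content once and return its key; repeats only add a reference."""
        key = hashlib.sha256(content).hexdigest()
        if key not in self._blobs:
            self._blobs[key] = gzip.compress(content)
        self._refs[key] = self._refs.get(key, 0) + 1
        return key

    def get(self, key: str) -> bytes:
        return gzip.decompress(self._blobs[key])

    def release(self, key: str) -> None:
        """Drop one reference; delete the blob when nothing points at it."""
        self._refs[key] -= 1
        if self._refs[key] == 0:
            del self._refs[key]
            del self._blobs[key]

    def blob_count(self) -> int:
        return len(self._blobs)


if __name__ == "__main__":
    store = DedupBlobStore()
    newsletter = b"Subject: Weekly digest\r\n\r\nHello subscribers"
    keys = [store.put(newsletter) for _ in range(1000)]  # 1000 recipients
    print(store.blob_count(), len(set(keys)))
```

Each inbox entry keeps only the returned key, so deleting an email in one mailbox calls `release` and leaves the shared copy in place for everyone else.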
Metadata Database (Cassandra)
# Email metadata - optimized for mailbox queries
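emails_by_user: (user_id, folder, received_at DESC, email_id) -> (from, subject, snippet, is_read, has_attachment, label_ids)
# Primary key allows: list emails in a folder sorted by date (newest first)
# email_id in the key breaks ties between emails that share a timestamp
# Direct lookup by email_id: prefer a separate lookup table keyed by (user_id, email_id); Cassandra secondary indexes perform poorly on high-cardinality columns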
# Label table: many-to-many between emails and labels
Email Threading
Group related emails into conversations. Algorithm:
- Extract In-Reply-To and References headers
- If In-Reply-To matches the Message-ID of an existing email, add to that thread
- Otherwise check References header for ancestors
- If neither header matches, fall back to subject similarity (same subject after stripping Re:/Fwd: prefixes); if that also fails, create a new thread
- Thread ID stored per email; mailbox queries group by thread_id
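The header-based steps above can be sketched as follows. This is a minimal in-memory version that assumes the root message's Message-ID doubles as the thread ID and omits the subject-similarity fallback; the `Threader` class is illustrative, not a standard library API.

```python
from typing import Optional, Sequence


class Threader:
    """Assigns a thread ID to each message from its threading headers."""

    def __init__(self) -> None:
        self._thread_of: dict[str, str] = {}  # Message-ID -> thread ID

    def assign(
        self,
        message_id: str,
        in_reply_to: Optional[str] = None,
        references: Sequence[str] = (),
    ) -> str:
        # Check the direct parent first, then ancestors from nearest to oldest.
        # The References header lists ancestors oldest first, hence reversed().
        candidates = ([in_reply_to] if in_reply_to else []) + list(reversed(references))
        thread_id = None
        for candidate in candidates:
            if candidate in self._thread_of:
                thread_id = self._thread_of[candidate]
                break
        if thread_id is None:
            thread_id = message_id  # no known ancestor: start a new thread
        self._thread_of[message_id] = thread_id
        return thread_id


if __name__ == "__main__":
    t = Threader()
    print(t.assign("<a@example.com>"))
    print(t.assign("<b@example.com>", in_reply_to="<a@example.com>"))
```

Checking References after In-Reply-To matters when the direct parent was never delivered to this mailbox but an older ancestor was.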
Full-Text Search
Each user searches only their own mailbox, but the system as a whole indexes billions of emails. Elasticsearch, sharded by user:
- Index: sender, recipients, subject, body text, attachment names
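- User isolation: shard by user_id (range or hash routing) so a user's emails live together, and enforce a mandatory user_id filter on every query; a separate index per user would not scale to billions of users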
- Incremental indexing via Kafka consumer as emails arrive
- Query: parse search query, apply user_id filter, return email IDs, hydrate from metadata DB
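The query flow (tokenize, scope to the user, intersect postings, return IDs) can be illustrated with a tiny inverted index. This stands in for Elasticsearch and is not its API; the class name, the tokenizer, and the AND-only query semantics are assumptions made for the sketch.

```python
import re
from collections import defaultdict


class UserScopedIndex:
    """Toy inverted index where every posting list is keyed by user_id."""

    def __init__(self) -> None:
        # (user_id, term) -> set of email IDs containing that term
        self._postings: dict[tuple[str, str], set[str]] = defaultdict(set)

    @staticmethod
    def _tokens(text: str) -> set[str]:
        return set(re.findall(r"[a-z0-9]+", text.lower()))

    def index(self, user_id: str, email_id: str, sender: str, subject: str, body: str) -> None:
        for term in self._tokens(f"{sender} {subject} {body}"):
            self._postings[(user_id, term)].add(email_id)

    def search(self, user_id: str, query: str) -> set[str]:
        """Return IDs of this user's emails that contain every query term."""
        terms = self._tokens(query)
        if not terms:
            return set()
        lists = [self._postings.get((user_id, term), set()) for term in terms]
        return set.intersection(*lists)


if __name__ == "__main__":
    idx = UserScopedIndex()
    idx.index("u1", "e1", "alice@example.com", "Quarterly report", "numbers attached")
    idx.index("u2", "e2", "bob@example.com", "Quarterly report", "confidential")
    print(idx.search("u1", "quarterly report"))
```

Because the user ID is part of every posting key, a query can never return another user's email; the returned IDs would then be hydrated from the metadata DB.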
Spam and Phishing Detection
Multi-layer defense:
- DNS checks: SPF (sender IP authorized?), DKIM (signature valid?), DMARC (domain policy?)
- IP reputation: block known spam IPs (public RBL lists + internal reputation)
- Content filtering: URL blacklists, known phishing signatures
- ML classifier: Naive Bayes or gradient boosting on email features (sender history, content patterns, header anomalies). Trained on labeled spam/ham corpus.
- User feedback loop: “Report spam” trains the model per-user and globally
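As an illustration of the ML layer, here is a minimal word-count Naive Bayes classifier with add-one smoothing. It is a sketch only: real filters use many more features (sender history, header anomalies) and far larger corpora, and the class name and tokenizer are assumptions.

```python
import math
import re
from collections import Counter


class NaiveBayesSpamFilter:
    """Multinomial Naive Bayes over word counts, two classes: spam and ham."""

    def __init__(self) -> None:
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = {"spam": 0, "ham": 0}

    @staticmethod
    def _tokens(text: str) -> list[str]:
        return re.findall(r"[a-z0-9]+", text.lower())

    def train(self, text: str, label: str) -> None:
        """label must be 'spam' or 'ham'; also used for 'Report spam' feedback."""
        self.doc_counts[label] += 1
        self.word_counts[label].update(self._tokens(text))

    def _log_score(self, tokens: list[str], label: str) -> float:
        vocab_size = len(set(self.word_counts["spam"]) | set(self.word_counts["ham"]))
        total_words = sum(self.word_counts[label].values())
        total_docs = sum(self.doc_counts.values())
        # Smoothed class prior, so an untrained class never yields log(0).
        score = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
        for tok in tokens:
            # Add-one smoothing, with one extra slot reserved for unseen words.
            count = self.word_counts[label][tok]
            score += math.log((count + 1) / (total_words + vocab_size + 1))
        return score

    def is_spam(self, text: str) -> bool:
        tokens = self._tokens(text)
        return self._log_score(tokens, "spam") > self._log_score(tokens, "ham")


if __name__ == "__main__":
    nb = NaiveBayesSpamFilter()
    nb.train("win free money now", "spam")
    nb.train("meeting agenda for monday", "ham")
    print(nb.is_spam("free money"))
```

The user feedback loop maps onto `train`: each "Report spam" click becomes one more labeled example, applied to a per-user model, a global model, or both.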
Attachment Handling
- Attachments stored in S3 separately from email body
- Virus scan (ClamAV or commercial) before storing
- Preview generation: PDF thumbnail, image resize
- Maximum size: 25MB per email (Gmail limit)
- Google Drive link substitution for large files: “too large to attach” → share link
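The size limit and link substitution can be sketched as a simple budgeting step at send time. This assumes 25MB means 25 * 1024 * 1024 bytes and ignores MIME encoding overhead (base64 inflates attachments by about a third); `plan_attachments` is a hypothetical helper.

```python
MAX_EMAIL_BYTES = 25 * 1024 * 1024  # assumption: 25MB interpreted as 25 MiB


def plan_attachments(
    body_bytes: int, attachments: list[tuple[str, int]]
) -> tuple[list[str], list[str]]:
    """Split attachments into those sent inline and those sent as share links.

    attachments is a list of (filename, size_in_bytes) in the order the
    user added them. Files that would push the email over the limit are
    uploaded to shared storage and replaced by a link instead.
    """
    used = body_bytes
    inline: list[str] = []
    as_links: list[str] = []
    for name, size in attachments:
        if used + size <= MAX_EMAIL_BYTES:
            inline.append(name)
            used += size
        else:
            as_links.append(name)
    return inline, as_links


if __name__ == "__main__":
    mb = 1024 * 1024
    files = [("slides.pdf", 10 * mb), ("demo.mp4", 40 * mb), ("logo.png", 5 * mb)]
    print(plan_attachments(1_000, files))
```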
Delivery and Notifications
- Push notifications: FCM/APNs for mobile apps on new email
- IMAP IDLE: server pushes new email event to connected IMAP clients
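- Delivery and open tracking: delivery status comes from the receiving server's SMTP response or a later bounce; opens can be estimated with a 1px tracking pixel, which is unreliable when clients block or proxy remote images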
- Bounce handling: SMTP 5xx = permanent failure (invalid address); 4xx = temporary (retry)
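The bounce rule maps directly onto the first digit of the SMTP reply code. The sketch below classifies a reply and computes a retry delay for temporary failures; the exponential backoff schedule (60s base, 4h cap) is an assumption, not something the protocol mandates, and the function names are illustrative.

```python
def classify_smtp_reply(code: int) -> str:
    """Map an SMTP reply code to a delivery outcome."""
    if 200 <= code < 300:
        return "delivered"   # 2xx: accepted by the receiving server
    if 400 <= code < 500:
        return "retry"       # 4xx: temporary failure, try again later
    if 500 <= code < 600:
        return "bounce"      # 5xx: permanent failure, notify the sender
    raise ValueError(f"unexpected SMTP reply code: {code}")


def next_retry_delay(attempt: int, base_seconds: int = 60, cap_seconds: int = 4 * 3600) -> int:
    """Exponential backoff: base * 2^attempt, capped (attempt starts at 0)."""
    return min(base_seconds * 2**attempt, cap_seconds)


if __name__ == "__main__":
    for code in (250, 451, 550):
        print(code, classify_smtp_reply(code))
    print([next_retry_delay(n) for n in range(5)])
```

A delivery worker would requeue "retry" messages with the computed delay and give up after a maximum age, converting the message into a bounce.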
Interview Tips
- Know SMTP/IMAP protocols and when each is used
- Discuss email deduplication for mass emails (store once, reference many)
- Explain threading via In-Reply-To and References headers
- Mention SPF/DKIM/DMARC as anti-spam foundation
- Describe Cassandra schema for mailbox queries (user + folder + date)
- Elasticsearch for full-text search with user isolation