System Design Interview: Design an Email System (Gmail)

Designing an email system touches distributed storage, message queuing, full-text search, spam filtering, and protocol handling (SMTP, IMAP, POP3). Variants of this question are asked at Google, Microsoft, and Yahoo.

Requirements Clarification

Functional Requirements

  • Send and receive emails (with attachments up to 25MB)
  • Organize emails: inbox, sent, drafts, spam, labels/folders
  • Search emails by sender, subject, body content, date
  • Threading: group related emails into conversations
  • Spam and phishing detection
  • Notifications for new emails

Non-Functional Requirements

  • Scale: 2B users (Gmail scale), 300B emails/day
  • Storage: 15GB free per user, 2B users = 30EB total
  • Latency: email delivery <1s within same domain, <5s cross-domain
  • Availability: 99.9% (email is not real-time; brief delay acceptable)
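
The numbers above are easier to reason about once converted into rates and budgets. The following is a back-of-envelope sketch only: it uses the figures from the requirements as given, assumes decimal units (1 GB = 10^9 bytes), and the variable names are illustrative.

```python
# Back-of-envelope capacity numbers derived from the stated requirements.
USERS = 2_000_000_000               # 2B users
EMAILS_PER_DAY = 300_000_000_000    # 300B emails/day
QUOTA_BYTES = 15 * 10**9            # 15 GB free quota per user (decimal GB, an assumption)
SECONDS_PER_DAY = 24 * 60 * 60

avg_emails_per_sec = EMAILS_PER_DAY / SECONDS_PER_DAY
emails_per_user_per_day = EMAILS_PER_DAY / USERS
max_storage_exabytes = USERS * QUOTA_BYTES / 10**18

# 99.9% availability expressed as allowed downtime per year.
downtime_hours_per_year = (1 - 0.999) * 365 * 24

print(f"average ingest rate: {avg_emails_per_sec:,.0f} emails/s")
print(f"emails per user per day: {emails_per_user_per_day:.0f}")
print(f"storage if every quota is full: {max_storage_exabytes:.0f} EB")
print(f"downtime budget: {downtime_hours_per_year:.2f} h/year")
```

The 30EB figure is an upper bound that assumes every user fills the free quota; the average ingest rate of roughly 3.5M emails/s ignores peak-hour multipliers.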

Email Protocols

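  • SMTP (Simple Mail Transfer Protocol): sending email from clients to servers and between servers. Port 25 (server-to-server relay), 587 (client submission with STARTTLS), 465 (submission over implicit TLS). Used for outbound delivery.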
  • IMAP (Internet Message Access Protocol): client accesses mailbox on server. Email stays on server, synced across devices. Modern standard.
  • POP3 (Post Office Protocol): downloads email to a local client and, by default, deletes it from the server. Legacy, mostly replaced by IMAP.
  • SPF/DKIM/DMARC: authentication standards to prevent email spoofing and enable spam filtering.
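
To make the protocol roles concrete, here is a minimal sketch that builds an RFC 5322 message with Python's standard library and shows, in comments only, what submission over port 587 could look like. The addresses, the host `smtp.example.com`, and the helper name `build_message` are placeholders for illustration, not part of any real deployment.

```python
from email.message import EmailMessage
from email.utils import formatdate, make_msgid

def build_message(sender: str, recipient: str, subject: str, body: str) -> EmailMessage:
    """Build a plain-text message with the headers a mail server expects."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg["Date"] = formatdate(localtime=False)
    # Message-ID is what In-Reply-To and References headers point at later.
    msg["Message-ID"] = make_msgid(domain="example.com")
    msg.set_content(body)
    return msg

msg = build_message("alice@example.com", "bob@example.org", "Hello", "Hi Bob")
raw = msg.as_bytes()  # the bytes that travel over SMTP and get stored

# Submission would look roughly like this (not run here; placeholders throughout):
#   import smtplib
#   with smtplib.SMTP("smtp.example.com", 587) as client:
#       client.starttls()
#       client.login("alice@example.com", "app-password")
#       client.send_message(msg)
```

Reading the message back is a separate concern handled by IMAP or POP3; SMTP only moves it toward the recipient's server.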

High-Level Architecture

Sender (Gmail/Outlook/etc)
  |
SMTP Relay (outbound MTA)
  - DKIM signing (SPF is a DNS record published by the sending domain, not a signature)
  - DNS MX record lookup for recipient domain
  |
SMTP Gateway (inbound MTA for techinterview.org)
  - SPF/DKIM/DMARC validation
  - Rate limiting, IP reputation check
  |
Spam Filter (ML classifier)
  |
Message Queue (Kafka)
  |
Email Storage Service -> Object Store (S3) for raw emails
                      -> Metadata DB (Cassandra): headers, labels, read status
                      -> Search Index (Elasticsearch)
  |
IMAP Server -> User clients (Gmail app, Outlook, Apple Mail)
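
The fan-out at the bottom of the diagram, where one queued message is written to three stores, can be sketched with in-memory stand-ins. This is an illustration under stated assumptions: `queue.Queue` stands in for Kafka, plain dictionaries stand in for S3, Cassandra, and Elasticsearch, and the message fields are hypothetical.

```python
from queue import Queue

object_store: dict[str, bytes] = {}      # stands in for S3 (raw MIME)
metadata_db: dict[str, dict] = {}        # stands in for Cassandra (headers, flags)
search_index: dict[str, set[str]] = {}   # stands in for Elasticsearch: token -> email keys

def consume(inbound: Queue) -> None:
    """Drain the queue and write each email to all three stores."""
    while not inbound.empty():
        email = inbound.get()
        key = f'{email["user_id"]}/{email["email_id"]}'
        object_store[key] = email["raw"]
        metadata_db[key] = {"subject": email["subject"], "is_read": False}
        for token in email["subject"].lower().split():
            search_index.setdefault(token, set()).add(key)

inbound = Queue()
inbound.put({"user_id": "u1", "email_id": "e1",
             "subject": "Quarterly report", "raw": b"raw MIME bytes"})
consume(inbound)
```

A real consumer would also need idempotent writes, since a queue with at-least-once delivery can hand over the same message twice.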

Email Storage Design

Raw Email (MIME)

Store the raw MIME message in object storage (S3), compressed with gzip. Key: {user_id}/{email_id}. Attachments are stored separately; the email body references each attachment by key. Deduplication: if 1M users receive the same mass email, store one copy (for example, keyed by a content hash) and reference it from all inboxes.
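
One way to get "store once, reference many" is content-addressed storage. The sketch below is an assumption about how that could work, not a description of Gmail's implementation: dictionaries stand in for S3, the blob key is a SHA-256 digest of the content, and `store_email` and `load_email` are hypothetical helper names.

```python
import gzip
import hashlib

blobs: dict[str, bytes] = {}        # content hash -> gzip-compressed MIME
mailbox_refs: dict[str, str] = {}   # "{user_id}/{email_id}" -> content hash

def store_email(user_id: str, email_id: str, raw_mime: bytes) -> str:
    """Store the content once and point this mailbox entry at it."""
    digest = hashlib.sha256(raw_mime).hexdigest()
    if digest not in blobs:                      # first copy: compress and store
        blobs[digest] = gzip.compress(raw_mime)
    mailbox_refs[f"{user_id}/{email_id}"] = digest
    return digest

def load_email(user_id: str, email_id: str) -> bytes:
    """Follow the mailbox reference and decompress the shared blob."""
    return gzip.decompress(blobs[mailbox_refs[f"{user_id}/{email_id}"]])

newsletter = b"Subject: Weekly digest\r\n\r\nSame body for everyone."
store_email("u1", "e1", newsletter)
store_email("u2", "e7", newsletter)
```

Because delivery adds per-recipient headers, identical bytes across inboxes are rare in practice; hashing the body and attachments separately from the headers would likely deduplicate better.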

Metadata Database (Cassandra)

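# Email metadata, optimized for mailbox queries
# Partition key: (user_id, folder); clustering columns: received_at DESC, email_id
# email_id is part of the key so two emails with the same timestamp do not overwrite each other
emails_by_user: ((user_id, folder), received_at DESC, email_id) -> (from, subject, snippet, is_read, has_attachment, label_ids)

# Primary key allows: list emails in a folder sorted by date, newest first
# Direct lookup by email_id: a separate lookup table usually works better than a Cassandra secondary index on a high-cardinality column
# Label table: many-to-many between emails and labels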
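
To show why this key layout serves the mailbox view, the following sketch imitates it in memory: one partition per (user_id, folder), with rows kept newest first. It is only a model of the access pattern, with hypothetical function names, and says nothing about how Cassandra stores data internally.

```python
from collections import defaultdict
from datetime import datetime

# Partition key (user_id, folder) -> rows of (received_at, email_id, columns), newest first.
partitions: dict[tuple[str, str], list[tuple[datetime, str, dict]]] = defaultdict(list)

def insert_email(user_id: str, folder: str, received_at: datetime,
                 email_id: str, columns: dict) -> None:
    rows = partitions[(user_id, folder)]
    rows.append((received_at, email_id, columns))
    # Keep clustering order: received_at DESC, email_id as tiebreaker.
    rows.sort(key=lambda row: (row[0], row[1]), reverse=True)

def list_folder(user_id: str, folder: str, limit: int = 50) -> list[tuple[datetime, str, dict]]:
    """Read one partition and return the first page, already in display order."""
    return partitions[(user_id, folder)][:limit]

insert_email("u1", "inbox", datetime(2024, 1, 1, 9, 0), "e1", {"subject": "first"})
insert_email("u1", "inbox", datetime(2024, 1, 2, 9, 0), "e2", {"subject": "second"})
insert_email("u1", "sent", datetime(2024, 1, 3, 9, 0), "e3", {"subject": "reply"})
newest = list_folder("u1", "inbox", limit=1)
```

Listing a folder touches a single partition and needs no sorting at read time, which is the property the schema is designed around.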

Email Threading

Group related emails into conversations. Algorithm:

  • Extract In-Reply-To and References headers
  • If In-Reply-To matches the Message-ID of an existing email, add to that thread
  • Otherwise check References header for ancestors
  • If no match, create new thread (use subject similarity as fallback)
  • Thread ID stored per email; mailbox queries group by thread_id
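
The steps above can be sketched as a lookup keyed on Message-ID headers. This is a minimal illustration: the in-memory dictionary and the function name `assign_thread` are hypothetical, and the subject-similarity fallback is left out, so an unmatched email always starts a new thread.

```python
import uuid
from typing import Optional

thread_by_message_id: dict[str, str] = {}  # Message-ID -> thread_id

def assign_thread(message_id: str, in_reply_to: Optional[str],
                  references: list[str]) -> str:
    """Attach the email to its parent's thread, or start a new thread."""
    # Direct parent first, then ancestors from nearest to oldest.
    candidates = ([in_reply_to] if in_reply_to else []) + list(reversed(references))
    for parent in candidates:
        if parent in thread_by_message_id:
            thread_id = thread_by_message_id[parent]
            break
    else:
        thread_id = uuid.uuid4().hex  # no known ancestor: new conversation
    thread_by_message_id[message_id] = thread_id
    return thread_id

t1 = assign_thread("<a@example.com>", None, [])
t2 = assign_thread("<b@example.com>", "<a@example.com>", ["<a@example.com>"])
# Parent was never received, but an older ancestor in References is known.
t3 = assign_thread("<c@example.com>", "<missing@example.com>",
                   ["<a@example.com>", "<missing@example.com>"])
t4 = assign_thread("<d@example.com>", None, [])
```

Here `t1`, `t2`, and `t3` share one thread ID while `t4` gets its own.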

Full-Text Search

Each user searches only their own mailbox, but the system as a whole indexes billions of emails. Elasticsearch with indices sharded by user:

  • Index: sender, recipients, subject, body text, attachment names
  • User-isolated data: shard by user_id so each mailbox lives on one shard, and filter every query by user_id for security isolation
  • Incremental indexing via Kafka consumer as emails arrive
  • Query: parse search query, apply user_id filter, return email IDs, hydrate from metadata DB
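
The query step can be sketched as a small parser that turns a search string into an Elasticsearch bool query with a mandatory user_id filter. The operator syntax (`from:`, `to:`, `subject:`), the index field names, and the function `build_query` are assumptions for illustration; the code only builds the request body and does not contact a cluster.

```python
# Search operators mapped to (assumed) index field names.
FIELDS = {"from": "sender", "to": "recipients", "subject": "subject"}

def build_query(user_id: str, text: str) -> dict:
    """Translate 'from:alice budget report' into a bool query body."""
    # The user_id filter is always present so results never cross mailboxes.
    filters: list[dict] = [{"term": {"user_id": user_id}}]
    free_text: list[str] = []
    for token in text.split():
        prefix, sep, value = token.partition(":")
        if sep and prefix in FIELDS and value:
            filters.append({"match": {FIELDS[prefix]: value}})
        else:
            free_text.append(token)
    must: list[dict] = []
    if free_text:
        must.append({"multi_match": {"query": " ".join(free_text),
                                     "fields": ["subject", "body"]}})
    # Return only IDs; full rows are hydrated from the metadata DB afterwards.
    return {"query": {"bool": {"filter": filters, "must": must}},
            "_source": ["email_id"]}

query = build_query("u1", "from:alice budget report")
```

Putting user_id in the filter clause rather than in the scored part of the query keeps it cacheable and guarantees it cannot be outranked by a text match.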

Spam and Phishing Detection

Multi-layer defense:

  1. DNS checks: SPF (sender IP authorized?), DKIM (signature valid?), DMARC (domain policy?)
  2. IP reputation: block known spam IPs (public RBL lists + internal reputation)
  3. Content filtering: URL blacklists, known phishing signatures
  4. ML classifier: Naive Bayes or gradient boosting on email features (sender history, content patterns, header anomalies). Trained on labeled spam/ham corpus.
  5. User feedback loop: “Report spam” trains the model per-user and globally
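
As an illustration of layer 4, here is a toy Naive Bayes classifier over word counts with add-one (Laplace) smoothing. It is a sketch under simplifying assumptions: whitespace tokenization, content features only, at least one training example per class, and a hypothetical class name. A production filter would combine many more signals.

```python
import math
from collections import Counter

class NaiveBayesSpamFilter:
    def __init__(self) -> None:
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = {"spam": 0, "ham": 0}

    def train(self, text: str, label: str) -> None:
        """Add one labeled example; this is also how 'Report spam' feedback could be fed in."""
        self.doc_counts[label] += 1
        self.word_counts[label].update(text.lower().split())

    def spam_log_odds(self, text: str) -> float:
        """log P(spam | text) - log P(ham | text), up to a shared constant."""
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_docs = sum(self.doc_counts.values())
        score = {}
        for label in ("spam", "ham"):
            total_words = sum(self.word_counts[label].values())
            logp = math.log(self.doc_counts[label] / total_docs)  # class prior
            for word in text.lower().split():
                count = self.word_counts[label][word]
                # Add-one smoothing so unseen words do not zero out the product.
                logp += math.log((count + 1) / (total_words + len(vocab)))
            score[label] = logp
        return score["spam"] - score["ham"]

    def is_spam(self, text: str) -> bool:
        return self.spam_log_odds(text) > 0

spam_filter = NaiveBayesSpamFilter()
spam_filter.train("win free money now", "spam")
spam_filter.train("free prize click now", "spam")
spam_filter.train("meeting agenda for tomorrow", "ham")
spam_filter.train("lunch tomorrow with the team", "ham")
```

With this tiny corpus, "free money" scores as spam and "agenda for tomorrow" as ham; the decision threshold of 0 would normally be tuned to keep false positives low.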

Attachment Handling

  • Attachments stored in S3 separately from email body
  • Virus scan (ClamAV or commercial) before storing
  • Preview generation: PDF thumbnail, image resize
  • Maximum size: 25MB per email (Gmail limit)
  • Google Drive link substitution for large files: “too large to attach” → share link
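
The size limit and link substitution can be sketched as a simple planning step that decides which files travel inside the message and which become share links. The function name, the greedy first-fit policy, and the use of 25 × 1024 × 1024 bytes for 25MB are assumptions; virus scanning and preview generation are out of scope here.

```python
MAX_EMAIL_BYTES = 25 * 1024 * 1024  # 25MB limit per email (binary MB assumed)

def plan_attachments(body_bytes: int, attachments: list[tuple[str, int]]) -> dict:
    """Attach files while they fit under the limit; replace the rest with share links."""
    inline: list[str] = []
    share_links: list[str] = []
    used = body_bytes
    for name, size in attachments:
        if used + size <= MAX_EMAIL_BYTES:
            inline.append(name)
            used += size
        else:
            share_links.append(name)  # too large to attach: upload and link instead
    return {"inline": inline, "share_links": share_links, "bytes_used": used}

plan = plan_attachments(200_000, [
    ("report.pdf", 5 * 1024 * 1024),
    ("video.mp4", 40 * 1024 * 1024),
    ("photo.jpg", 2 * 1024 * 1024),
])
```

In this example the PDF and the photo are attached and the video becomes a share link.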

Delivery and Notifications

  • Push notifications: FCM/APNs for mobile apps on new email
  • IMAP IDLE: server pushes new email event to connected IMAP clients
  • Delivery receipts: track whether an email was delivered and, via a 1px tracking pixel, whether it was opened (open tracking is unreliable when clients block or proxy images)
  • Bounce handling: SMTP 5xx = permanent failure (invalid address); 4xx = temporary (retry)
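
The bounce rule above maps directly onto SMTP reply-code classes. The sketch below classifies a reply code and generates an exponential backoff schedule for temporary failures; the function names and the delay parameters (60s base, doubling, one-hour cap) are illustrative assumptions rather than values from any standard.

```python
def classify_reply(code: int) -> str:
    """Map an SMTP reply code to a delivery decision."""
    if 200 <= code < 300:
        return "delivered"
    if 400 <= code < 500:
        return "retry"    # temporary failure: keep the message queued
    if 500 <= code < 600:
        return "bounce"   # permanent failure: notify the sender
    return "unknown"

def retry_delays(base_seconds: int = 60, factor: int = 2,
                 max_attempts: int = 5, cap_seconds: int = 3600) -> list[int]:
    """Exponential backoff schedule, capped so retries never wait too long."""
    return [min(base_seconds * factor**attempt, cap_seconds)
            for attempt in range(max_attempts)]

print(classify_reply(250))  # delivered
print(classify_reply(451))  # retry
print(classify_reply(550))  # bounce
print(retry_delays())       # [60, 120, 240, 480, 960]
```

After the final retry fails, the sending server would typically give up and generate a bounce message to the original sender.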

Interview Tips

  • Know SMTP/IMAP protocols and when each is used
  • Discuss email deduplication for mass emails (store once, reference many)
  • Explain threading via In-Reply-To and References headers
  • Mention SPF/DKIM/DMARC as anti-spam foundation
  • Describe Cassandra schema for mailbox queries (user + folder + date)
  • Elasticsearch for full-text search with user isolation
