System Design Interview: Email Service at Scale (SendGrid/Gmail)

Designing a transactional email service (like SendGrid or AWS SES) or an email client (like Gmail) draws on core distributed systems topics: message queuing, deliverability, inbox storage, and search. Both variants appear in senior engineering interviews.

Variant A: Transactional Email Sending Service (SendGrid-like)

Architecture

Application → Email API → Queue → SMTP Sender Pool → Internet MTAs

Components:
  Email API:    REST endpoint, validates, enqueues
  Queue:        Kafka topics partitioned by priority
  Sender Pool:  Workers that connect to recipient MTAs via SMTP
  Bounce Handler: Processes delivery failures
  Analytics:   Open tracking, click tracking

Queue Design by Priority

Kafka topics:
  email.transactional   → password resets, purchase receipts (< 5s)
  email.marketing       → newsletters, promotions (< 1hr acceptable)
  email.bulk            → cold outreach, low priority (hours)

Partitioning: hash(sender_domain) → consistent sending from same IP range
Consumer groups: dedicated pool per topic tier
  Transactional: 100 workers, auto-scale
  Marketing:     50 workers, batch
  Bulk:          20 workers, throttled to avoid blacklisting
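The routing and partitioning rules above can be sketched in a few lines. This is a minimal illustration, not a Kafka client: the category names used as dictionary keys and the fallback to the bulk topic are assumptions, and the explicit SHA-256 hash only stands in for whatever key hashing the producer library applies when sender_domain is passed as the message key.

```python
import hashlib

# Topic names come from the tiers above; the category keys are hypothetical.
TOPIC_BY_CATEGORY = {
    "transactional": "email.transactional",
    "marketing": "email.marketing",
    "bulk": "email.bulk",
}


def topic_for(category: str) -> str:
    """Pick the topic for a message category; unknown categories fall back to bulk (assumption)."""
    return TOPIC_BY_CATEGORY.get(category, "email.bulk")


def partition_for(sender_domain: str, num_partitions: int) -> int:
    """Map a sender domain to a stable partition so one domain is always handled by the same consumers."""
    digest = hashlib.sha256(sender_domain.lower().encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions
```

Because the partition depends only on the lowercased domain, all mail for one sender flows through the same worker and therefore the same sending IP range.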

Deliverability: The Hard Part

Email deliverability requirements:
  SPF:   TXT record listing authorized sending IPs
    "v=spf1 ip4:203.0.113.0/24 include:sendgrid.net ~all"

  DKIM:  Cryptographic signature over selected headers and a hash of the body
    Private key signs the message → recipient verifies via public key published in DNS
    Prevents tampering and spoofing of addresses such as "From: security@yourbank.com"

  DMARC: Policy for SPF/DKIM failures
    "v=DMARC1; p=quarantine; rua=mailto:dmarc@example.com"
    p=none: monitor only; p=quarantine: spam folder; p=reject: block

  IP reputation:
    New sending IPs → warm up gradually (100 → 1K → 10K → 100K/day)
    Monitor bounce rate (< 2%), spam complaint rate (< 0.1%)
    Dedicated IPs for transactional (protect from marketing spam complaints)

  Suppression list:
    Maintain list of unsubscribed / hard-bounced / spam-complained emails
    Never send to suppression list — automatic filtering before queue entry
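To make the DMARC policy semantics concrete, here is a small sketch of how a receiver might parse the record shown above and pick a disposition. It assumes the SPF and DKIM checks (including domain alignment) have already been computed elsewhere and are passed in as booleans; the function names and the returned strings are hypothetical.

```python
def parse_dmarc(record: str) -> dict:
    """Parse a DMARC TXT record ("tag=value; tag=value") into a dict of tags."""
    tags = {}
    for part in record.split(";"):
        part = part.strip()
        if not part or "=" not in part:
            continue
        key, value = part.split("=", 1)
        tags[key.strip().lower()] = value.strip()
    return tags


def dmarc_disposition(record: str, spf_aligned_pass: bool, dkim_aligned_pass: bool) -> str:
    """Return what a receiver would do with a message under the published policy."""
    # DMARC passes if either mechanism passes with an aligned domain.
    if spf_aligned_pass or dkim_aligned_pass:
        return "deliver"
    policy = parse_dmarc(record).get("p", "none").lower()
    actions = {"none": "deliver", "quarantine": "spam_folder", "reject": "reject"}
    # Unknown policy values are treated as monitor-only in this sketch (assumption).
    return actions.get(policy, "deliver")
```

With the example record, a message failing both checks lands in the spam folder, while p=none would still deliver it and only report the failure.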

Bounce Handling

Soft bounce: temporary failure (mailbox full, server down)
  → Retry with increasing backoff: 5min, 30min, 2hr, 6hr, 24hr
  → Discard after 72 hours without success

Hard bounce: permanent failure (user doesn't exist, domain invalid)
  → Add to suppression list immediately
  → Never retry: continuing to send to hard-bounced addresses leads to blacklisting

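Feedback loops (FBL):
  Yahoo and Outlook.com forward individual spam complaints via FBL;
  Gmail reports only aggregate complaint rates (Postmaster Tools)
  → Remove the complaining address from the list, add to suppression
  → Track complaint rate by campaign/sender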
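The bounce rules above translate into a small decision function. The sketch below is a simplification: it treats any 4xx SMTP reply as a soft bounce and any 5xx as a hard bounce, whereas some providers report a full mailbox with a 5xx code, so production handlers usually also inspect enhanced status codes. All function names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional, Set

# Retry schedule and cutoff taken from the text above.
RETRY_DELAYS = [
    timedelta(minutes=5),
    timedelta(minutes=30),
    timedelta(hours=2),
    timedelta(hours=6),
    timedelta(hours=24),
]
MAX_AGE = timedelta(hours=72)


def classify_bounce(smtp_code: int) -> str:
    """Coarse classification by SMTP reply class (simplifying assumption)."""
    if 500 <= smtp_code <= 599:
        return "hard"
    if 400 <= smtp_code <= 499:
        return "soft"
    return "none"


def next_retry_at(attempt: int, first_attempt_at: datetime, now: datetime) -> Optional[datetime]:
    """attempt is the number of failed attempts so far (1-based); None means give up."""
    if attempt > len(RETRY_DELAYS):
        return None
    retry_at = now + RETRY_DELAYS[attempt - 1]
    if retry_at - first_attempt_at > MAX_AGE:
        return None
    return retry_at


def handle_bounce(recipient: str, smtp_code: int, attempt: int,
                  first_attempt_at: datetime, now: datetime,
                  suppression: Set[str]) -> Optional[datetime]:
    """Suppress hard bounces immediately; schedule a retry for soft bounces."""
    kind = classify_bounce(smtp_code)
    if kind == "hard":
        suppression.add(recipient.lower())
        return None
    if kind == "soft":
        return next_retry_at(attempt, first_attempt_at, now)
    return None
```

A return value of None means the message is not retried, either because the address was suppressed or because the retry budget is exhausted.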

Open/Click Tracking

Open tracking: embed 1x1 pixel image
  <img src="https://track.sendgrid.com/open/{encoded_email_id}" width="1" height="1">
  When client loads image → tracking server logs open event
  Limitation: Apple Mail Privacy Protection prefetches images, so opens are over-counted

Click tracking: rewrite all links
  Original: https://example.com/product/123
  Rewritten: https://click.sendgrid.com/{encoded_link}?eid={email_id}
  On click: redirect to original, log click event

Events: {email_id, event_type, timestamp, user_agent, ip}
  → Kafka → Flink → ClickHouse (analytics) + Redis (real-time dashboard)
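Link rewriting for click tracking can be sketched as follows. Two things here are assumptions rather than part of the design above: the host click.example.com and the HMAC signature, which is added so the redirect endpoint only forwards to URLs the service itself generated and cannot be abused as an open redirect. The secret and helper names are hypothetical.

```python
import base64
import hashlib
import hmac
from typing import Optional
from urllib.parse import quote

SECRET = b"demo-secret"                    # hypothetical signing key
CLICK_HOST = "https://click.example.com"   # hypothetical tracking host


def _sign(encoded: str, email_id: str) -> str:
    """Short HMAC tag binding the encoded link to the email id."""
    payload = f"{encoded}|{email_id}".encode("utf-8")
    return hmac.new(SECRET, payload, hashlib.sha256).hexdigest()[:16]


def rewrite_link(original_url: str, email_id: str) -> str:
    """Replace a link in the email body with a tracking URL."""
    encoded = base64.urlsafe_b64encode(original_url.encode("utf-8")).decode("ascii")
    return f"{CLICK_HOST}/{encoded}?eid={quote(email_id)}&sig={_sign(encoded, email_id)}"


def resolve_click(encoded: str, email_id: str, sig: str) -> Optional[str]:
    """Return the original URL to redirect to, or None if the signature does not match."""
    if not hmac.compare_digest(_sign(encoded, email_id), sig):
        return None
    return base64.urlsafe_b64decode(encoded.encode("ascii")).decode("utf-8")
```

The click handler would log the event {email_id, event_type, timestamp, user_agent, ip} before issuing the redirect returned by resolve_click.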

Variant B: Email Client / Inbox (Gmail-like)

Storage Model

Core entities:
  Users:    {user_id, email, quota_bytes}
  Threads:  {thread_id, subject, participants[], created_at}
  Messages: {message_id, thread_id, from, to[], cc[], body, size_bytes}
  Labels:   {label_id, user_id, name} -- INBOX, SENT, SPAM, custom
  MessageLabels: {message_id, user_id, label_id}

Storage design:
  Message bodies: object storage (S3 / GCS)
    Key: messages/{user_id}/{message_id}
    Content-Type: message/rfc822

  Metadata: relational DB (PostgreSQL / Spanner for global)
    Hot path: threads + messages for current user (< 10K rows typical)

  Attachments: separate object storage with CDN
    Key: attachments/{attachment_id}/{filename}
    Quota: 15GB per user (Gmail) → track per-user bytes in DB
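The key layout and quota tracking above might look like this in application code. It is a sketch under assumptions: the used_bytes field is not in the entity list and is introduced here for illustration, and in a real database the check-and-increment would be a single atomic conditional update rather than two Python statements.

```python
from dataclasses import dataclass

DEFAULT_QUOTA_BYTES = 15 * 1024**3  # the 15GB figure from the text, binary gigabytes assumed


@dataclass
class User:
    user_id: str
    quota_bytes: int = DEFAULT_QUOTA_BYTES
    used_bytes: int = 0  # hypothetical running total kept alongside the quota


def message_key(user_id: str, message_id: str) -> str:
    """Object-storage key for a raw message body."""
    return f"messages/{user_id}/{message_id}"


def attachment_key(attachment_id: str, filename: str) -> str:
    """Object-storage key for an attachment."""
    return f"attachments/{attachment_id}/{filename}"


def reserve_quota(user: User, size_bytes: int) -> bool:
    """Reserve space before uploading; refuse the write if it would exceed the quota."""
    if user.used_bytes + size_bytes > user.quota_bytes:
        return False
    user.used_bytes += size_bytes
    return True
```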

Inbox Loading: Performance

GET /inbox (most common operation — must be fast):
  Query: threads with INBOX label, ordered by last_message_time DESC, LIMIT 50

  Without optimization:
    SELECT t.*, m.* FROM threads t
    JOIN messages m ON m.thread_id = t.thread_id AND m.message_id = (
      SELECT message_id FROM messages WHERE thread_id = t.thread_id ORDER BY ts DESC LIMIT 1
    )
    JOIN thread_labels tl ON tl.thread_id = t.thread_id AND tl.label_id = 'INBOX'
    WHERE tl.user_id = ?
    ORDER BY m.ts DESC LIMIT 50
    → Slow: correlated subquery per thread plus two joins

  Optimized with denormalization:
    threads table includes: snippet, last_message_ts, unread_count, participants_json
    → Single indexed read on the threads table, no joins for inbox listing
    → Update denormalized fields on each new message (async worker)
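The denormalized layout can be tried out with an in-memory SQLite database. This is a sketch under assumptions: the in_inbox flag replaces the label join for brevity, the 80-character snippet length is arbitrary, and the update runs inline rather than in an async worker.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE threads (
    thread_id         INTEGER PRIMARY KEY,
    user_id           INTEGER NOT NULL,
    in_inbox          INTEGER NOT NULL DEFAULT 1,   -- hypothetical flag instead of a label join
    subject           TEXT,
    snippet           TEXT,
    last_message_ts   INTEGER NOT NULL,
    unread_count      INTEGER NOT NULL DEFAULT 0,
    participants_json TEXT
);
CREATE INDEX idx_inbox ON threads (user_id, in_inbox, last_message_ts DESC);
""")


def on_new_message(conn, thread_id, user_id, subject, body, ts, sender):
    """Refresh the denormalized fields of a thread when a message arrives."""
    snippet = body[:80]
    row = conn.execute(
        "SELECT participants_json FROM threads WHERE thread_id = ?", (thread_id,)
    ).fetchone()
    if row is None:
        conn.execute(
            "INSERT INTO threads (thread_id, user_id, subject, snippet, last_message_ts,"
            " unread_count, participants_json) VALUES (?, ?, ?, ?, ?, 1, ?)",
            (thread_id, user_id, subject, snippet, ts, json.dumps([sender])),
        )
    else:
        participants = json.loads(row[0])
        if sender not in participants:
            participants.append(sender)
        conn.execute(
            "UPDATE threads SET snippet = ?, last_message_ts = ?,"
            " unread_count = unread_count + 1, participants_json = ? WHERE thread_id = ?",
            (snippet, ts, json.dumps(participants), thread_id),
        )


def load_inbox(conn, user_id, limit=50):
    """Inbox listing: one indexed query on threads, no joins."""
    return conn.execute(
        "SELECT thread_id, subject, snippet, unread_count FROM threads"
        " WHERE user_id = ? AND in_inbox = 1 ORDER BY last_message_ts DESC LIMIT ?",
        (user_id, limit),
    ).fetchall()
```

The trade-off is extra write work per incoming message in exchange for a listing query that touches a single index range.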

Search: Full-Text Search on Email

Gmail search: "from:alice subject:invoice after:2024-01-01"

Options:
  Option A: Elasticsearch
    Index: {message_id, user_id, from, subject, body, ts}
    Query: bool filter on user_id + full-text on body/subject
    Latency: 50-200ms
    Cost: significant (index = 2-3× raw storage size)

  Option B: Custom inverted index per user
    Build per-user inverted index: word → list of message IDs
    Store in user's namespace (Bigtable / Cassandra)
    Commonly described as Google's approach: scales per user and isolates tenants

  Option C: CloudSearch / Typesense (simpler)
    Managed search; less control but faster to implement

Email-specific search optimizations:
  Sender search: "from:alice" → index FROM field separately for exact match
  Date range: partition index by year/month → prune partitions
  Attachment type: index MIME types for "has:attachment"
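A toy version of the per-user inverted index with operator parsing is sketched below. It is an in-memory illustration only: the class and method names are hypothetical, "after:" is treated as strictly later than the given date (an assumption), and date partitioning is replaced by a simple per-message date filter.

```python
import re
from collections import defaultdict
from datetime import date

OPERATORS = ("from", "subject", "after")


def parse_query(query: str):
    """Split a search string into operator filters and free-text terms."""
    ops, terms = {}, []
    for token in query.split():
        key, sep, value = token.partition(":")
        if sep and key in OPERATORS and value:
            ops[key] = value.lower()
        else:
            terms.append(token.lower())
    return ops, terms


def tokenize(text: str):
    return re.findall(r"[a-z0-9]+", text.lower())


class UserIndex:
    """Inverted index for a single user's mailbox."""

    def __init__(self):
        self.words = defaultdict(set)          # word -> message ids (subject + body)
        self.subject_words = defaultdict(set)  # word -> message ids (subject only)
        self.senders = defaultdict(set)        # full address or local part -> message ids
        self.dates = {}                        # message id -> sent date

    def add(self, message_id, sender, subject, body, sent_on: date):
        for word in tokenize(subject):
            self.subject_words[word].add(message_id)
        for word in tokenize(subject + " " + body):
            self.words[word].add(message_id)
        sender = sender.lower()
        self.senders[sender].add(message_id)
        self.senders[sender.split("@")[0]].add(message_id)
        self.dates[message_id] = sent_on

    def search(self, query: str):
        ops, terms = parse_query(query)
        result = set(self.dates)
        if "from" in ops:
            result &= self.senders.get(ops["from"], set())
        if "subject" in ops:
            result &= self.subject_words.get(ops["subject"], set())
        for term in terms:
            result &= self.words.get(term, set())
        if "after" in ops:
            cutoff = date.fromisoformat(ops["after"])
            result = {m for m in result if self.dates[m] > cutoff}
        return sorted(result)
```

Keeping the sender field in its own map is what makes "from:alice" an exact-match lookup instead of a full-text query.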

Push Notifications: New Email Delivery

SMTP inbound → Parse → Store message → Notify user

Notification channels:
  Web:    WebSocket (Gmail has used long-polling, later supplemented by the Push API)
  Mobile: APNs (iOS) / FCM (Android) → push notification
  Desktop: OS notification API

Inbound SMTP flow:
  MTA (Postfix) receives email → LMTP delivery to inbox service
  Inbox service:
    1. Spam/virus filtering (SpamAssassin, VirusTotal API)
    2. Apply user filter rules (if subject contains "receipt" → label Bills)
    3. Store message body to S3
    4. Store metadata to DB
    5. Publish "new_message" event to Redis Pub/Sub
    6. Push service: Redis subscriber → APNs/FCM notification
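The six inbound steps can be wired together as a small pipeline. Everything here is a stand-in: dictionaries replace S3 and the metadata DB, a list of callbacks replaces Redis Pub/Sub, and the spam check is an injected function. Skipping the notification for spam is an assumption, not something the flow above specifies.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Rule = Tuple[str, str]  # (substring to look for in the subject, label to apply)


@dataclass
class Inbound:
    message_id: str
    user_id: str
    subject: str
    raw: bytes
    labels: List[str] = field(default_factory=lambda: ["INBOX"])


class InboxService:
    def __init__(self, is_spam: Callable[[Inbound], bool]):
        self.is_spam = is_spam
        self.blobs: Dict[str, bytes] = {}      # stand-in for object storage
        self.metadata: Dict[str, dict] = {}    # stand-in for the metadata DB
        self.subscribers: List[Callable[[dict], None]] = []  # stand-in for pub/sub

    def deliver(self, msg: Inbound, rules: List[Rule]) -> List[str]:
        # 1. Spam filtering
        if self.is_spam(msg):
            msg.labels = ["SPAM"]
        else:
            # 2. User filter rules
            for needle, label in rules:
                if needle.lower() in msg.subject.lower():
                    msg.labels.append(label)
        # 3. Store the raw body
        self.blobs[f"messages/{msg.user_id}/{msg.message_id}"] = msg.raw
        # 4. Store metadata
        self.metadata[msg.message_id] = {
            "user_id": msg.user_id,
            "subject": msg.subject,
            "labels": list(msg.labels),
        }
        # 5-6. Publish the event; the push service subscribes and notifies devices
        if "SPAM" not in msg.labels:
            event = {"type": "new_message", "user_id": msg.user_id, "message_id": msg.message_id}
            for notify in self.subscribers:
                notify(event)
        return msg.labels
```

Storing the body before the metadata means a crash between steps 3 and 4 leaves an orphaned blob rather than a metadata row pointing at a missing body.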

Interview Discussion Points

  • Why use object storage for email bodies? Email bodies are immutable blobs of variable size (1KB to 25MB with attachments). Object storage is cheap (roughly $0.023/GB per month vs $0.10+/GB per month for database storage), scales without manual capacity planning, and has built-in durability. Metadata (from, subject, labels, read status) is mutable and frequently queried, so it belongs in a relational DB.
  • How does Gmail achieve sub-second search on 15 years of email? The commonly cited design is a per-user inverted index stored in Bigtable, partitioned by time range. Queries are scoped to one user (no cross-user queries), making it a single-tenant search problem. The index is updated asynchronously on message receipt.
  • How do you reach 99.9% email deliverability? Warm up sending IPs gradually, separate transactional from marketing IPs, monitor bounce and complaint rates continuously (automatically pause campaigns above a 0.1% complaint rate), maintain SPF/DKIM/DMARC, and use dedicated IPs for high-reputation senders.
