Requirements
- Send transactional emails (password reset, order confirmation, notifications) and marketing emails (campaigns, newsletters)
- 100M emails/day, peak 10K/second during campaigns
- Delivery within 30 seconds for transactional, best-effort for marketing
- Track delivery status: sent, delivered, bounced, spam-complained, opened, clicked
- Respect unsubscribes; maintain sender reputation
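As a quick sanity check on the volume figures above, the arithmetic below compares the daily average with the stated campaign peak. The function names are illustrative only.

```python
EMAILS_PER_DAY = 100_000_000   # from the requirements
PEAK_PER_SECOND = 10_000       # from the requirements
SECONDS_PER_DAY = 24 * 60 * 60


def average_per_second(emails_per_day: int = EMAILS_PER_DAY) -> float:
    """Average send rate if the daily volume were spread evenly."""
    return emails_per_day / SECONDS_PER_DAY


def peak_to_average_ratio() -> float:
    """How much headroom the pipeline needs above the daily average."""
    return PEAK_PER_SECOND / average_per_second()


if __name__ == "__main__":
    print(f"average: {average_per_second():.0f} emails/second")
    print(f"peak is {peak_to_average_ratio():.1f}x the average")
```

The average works out to roughly 1,157 emails/second, so the 10K/second peak is between 8x and 9x the average. That gap is why the design queues jobs in Kafka rather than sizing the workers for the peak alone.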
Architecture
Service → Email API → Kafka (email_jobs topic)
→ Email Worker Pool
→ SMTP Relay (SendGrid / SES / Mailgun)
→ Delivery Status Webhook from SMTP relay
→ Status DB (PostgreSQL)
→ Event Kafka (email_events) → Analytics
Data Model
EmailJob(job_id UUID, type ENUM(TRANSACTIONAL,MARKETING), recipient_id,
to_address, from_address, subject, template_id, template_vars JSONB,
scheduled_at, status ENUM(PENDING,SENT,DELIVERED,BOUNCED,FAILED),
provider_message_id, created_at, sent_at)
EmailTemplate(template_id, name, subject_template, html_template, text_template,
version INT, created_at)
EmailSuppression(email_address, reason ENUM(BOUNCE,SPAM,UNSUBSCRIBE),
added_at, source)
Transactional vs Marketing Email
Transactional emails are triggered by a user action: highest priority, sent immediately, and generally not subject to unsubscribe (a legal exception in most jurisdictions). Examples: password reset, order shipped, 2FA code.
Marketing emails are promotional: lower priority, must respect unsubscribe, rate-limited to avoid spam classification, and sent in batches during business hours.
Keep the two types in separate Kafka topics, each with its own consumer group, so transactional traffic is never queued behind a campaign. Use separate sending domains and IPs so that marketing complaints cannot damage transactional reputation.
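A minimal sketch of how the Email API could route a job by type. The topic names, sending domains, and the SendingLane type are hypothetical and chosen only for illustration; the source specifies only that the two types use separate topics, domains, and IPs.

```python
from dataclasses import dataclass
from enum import Enum


class EmailType(Enum):
    TRANSACTIONAL = "TRANSACTIONAL"
    MARKETING = "MARKETING"


@dataclass(frozen=True)
class SendingLane:
    topic: str               # Kafka topic the job is published to (assumed name)
    sending_domain: str      # domain used in the From address (assumed name)
    honors_unsubscribe: bool # whether unsubscribes block the send


# Hypothetical lane configuration: one isolated lane per email type.
LANES = {
    EmailType.TRANSACTIONAL: SendingLane(
        topic="email_jobs.transactional",
        sending_domain="mail.example.com",
        honors_unsubscribe=False,
    ),
    EmailType.MARKETING: SendingLane(
        topic="email_jobs.marketing",
        sending_domain="news.example.com",
        honors_unsubscribe=True,
    ),
}


def lane_for(email_type: EmailType) -> SendingLane:
    """Pick the topic and sending domain for a job based on its type."""
    return LANES[email_type]
```

Keeping the mapping in one table makes it harder for a marketing job to be published onto the transactional topic or sent from the transactional domain by accident.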
Template Rendering
Render email templates at send time, not at job creation time, so that template updates take effect on campaigns that are already scheduled. Use a template engine such as Jinja2 or Handlebars to render the HTML with template_vars. Always generate both HTML and plaintext versions: some clients prefer plaintext, and including a plaintext part helps the spam score.
Inline the CSS with a CSS inliner before sending, because many email clients strip <style> blocks.
For click tracking, replace every URL in the body with a tracking URL (https://track.example.com/c/{job_id}/{link_hash}) that records the click and redirects to the original link.
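The render and link-rewriting steps might look like the sketch below. To stay dependency-free it uses Python's standard string.Template as a stand-in for Jinja2/Handlebars, so the placeholder syntax is $name rather than {{ name }}. The link_hash scheme (a truncated SHA-256 of the URL) and the function names are assumptions, not something the design prescribes.

```python
import hashlib
import re
from string import Template

TRACK_BASE = "https://track.example.com"
# Matches href attributes that point at absolute http(s) URLs.
HREF_RE = re.compile(r'href="(https?://[^"]+)"')


def link_hash(url: str) -> str:
    """Short, stable identifier for a link (assumed scheme)."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()[:12]


def render(template_text: str, template_vars: dict) -> str:
    """Render at send time so the latest template version is used."""
    return Template(template_text).substitute(template_vars)


def add_click_tracking(html: str, job_id: str) -> tuple[str, dict]:
    """Rewrite links to tracking URLs; return the HTML and hash -> URL map."""
    links: dict[str, str] = {}

    def replace(match: re.Match) -> str:
        url = match.group(1)
        digest = link_hash(url)
        links[digest] = url  # the redirect service needs this mapping
        return f'href="{TRACK_BASE}/c/{job_id}/{digest}"'

    return HREF_RE.sub(replace, html), links
```

For example, rendering `<a href="https://shop.example.com/o/$order_id">Order $order_id</a>` with order_id set to 42 and then calling add_click_tracking yields an href under https://track.example.com/c/{job_id}/, plus a map the redirect endpoint can use to resolve the hash back to the original URL.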
Sending and Reputation Management
Email reputation is fragile — a high bounce or spam complaint rate causes ISPs to block your domain. Key practices:
- Bounce handling: hard bounces (permanent: address doesn’t exist) → add to suppression list immediately. Soft bounces (temporary: mailbox full) → retry 3 times over 24h, then suppress.
- Spam complaint handling: mailbox providers send complaints via FBL (Feedback Loop). Add complainers to suppression list; never email them again.
- Suppression list check: before sending, check whether the recipient is in EmailSuppression. If so, drop the job without sending.
- Rate limiting and IP warm-up: throttle sends per recipient domain and warm up new IPs gradually. Start new IPs at around 1K/hour and increase the limit over several weeks.
- SPF, DKIM, DMARC: authenticate outbound email to prevent spoofing and improve deliverability.
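The bounce, complaint, and suppression rules above can be sketched as a small in-memory model. In the real system the state would live in the EmailSuppression table; the class and method names here are hypothetical, and lowercasing addresses for lookup is an assumption.

```python
from dataclasses import dataclass, field

MAX_SOFT_RETRIES = 3  # "retry 3 times over 24h, then suppress"


@dataclass
class SuppressionList:
    reasons: dict = field(default_factory=dict)       # address -> reason
    soft_retries: dict = field(default_factory=dict)  # address -> retries used

    def is_suppressed(self, address: str) -> bool:
        """Checked before every send; suppressed jobs are dropped."""
        return address.lower() in self.reasons

    def suppress(self, address: str, reason: str) -> None:
        """Add an address; the first recorded reason is kept."""
        self.reasons.setdefault(address.lower(), reason)

    def record_complaint(self, address: str) -> None:
        """Feedback-loop complaint: never email this address again."""
        self.suppress(address, "SPAM")

    def record_bounce(self, address: str, permanent: bool) -> bool:
        """Return True if the send should be retried, False if suppressed."""
        key = address.lower()
        if permanent:
            self.suppress(key, "BOUNCE")  # hard bounce: suppress immediately
            return False
        used = self.soft_retries.get(key, 0)
        if used >= MAX_SOFT_RETRIES:
            self.suppress(key, "BOUNCE")  # retries exhausted
            return False
        self.soft_retries[key] = used + 1
        return True
```

A worker would call is_suppressed before handing a message to the relay, and the status processor would call record_bounce or record_complaint when the corresponding event arrives. Spacing the three retries across 24 hours is left to the scheduler and is not modeled here.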
Delivery Status Tracking
SMTP relays (SendGrid, SES) send webhooks on delivery events: delivered, bounced, spam_report, open, click. Ingest these through a webhook endpoint that publishes to Kafka; a status processor consumes the events and updates EmailJob.status. Store every event in an EmailEvent table for analytics.
Open tracking: embed a 1×1 pixel image with a unique URL (https://track.example.com/o/{job_id}). When the email client loads the image, the open is recorded. Note that Apple Mail Privacy Protection pre-fetches tracking pixels, which inflates open rates, so treat opens as an unreliable metric.
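A rough sketch of the status processor and the open-tracking pixel follows. It assumes each webhook event carries a unique event_id and may be delivered more than once; the field names and the event-to-status mapping are illustrative, and real relay payloads differ by provider.

```python
TRACK_BASE = "https://track.example.com"

# Only these relay events map onto an EmailJob.status value. spam_report,
# open and click are stored as events but leave the job status unchanged.
STATUS_FOR_EVENT = {"delivered": "DELIVERED", "bounced": "BOUNCED"}


def open_pixel_tag(job_id: str) -> str:
    """1x1 image whose URL identifies the job when the client loads it."""
    return f'<img src="{TRACK_BASE}/o/{job_id}" width="1" height="1" alt="">'


def apply_event(job: dict, event: dict, event_log: list, seen_event_ids: set) -> bool:
    """Record one webhook event; return False if it was a duplicate."""
    if event["event_id"] in seen_event_ids:
        return False  # redelivered webhook: ignore
    seen_event_ids.add(event["event_id"])
    event_log.append(event)  # stands in for the EmailEvent table
    new_status = STATUS_FOR_EVENT.get(event["type"])
    if new_status is not None:
        job["status"] = new_status
    return True
```

Deduplicating on an event identifier keeps the analytics counts stable when the relay retries a webhook. In a real deployment the seen set would be a unique constraint on the EmailEvent table rather than an in-memory set.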
Campaign Sending
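Marketing campaigns send to millions of recipients. Never send all at once: ISPs rate-limit bulk senders. The campaign scheduler throttles sends to a fixed rate, for example 10K/minute, which spreads a segment of 300K-600K recipients across 30-60 minutes; larger campaigns need a higher rate or a longer window. Use time zone targeting to deliver at 10am local time for each recipient. The batch job repeatedly runs SELECT recipient_id FROM campaign_recipients WHERE campaign_id=X AND sent_at IS NULL LIMIT 1000 and processes the result in chunks. It sets sent_at on each processed row, so a restarted job skips recipients that were already handled (idempotent job).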
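The batch loop can be sketched with sqlite3 standing in for the real database. The table layout, the send callback, and the pause_seconds parameter are assumptions made for illustration; only the query shape and the sent_at marker come from the description above.

```python
import sqlite3
import time
from typing import Callable

BATCH_SIZE = 1000

# Assumed minimal schema for the campaign_recipients table.
SCHEMA = """
CREATE TABLE campaign_recipients (
    campaign_id  INTEGER NOT NULL,
    recipient_id INTEGER NOT NULL,
    sent_at      REAL,
    PRIMARY KEY (campaign_id, recipient_id)
)
"""


def send_campaign(
    conn: sqlite3.Connection,
    campaign_id: int,
    send: Callable[[int], None],
    batch_size: int = BATCH_SIZE,
    pause_seconds: float = 0.0,
) -> int:
    """Send to every unsent recipient in batches; return how many were sent."""
    total = 0
    while True:
        rows = conn.execute(
            "SELECT recipient_id FROM campaign_recipients "
            "WHERE campaign_id = ? AND sent_at IS NULL LIMIT ?",
            (campaign_id, batch_size),
        ).fetchall()
        if not rows:
            return total
        for (recipient_id,) in rows:
            send(recipient_id)
            # Mark the row so a restarted job does not pick it up again.
            conn.execute(
                "UPDATE campaign_recipients SET sent_at = ? "
                "WHERE campaign_id = ? AND recipient_id = ?",
                (time.time(), campaign_id, recipient_id),
            )
        conn.commit()
        total += len(rows)
        time.sleep(pause_seconds)  # throttle between batches
```

With batches of 1,000, a pause of about 6 seconds between batches approximates 10K/minute. Because the commit happens once per batch, a crash between sending and committing could still re-send up to one batch, so a production system would likely add a dedupe key at the worker as well.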
Key Design Decisions
- Separate sending infrastructure for transactional vs marketing — reputation isolation
- Suppression list check before every send — CAN-SPAM/GDPR compliance
- Webhook status callbacks from relay — async delivery confirmation, no polling
- Render templates at send time, not creation time — enables template updates for scheduled campaigns
- Gradual campaign sending — protects sender reputation, respects ISP rate limits