System Design Interview: Design a Pastebin / Code Snippet Service

Designing Pastebin is a classic beginner-to-intermediate system design question that covers URL shortening, content storage, access control, and expiration. It is also relevant to code snippet sharing services such as GitHub Gist and Carbon.

Requirements Clarification

Functional Requirements

  • Create a paste (text/code) and get a short shareable URL
  • View a paste by its short URL
  • Optional: set expiration time (1 hour, 1 day, 1 week, never)
  • Optional: set visibility (public, unlisted, private)
  • Optional: syntax highlighting by language
  • User accounts for managing own pastes

Non-Functional Requirements

  • Scale: 10M pastes/day, 100M reads/day (10:1 read:write ratio)
  • Storage: pastes up to 10MB, average 10KB, retain for up to 10 years
  • Read latency: <100ms
  • Availability: 99.9%
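
Before designing, it helps to turn the stated requirements into back-of-envelope numbers. A quick calculation from the figures above (seconds per day and retention are the only added constants):

```python
# Capacity estimation from the stated requirements
WRITES_PER_DAY = 10_000_000
READS_PER_DAY = 100_000_000
AVG_PASTE_BYTES = 10 * 1024   # 10KB average
RETENTION_YEARS = 10
SECONDS_PER_DAY = 86_400

writes_per_sec = WRITES_PER_DAY / SECONDS_PER_DAY   # ~116 writes/s
reads_per_sec = READS_PER_DAY / SECONDS_PER_DAY     # ~1,157 reads/s
total_storage_tb = (WRITES_PER_DAY * AVG_PASTE_BYTES
                    * 365 * RETENTION_YEARS) / 1024**4  # ~340 TB
```

Roughly 100GB of new content per day and ~340TB over the full retention window, which motivates object storage for content and a cache for the read-heavy path.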

Short URL Generation

Generate a unique 6-8 character alphanumeric ID for each paste.

Option 1: Random ID

import secrets, string

def generate_id(length=7):
    chars = string.ascii_letters + string.digits  # 62 chars
    return ''.join(secrets.choice(chars) for _ in range(length))
# 62^7 = 3.5 trillion unique IDs - sufficient
# Check DB for collision before inserting (rare but possible)
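
The collision check above is typically a retry loop around the insert. A minimal sketch, using a dict to stand in for the database's uniqueness check (a real service would catch a unique-constraint violation and retry):

```python
import secrets
import string

ALPHABET = string.ascii_letters + string.digits  # base62

def generate_id(length=7):
    return ''.join(secrets.choice(ALPHABET) for _ in range(length))

def create_paste(store, content, max_attempts=3):
    # `store` stands in for the pastes table; on a real DB,
    # an INSERT hitting a primary-key conflict triggers the retry.
    for _ in range(max_attempts):
        paste_id = generate_id()
        if paste_id not in store:
            store[paste_id] = content
            return paste_id
    raise RuntimeError("could not find a free ID")
```

With 62^7 possible IDs, more than one retry is vanishingly rare at this scale.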

Option 2: Hash-Based

import hashlib, time

def generate_id(content, user_id):
    data = content + str(user_id) + str(time.time())
    hash_val = hashlib.md5(data.encode()).hexdigest()
    return hash_val[:7]  # first 7 hex chars of MD5
# Caveat: 7 hex chars give only 16^7 ≈ 268M IDs, and by the birthday
# bound collisions become likely after ~sqrt(16^7) ≈ 16K pastes.
# Base62-encode the hash (or take more chars) to reach a 62^7 space.

Option 3: Distributed ID Generator

Pre-generate batches of unique IDs using a counter service (like Twitter Snowflake). No collision risk. More complex but necessary at very high scale.
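
Counter-based integer IDs (from Redis INCR or a Snowflake-style service) are then base62-encoded into short URL tokens. A minimal encoder (the digit ordering here is an arbitrary choice):

```python
import string

# 62 symbols: 0-9, a-z, A-Z
BASE62 = string.digits + string.ascii_lowercase + string.ascii_uppercase

def encode_base62(n: int) -> str:
    """Encode a non-negative integer as a base62 string."""
    if n == 0:
        return BASE62[0]
    digits = []
    while n:
        n, rem = divmod(n, 62)
        digits.append(BASE62[rem])
    return ''.join(reversed(digits))
```

Sequential counters produce guessable IDs, so unlisted pastes would need a separate random token or an obfuscation step.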

High-Level Architecture

User
  |
Load Balancer
  |
API Service
  |         |
Create    Read
  |         |
ID Gen    Cache (Redis)
  |         | miss
Paste DB  Paste DB
(Postgres) (read replica)
  |
Object Store (S3)
(for large pastes >1KB)

Storage Design

Database Schema

pastes:
  id          VARCHAR(8) PRIMARY KEY
  user_id     INT (nullable for anonymous)
  title       VARCHAR(255)
  language    VARCHAR(50)    -- syntax highlighting
  visibility  ENUM('public', 'unlisted', 'private')
  size_bytes  INT
  content_key VARCHAR(255)   -- S3 key if stored externally
  content     TEXT           -- inline if <1KB
  created_at  TIMESTAMP
  expires_at  TIMESTAMP (nullable)

Hybrid Storage

  • Small pastes (<1KB): store inline in DB TEXT column for fast retrieval
  • Large pastes (>1KB): store in S3, keep content_key in DB
  • CDN in front of S3 for public pastes (cache aggressively with long TTL)
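
The inline-vs-S3 routing decision can be sketched as a pure function; the `pastes/<sha prefix>` key scheme is an assumption for illustration, not a prescribed layout:

```python
import hashlib

INLINE_THRESHOLD = 1024  # bytes: below this, content stays in the DB row

def plan_storage(content: str):
    """Return (inline_content, s3_key); exactly one is None."""
    size = len(content.encode("utf-8"))
    if size < INLINE_THRESHOLD:
        return content, None  # small: inline in the TEXT column
    # large: content goes to S3, only the key is stored in the DB
    key = "pastes/" + hashlib.sha256(content.encode("utf-8")).hexdigest()[:16]
    return None, key
```

Content-addressed keys (hash of content) also deduplicate identical pastes for free, at the cost of complicating deletion.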

Caching Strategy

# Cache paste by ID in Redis (cache-aside: read from cache first,
# populate on a miss; on create, the paste can also be written through)
def get_paste(paste_id):
    cached = redis.get(f"paste:{paste_id}")
    if cached:
        return json.loads(cached)

    paste = db.query("SELECT * FROM pastes WHERE id = ?", paste_id)
    if not paste:
        return None

    if paste.expires_at and paste.expires_at < now():
        return None  # expired (lazy expiration)

    # Never cache past the paste's own expiry; cap the TTL at 1 hour
    remaining = int((paste.expires_at - now()).total_seconds()) if paste.expires_at else 3600
    ttl = min(3600, remaining)
    redis.setex(f"paste:{paste_id}", ttl, json.dumps(paste))
    return paste

Expiration Handling

  • Redis TTL: set TTL on cache entry equal to paste expiry. Cache auto-expires.
  • DB expiry: check expires_at on every read (lazy expiration). Background cron job deletes expired rows weekly to reclaim storage.
  • Soft delete: mark expired pastes instead of deleting, for potential undelete feature.
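
The weekly cleanup can be sketched against an in-memory table; production would run the same predicate as batched DELETEs to avoid holding long row locks (the batch size here is illustrative):

```python
from datetime import datetime, timedelta

def purge_expired(rows, now, batch_size=1000):
    """Remove expired rows in batches; returns the number removed.
    `rows` is a list of dicts standing in for the pastes table."""
    expired = [r for r in rows
               if r["expires_at"] is not None and r["expires_at"] < now]
    removed = 0
    for i in range(0, len(expired), batch_size):
        batch = expired[i:i + batch_size]
        for r in batch:
            rows.remove(r)  # DB: DELETE ... WHERE id IN (batch)
        removed += len(batch)
    return removed
```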

Access Control

  • Public: anyone can view, appears in search/explore
  • Unlisted: viewable by anyone with the link, not in search (like YouTube unlisted)
  • Private: only creator can view (requires auth check)

For private pastes, validate user session on every read request. Do not cache private pastes in shared cache without user isolation.
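
The visibility rules above reduce to a small check, assuming `paste` is a dict mirroring the schema's `visibility` and `user_id` columns:

```python
def can_view(paste, viewer_id):
    """viewer_id is None for anonymous (unauthenticated) requests."""
    if paste["visibility"] in ("public", "unlisted"):
        return True  # unlisted relies on the unguessable random ID
    # private: must be authenticated AND be the owner
    return viewer_id is not None and viewer_id == paste["user_id"]
```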

Analytics (Optional)

  • View count per paste: Redis INCR, batch write to DB hourly
  • Popular pastes: sorted set by view count, ZREVRANGE for top-K
  • Referrer tracking: log referrer header, aggregate by Kafka consumer
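
The counter-plus-top-K flow can be sketched with a `Counter` standing in for the Redis commands named above (INCR for counts, ZREVRANGE over a sorted set for top-K):

```python
from collections import Counter

views = Counter()  # stand-in for per-paste Redis INCR counters

def record_view(paste_id):
    views[paste_id] += 1  # Redis: INCR views:{paste_id}

def top_pastes(k):
    # Redis: ZREVRANGE popular 0 k-1 WITHSCORES on a sorted set
    return [pid for pid, _ in views.most_common(k)]
```

Batching these counts to the DB hourly keeps the hot path off the primary database.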

Interview Tips

  • This is a simpler URL shortener variant – design it cleanly in 20 minutes
  • Discuss storage choice: inline for small pastes, S3 for large
  • Expiration: lazy check on read + background cleanup
  • Caching with Redis is essential given 10:1 read:write ratio
  • Mention CDN for public pastes to reduce DB load
  • Discuss access control: public vs unlisted vs private



