System Design Interview: Design a Pastebin / Code Snippet Service

Designing Pastebin is a classic beginner-to-intermediate system design question that covers URL shortening, content storage, access control, and expiration. It is also relevant to code snippet sharing services such as GitHub Gist and Carbon.

Requirements Clarification

Functional Requirements

  • Create a paste (text/code) and get a short shareable URL
  • View a paste by its short URL
  • Optional: set expiration time (1 hour, 1 day, 1 week, never)
  • Optional: set visibility (public, unlisted, private)
  • Optional: syntax highlighting by language
  • User accounts for managing own pastes

Non-Functional Requirements

  • Scale: 10M pastes/day, 100M reads/day (10:1 read:write ratio)
  • Storage: pastes up to 10MB, average 10KB, retain for up to 10 years
  • Read latency: <100ms
  • Availability: 99.9%
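
Before designing, it helps to turn the stated requirements into back-of-envelope numbers. A quick calculation from the figures above (seconds per day and retention are the only added constants):

```python
# Capacity estimation from the stated requirements
WRITES_PER_DAY = 10_000_000
READS_PER_DAY = 100_000_000
AVG_PASTE_BYTES = 10 * 1024   # 10KB average
RETENTION_YEARS = 10
SECONDS_PER_DAY = 86_400

writes_per_sec = WRITES_PER_DAY / SECONDS_PER_DAY   # ~116 writes/s
reads_per_sec = READS_PER_DAY / SECONDS_PER_DAY     # ~1,157 reads/s
total_storage_tb = (WRITES_PER_DAY * AVG_PASTE_BYTES
                    * 365 * RETENTION_YEARS) / 1024**4  # ~340 TB
```

Roughly 100GB of new content per day and ~340TB over the full retention window, which motivates object storage for content and a cache for the read-heavy path.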

Short URL Generation

Generate a unique 6-8 character alphanumeric ID for each paste.

Option 1: Random ID

import secrets, string

def generate_id(length=7):
    chars = string.ascii_letters + string.digits  # 62 chars
    return ''.join(secrets.choice(chars) for _ in range(length))
# 62^7 = 3.5 trillion unique IDs - sufficient
# Check DB for collision before inserting (rare but possible)
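
The collision check above is typically a retry loop around the insert. A minimal sketch, using a dict to stand in for the database's uniqueness check (a real service would catch a unique-constraint violation and retry):

```python
import secrets
import string

ALPHABET = string.ascii_letters + string.digits  # base62

def generate_id(length=7):
    return ''.join(secrets.choice(ALPHABET) for _ in range(length))

def create_paste(store, content, max_attempts=3):
    # `store` stands in for the pastes table; on a real DB,
    # an INSERT hitting a primary-key conflict triggers the retry.
    for _ in range(max_attempts):
        paste_id = generate_id()
        if paste_id not in store:
            store[paste_id] = content
            return paste_id
    raise RuntimeError("could not find a free ID")
```

With 62^7 possible IDs, more than one retry is vanishingly rare at this scale.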

Option 2: Hash-Based

import hashlib, time

def generate_id(content, user_id):
    data = content + str(user_id) + str(time.time())
    hash_val = hashlib.md5(data.encode()).hexdigest()
    return hash_val[:7]  # first 7 hex chars of MD5
# Caveat: 7 hex chars give only 16^7 ≈ 268M IDs, and by the birthday
# bound collisions become likely after ~sqrt(16^7) ≈ 16K pastes.
# Base62-encode the hash (or take more chars) to reach a 62^7 space.

Option 3: Distributed ID Generator

Pre-generate batches of unique IDs using a counter service (like Twitter Snowflake). No collision risk. More complex but necessary at very high scale.
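
Counter-based integer IDs (from Redis INCR or a Snowflake-style service) are then base62-encoded into short URL tokens. A minimal encoder (the digit ordering here is an arbitrary choice):

```python
import string

# 62 symbols: 0-9, a-z, A-Z
BASE62 = string.digits + string.ascii_lowercase + string.ascii_uppercase

def encode_base62(n: int) -> str:
    """Encode a non-negative integer as a base62 string."""
    if n == 0:
        return BASE62[0]
    digits = []
    while n:
        n, rem = divmod(n, 62)
        digits.append(BASE62[rem])
    return ''.join(reversed(digits))
```

Sequential counters produce guessable IDs, so unlisted pastes would need a separate random token or an obfuscation step.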

High-Level Architecture

User
  |
Load Balancer
  |
API Service
  |         |
Create    Read
  |         |
ID Gen    Cache (Redis)
  |         | miss
Paste DB  Paste DB
(Postgres) (read replica)
  |
Object Store (S3)
(for large pastes >1KB)

Storage Design

Database Schema

pastes:
  id          VARCHAR(8) PRIMARY KEY
  user_id     INT (nullable for anonymous)
  title       VARCHAR(255)
  language    VARCHAR(50)    -- syntax highlighting
  visibility  ENUM('public', 'unlisted', 'private')
  size_bytes  INT
  content_key VARCHAR(255)   -- S3 key if stored externally
  content     TEXT           -- inline if <1KB
  created_at  TIMESTAMP
  expires_at  TIMESTAMP (nullable)

Hybrid Storage

  • Small pastes (<1KB): store inline in DB TEXT column for fast retrieval
  • Large pastes (>1KB): store in S3, keep content_key in DB
  • CDN in front of S3 for public pastes (cache aggressively with long TTL)
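
The inline-vs-S3 routing decision can be sketched as a pure function; the `pastes/<sha prefix>` key scheme is an assumption for illustration, not a prescribed layout:

```python
import hashlib

INLINE_THRESHOLD = 1024  # bytes: below this, content stays in the DB row

def plan_storage(content: str):
    """Return (inline_content, s3_key); exactly one is None."""
    size = len(content.encode("utf-8"))
    if size < INLINE_THRESHOLD:
        return content, None  # small: inline in the TEXT column
    # large: content goes to S3, only the key is stored in the DB
    key = "pastes/" + hashlib.sha256(content.encode("utf-8")).hexdigest()[:16]
    return None, key
```

Content-addressed keys (hash of content) also deduplicate identical pastes for free, at the cost of complicating deletion.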

Caching Strategy

# Cache paste by ID in Redis (cache-aside: read from cache first,
# populate on a miss; on create, the paste can also be written through)
def get_paste(paste_id):
    cached = redis.get(f"paste:{paste_id}")
    if cached:
        return json.loads(cached)

    paste = db.query("SELECT * FROM pastes WHERE id = ?", paste_id)
    if not paste:
        return None

    if paste.expires_at and paste.expires_at < now():
        return None  # expired (lazy expiration)

    # Never cache past the paste's own expiry; cap the TTL at 1 hour
    remaining = int((paste.expires_at - now()).total_seconds()) if paste.expires_at else 3600
    ttl = min(3600, remaining)
    redis.setex(f"paste:{paste_id}", ttl, json.dumps(paste))
    return paste

Expiration Handling

  • Redis TTL: set TTL on cache entry equal to paste expiry. Cache auto-expires.
  • DB expiry: check expires_at on every read (lazy expiration). Background cron job deletes expired rows weekly to reclaim storage.
  • Soft delete: mark expired pastes instead of deleting, for potential undelete feature.
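
The weekly cleanup can be sketched against an in-memory table; production would run the same predicate as batched DELETEs to avoid holding long row locks (the batch size here is illustrative):

```python
from datetime import datetime, timedelta

def purge_expired(rows, now, batch_size=1000):
    """Remove expired rows in batches; returns the number removed.
    `rows` is a list of dicts standing in for the pastes table."""
    expired = [r for r in rows
               if r["expires_at"] is not None and r["expires_at"] < now]
    removed = 0
    for i in range(0, len(expired), batch_size):
        batch = expired[i:i + batch_size]
        for r in batch:
            rows.remove(r)  # DB: DELETE ... WHERE id IN (batch)
        removed += len(batch)
    return removed
```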

Access Control

  • Public: anyone can view, appears in search/explore
  • Unlisted: viewable by anyone with the link, not in search (like YouTube unlisted)
  • Private: only creator can view (requires auth check)

For private pastes, validate user session on every read request. Do not cache private pastes in shared cache without user isolation.
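
The visibility rules above reduce to a small check, assuming `paste` is a dict mirroring the schema's `visibility` and `user_id` columns:

```python
def can_view(paste, viewer_id):
    """viewer_id is None for anonymous (unauthenticated) requests."""
    if paste["visibility"] in ("public", "unlisted"):
        return True  # unlisted relies on the unguessable random ID
    # private: must be authenticated AND be the owner
    return viewer_id is not None and viewer_id == paste["user_id"]
```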

Analytics (Optional)

  • View count per paste: Redis INCR, batch write to DB hourly
  • Popular pastes: sorted set by view count, ZREVRANGE for top-K
  • Referrer tracking: log referrer header, aggregate by Kafka consumer
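
The counter-plus-top-K flow can be sketched with a `Counter` standing in for the Redis commands named above (INCR for counts, ZREVRANGE over a sorted set for top-K):

```python
from collections import Counter

views = Counter()  # stand-in for per-paste Redis INCR counters

def record_view(paste_id):
    views[paste_id] += 1  # Redis: INCR views:{paste_id}

def top_pastes(k):
    # Redis: ZREVRANGE popular 0 k-1 WITHSCORES on a sorted set
    return [pid for pid, _ in views.most_common(k)]
```

Batching these counts to the DB hourly keeps the hot path off the primary database.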

Interview Tips

  • This is a simpler URL shortener variant – design it cleanly in 20 minutes
  • Discuss storage choice: inline for small pastes, S3 for large
  • Expiration: lazy check on read + background cleanup
  • Caching with Redis is essential given 10:1 read:write ratio
  • Mention CDN for public pastes to reduce DB load
  • Discuss access control: public vs unlisted vs private



