Low Level Design: Pastebin Service
A Pastebin service lets users store and share text (code, configs, logs) via a short URL. The design separates metadata from content, stores large text blobs in object storage, and supports visibility controls, syntax highlighting, expiry, and search.
Paste Schema
The database stores only metadata. The actual paste content lives in object storage (S3 or compatible).
CREATE TABLE pastes (
id BIGINT PRIMARY KEY AUTO_INCREMENT,
short_id VARCHAR(12) UNIQUE NOT NULL,
content_url TEXT NOT NULL,
title VARCHAR(255),
language VARCHAR(50),
owner_id BIGINT,
visibility ENUM('public','unlisted','private') DEFAULT 'public',
expires_at DATETIME,
view_count BIGINT DEFAULT 0,
password_hash VARCHAR(255),
created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
is_deleted BOOLEAN DEFAULT FALSE
);
content_url points to the object in S3 (e.g., s3://pastes-bucket/ab/cd1234.txt). Storing content in the DB would make it hard to scale; S3 handles large blobs cheaply and scales independently.
Content Storage in Object Storage
When a paste is created: write the content to S3 at a key derived from the paste ID (e.g., /pastes/00/01/0001f3a.txt), store the S3 key as content_url in the DB. When a paste is viewed: fetch metadata from DB, then fetch content from S3 (or serve via a signed CDN URL). This separation means the DB stays small and fast even with pastes up to 10MB each.
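The create/read flow above can be sketched as follows. The dictionaries `object_store` and `db` are in-memory stand-ins for S3 and the metadata table, and the key-derivation scheme (two-character prefix) is one plausible convention, not a prescription:

```python
def content_key(short_id: str) -> str:
    """Derive an object-storage key from the short ID.

    The two-character prefix spreads objects across key prefixes,
    a common convention for avoiding hot prefixes in object stores.
    """
    return f"pastes/{short_id[:2]}/{short_id}.txt"

# In-memory stand-ins for S3 and the metadata DB.
object_store: dict[str, bytes] = {}
db: dict[str, dict] = {}

def create_paste(short_id: str, content: str, language: str = "text") -> str:
    key = content_key(short_id)
    object_store[key] = content.encode("utf-8")                # 1. blob to "S3"
    db[short_id] = {"content_url": key, "language": language}  # 2. metadata row
    return key

def read_paste(short_id: str) -> str:
    row = db[short_id]                                         # 1. metadata from DB
    return object_store[row["content_url"]].decode("utf-8")    # 2. blob from "S3"
```

In production the second read step would usually be replaced by redirecting the client to a signed CDN URL rather than proxying the blob through the application server.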
Unique Short ID Generation
Use Base62 encoding of the auto-increment id. Since id is a DB sequence, uniqueness is guaranteed without collision checks. For a 6-character code, Base62 supports over 56 billion distinct values. Alternatively, use a random nanoid with retry-on-collision. Pre-warm a pool of short IDs in Redis for high-throughput scenarios.
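A minimal Base62 encoder over the auto-increment id might look like this; the alphabet order is a convention, and any fixed 62-character alphabet works as long as it never changes:

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def base62_encode(n: int) -> str:
    """Encode a non-negative integer as a Base62 string."""
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n > 0:
        n, rem = divmod(n, 62)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))  # most significant digit first
```

Since the input is a strictly increasing DB sequence, every output is unique by construction; the 62^6 (about 56.8 billion) six-character codes quoted above are `base62_encode(0)` through `base62_encode(62**6 - 1)`.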
Syntax Highlighting
Two approaches:
- Client-side with Prism.js: Send the raw text and language tag to the browser. Prism.js highlights in JavaScript. Zero server cost, works for all languages Prism supports. Slightly slower first render.
- Server-side with Pygments or highlight.js (Node): Render highlighted HTML on the server, cache the result in S3 alongside the raw content. Faster page load, pre-renderable, better for SEO.
Store the detected or user-specified language in the DB. If the user does not specify, auto-detect using a library like linguist or guesslang.
Visibility Levels
- Public: Listed in public feeds, indexed by search engines, searchable within the site.
- Unlisted: Accessible via direct link only; not indexed by search engines or listed in public feeds. Anyone who has the URL can view the paste.
- Private: Only the owner can view. Requires authentication to access.
Password-Protected Pastes
Store a bcrypt hash of the password in password_hash. On access, if password_hash is set, present a password form. Verify the submitted password against the hash before serving content. Issue a session token or signed cookie so the user does not have to re-enter the password on refresh.
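The hash-and-verify flow can be sketched with the standard library. PBKDF2 stands in here for bcrypt (same idea: a slow, salted hash) because bcrypt itself requires a third-party package; the `salt:digest` storage format is an illustrative choice:

```python
import hashlib
import hmac
import os

def hash_password(password: str) -> str:
    """Salted slow hash, suitable for the password_hash column."""
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 200_000)
    return salt.hex() + ":" + digest.hex()

def verify_password(password: str, stored: str) -> bool:
    """Re-derive the digest with the stored salt; compare in constant time."""
    salt_hex, digest_hex = stored.split(":")
    candidate = hashlib.pbkdf2_hmac(
        "sha256", password.encode(), bytes.fromhex(salt_hex), 200_000
    )
    return hmac.compare_digest(candidate.hex(), digest_hex)
```

On a successful `verify_password`, the server would set the signed cookie mentioned above so the form is not shown again for the session.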
Expiry and Cleanup
Return 410 Gone for expired pastes at read time. Run a scheduled job to delete expired rows from the DB and remove the corresponding S3 objects. Index expires_at for efficient range queries. For soft-delete, mark is_deleted = TRUE and run the S3 cleanup asynchronously to avoid blocking the DB job.
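The scheduled job can be sketched as below, with a list standing in for the pastes table and a set for the S3 bucket; the batch size is a tunable, not a prescription. Deleting the blob before the row means a crash mid-batch leaves a harmless dangling row rather than an orphaned blob:

```python
# Stand-ins for the pastes table and the S3 bucket.
pastes = [
    {"short_id": "a1", "content_url": "pastes/a1.txt", "expires_at": 100.0},
    {"short_id": "b2", "content_url": "pastes/b2.txt", "expires_at": None},  # no expiry
]
s3_objects = {"pastes/a1.txt", "pastes/b2.txt"}

def cleanup_expired(now: float, batch_size: int = 1000) -> int:
    """Delete up to batch_size expired pastes: blob first, then the metadata row."""
    expired = [p for p in pastes if p["expires_at"] is not None and p["expires_at"] <= now]
    removed = 0
    for row in expired[:batch_size]:
        s3_objects.discard(row["content_url"])  # remove blob from "S3"
        pastes.remove(row)                      # then drop the metadata row
        removed += 1
    return removed
```

In the real system the `expired` query is the indexed range scan on `expires_at` described above, and the job runs on a fixed schedule (e.g. every few minutes) until it drains.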
Search
Full-text search over paste titles and content previews via Elasticsearch. Index: paste short_id, title, first 200 characters of content (preview text), language, created_at, and visibility. Only index public pastes. Sync to Elasticsearch asynchronously via a queue after paste creation. Support filtering by language and sorting by recency or view count.
Abuse Detection
- Content hash blocklist: SHA-256 the raw paste content. Check against a blocklist of known-bad content hashes (CSAM hashes, leaked credential dumps). Reject on match.
- User report queue: Allow visitors to flag pastes. Flagged pastes enter a review queue. Auto-hide after N reports pending review.
- Rate limiting: Limit paste creation per IP and per user. Use Redis token bucket. Anonymous users get a lower limit than authenticated users.
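Of the three layers above, the content-hash check is the cheapest and can run synchronously on the write path. A minimal sketch, with a plain set standing in for whatever blocklist store is used:

```python
import hashlib

# Stand-in for a blocklist of known-bad SHA-256 digests.
BLOCKLIST = {hashlib.sha256(b"known bad dump").hexdigest()}

def is_blocked(content: bytes) -> bool:
    """Reject a paste if the SHA-256 of its body matches a known-bad hash."""
    return hashlib.sha256(content).hexdigest() in BLOCKLIST
```

Because the check is a single hash plus a set lookup, it adds negligible latency to paste creation even with a blocklist of millions of entries.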
Summary
Key design decisions: metadata in DB + content in S3, Base62 short IDs from auto-increment, client-side syntax highlighting with Prism.js for simplicity, Elasticsearch for full-text search over public pastes, and content hash blocklist for abuse prevention.
Frequently Asked Questions
What is a pastebin service in system design?
A pastebin service allows users to upload and share arbitrary text (code snippets, logs, notes) via a short URL. Core design components include: a short-ID generator (similar to a URL shortener), a metadata store (author, creation time, expiry, visibility, language), a content store for the paste body, and a read path that serves content quickly at scale. Pastes may be public, unlisted (reachable only via a secret URL), private (owner-only), or password-protected, and they typically support configurable TTLs after which content is deleted.
Why store paste content in object storage instead of a database?
Paste bodies are variable-length blobs that can range from a few bytes to several megabytes. Storing large blobs in a relational database inflates row sizes, degrades index performance, wastes buffer pool memory, and makes backups slow. Object storage (S3, GCS) is purpose-built for unstructured blobs: it scales to exabytes, charges only for bytes stored, and delivers content efficiently via CDN integration. The canonical pattern is to store only metadata (paste ID, owner, expiry, content type, S3 key) in the database and the raw body in object storage, fetching the blob directly or via a CDN-signed URL at read time.
How do you implement paste expiry at scale?
Two complementary approaches are used together. Lazy expiry: on every read, check the expiry timestamp in the metadata store; if expired, return 410 Gone and enqueue a deletion job rather than deleting synchronously. Active expiry: a background worker queries for pastes whose expires_at <= now() (using a database index or a Redis sorted set keyed by expiry time), deletes the object-storage blob, and removes the metadata row in batches. Redis TTL can handle expiry for the cache layer automatically. This combination avoids expensive full-table scans, keeps the hot path fast, and ensures eventual cleanup of expired data.
How do you prevent abuse and detect malicious content in a pastebin?
A layered approach is most effective:
- Rate limiting by IP and authenticated user to throttle bulk uploads.
- Content hashing: compute a SHA-256 of each paste body and check it against a blocklist of known-malicious hashes (similar to PhotoDNA or Google Safe Browsing hash lists).
- Pattern scanning: regex or ML-based classifiers to detect leaked credentials, spam, or malware URLs.
- CAPTCHA on anonymous uploads to deter bot submissions.
- Abuse reporting: a human review queue where flagged pastes are taken offline pending review.
- Link scanning: check URLs embedded in pastes against Safe Browsing APIs before serving, and optionally defang them so they are not directly clickable.
Combine synchronous checks for quick wins with async scanning pipelines for deeper analysis that does not block the write path.
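The rate-limiting layer can be sketched as a token bucket. Time is injected as a parameter here for determinism; in production the bucket state would live in Redis, keyed by IP address or user ID, with a higher capacity for authenticated users:

```python
class TokenBucket:
    """Refill `rate` tokens per second up to `capacity`; each paste costs one token."""

    def __init__(self, capacity: float, rate: float, now: float = 0.0):
        self.capacity = capacity
        self.rate = rate
        self.tokens = capacity
        self.last = now

    def allow(self, now: float) -> bool:
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

For example, a bucket with `capacity=2, rate=1.0` permits a burst of two pastes, then one paste per second thereafter.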