Low Level Design: Pastebin Service
A Pastebin service lets users store and share text (code, configs, logs) via a short URL. The design separates metadata from content, stores large text blobs in object storage, and supports visibility controls, syntax highlighting, expiry, and search.
Paste Schema
The database stores only metadata. The actual paste content lives in object storage (S3 or compatible).
CREATE TABLE pastes (
id BIGINT PRIMARY KEY AUTO_INCREMENT,
short_id VARCHAR(12) UNIQUE NOT NULL,
content_url TEXT NOT NULL,
title VARCHAR(255),
language VARCHAR(50),
owner_id BIGINT,
visibility ENUM('public','unlisted','private') DEFAULT 'public',
expires_at DATETIME,
view_count BIGINT DEFAULT 0,
password_hash VARCHAR(255),
created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
is_deleted BOOLEAN DEFAULT FALSE
);
content_url points to the object in S3 (e.g., s3://pastes-bucket/ab/cd1234.txt). Storing content in the DB would make it hard to scale; S3 handles large blobs cheaply and scales independently.
Content Storage in Object Storage
When a paste is created: write the content to S3 at a key derived from the paste ID (e.g., /pastes/00/01/0001f3a.txt), store the S3 key as content_url in the DB. When a paste is viewed: fetch metadata from DB, then fetch content from S3 (or serve via a signed CDN URL). This separation means the DB stays small and fast even with pastes up to 10MB each.
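The create/read flow above can be sketched as follows. The dictionaries `object_store` and `db` are in-memory stand-ins for S3 and the metadata table, and the key-derivation scheme (two-character prefix) is one plausible convention, not a prescription:

```python
def content_key(short_id: str) -> str:
    """Derive an object-storage key from the short ID.

    The two-character prefix spreads objects across key prefixes,
    a common convention for avoiding hot prefixes in object stores.
    """
    return f"pastes/{short_id[:2]}/{short_id}.txt"

# In-memory stand-ins for S3 and the metadata DB.
object_store: dict[str, bytes] = {}
db: dict[str, dict] = {}

def create_paste(short_id: str, content: str, language: str = "text") -> str:
    key = content_key(short_id)
    object_store[key] = content.encode("utf-8")                # 1. blob to "S3"
    db[short_id] = {"content_url": key, "language": language}  # 2. metadata row
    return key

def read_paste(short_id: str) -> str:
    row = db[short_id]                                         # 1. metadata from DB
    return object_store[row["content_url"]].decode("utf-8")    # 2. blob from "S3"
```

In production the second read step would usually be replaced by redirecting the client to a signed CDN URL rather than proxying the blob through the application server.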
Unique Short ID Generation
Use Base62 encoding of the auto-increment id. Since id is a DB sequence, uniqueness is guaranteed without collision checks. For a 6-character code, Base62 supports over 56 billion distinct values. Alternatively, use a random nanoid with retry-on-collision. Pre-warm a pool of short IDs in Redis for high-throughput scenarios.
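A minimal Base62 encoder over the auto-increment id might look like this; the alphabet order is a convention, and any fixed 62-character alphabet works as long as it never changes:

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def base62_encode(n: int) -> str:
    """Encode a non-negative integer as a Base62 string."""
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n > 0:
        n, rem = divmod(n, 62)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))  # most significant digit first
```

Since the input is a strictly increasing DB sequence, every output is unique by construction; the 62^6 (about 56.8 billion) six-character codes quoted above are `base62_encode(0)` through `base62_encode(62**6 - 1)`.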
Syntax Highlighting
Two approaches:
- Client-side with Prism.js: Send the raw text and language tag to the browser. Prism.js highlights in JavaScript. Zero server cost, works for all languages Prism supports. Slightly slower first render.
- Server-side with Pygments or highlight.js (Node): Render highlighted HTML on the server, cache the result in S3 alongside the raw content. Faster page load, pre-renderable, better for SEO.
Store the detected or user-specified language in the DB. If the user does not specify, auto-detect using a library like linguist or guesslang.
Visibility Levels
- Public: Listed in public feeds, indexed by search engines, searchable within the site.
- Unlisted: Accessible via direct link only; not indexed by search engines or listed in public feeds. Anyone who has the URL can view the paste.
- Private: Only the owner can view. Requires authentication to access.
Password-Protected Pastes
Store a bcrypt hash of the password in password_hash. On access, if password_hash is set, present a password form. Verify the submitted password against the hash before serving content. Issue a session token or signed cookie so the user does not have to re-enter the password on refresh.
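The hash-and-verify flow can be sketched with the standard library. PBKDF2 stands in here for bcrypt (same idea: a slow, salted hash) because bcrypt itself requires a third-party package; the `salt:digest` storage format is an illustrative choice:

```python
import hashlib
import hmac
import os

def hash_password(password: str) -> str:
    """Salted slow hash, suitable for the password_hash column."""
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 200_000)
    return salt.hex() + ":" + digest.hex()

def verify_password(password: str, stored: str) -> bool:
    """Re-derive the digest with the stored salt; compare in constant time."""
    salt_hex, digest_hex = stored.split(":")
    candidate = hashlib.pbkdf2_hmac(
        "sha256", password.encode(), bytes.fromhex(salt_hex), 200_000
    )
    return hmac.compare_digest(candidate.hex(), digest_hex)
```

On a successful `verify_password`, the server would set the signed cookie mentioned above so the form is not shown again for the session.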
Expiry and Cleanup
Return 410 Gone for expired pastes at read time. Run a scheduled job to delete expired rows from the DB and remove the corresponding S3 objects. Index expires_at for efficient range queries. For soft-delete, mark is_deleted = TRUE and run the S3 cleanup asynchronously to avoid blocking the DB job.
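The scheduled job can be sketched as below, with a list standing in for the pastes table and a set for the S3 bucket; the batch size is a tunable, not a prescription. Deleting the blob before the row means a crash mid-batch leaves a harmless dangling row rather than an orphaned blob:

```python
# Stand-ins for the pastes table and the S3 bucket.
pastes = [
    {"short_id": "a1", "content_url": "pastes/a1.txt", "expires_at": 100.0},
    {"short_id": "b2", "content_url": "pastes/b2.txt", "expires_at": None},  # no expiry
]
s3_objects = {"pastes/a1.txt", "pastes/b2.txt"}

def cleanup_expired(now: float, batch_size: int = 1000) -> int:
    """Delete up to batch_size expired pastes: blob first, then the metadata row."""
    expired = [p for p in pastes if p["expires_at"] is not None and p["expires_at"] <= now]
    removed = 0
    for row in expired[:batch_size]:
        s3_objects.discard(row["content_url"])  # remove blob from "S3"
        pastes.remove(row)                      # then drop the metadata row
        removed += 1
    return removed
```

In the real system the `expired` query is the indexed range scan on `expires_at` described above, and the job runs on a fixed schedule (e.g. every few minutes) until it drains.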
Search
Full-text search over paste titles and content previews via Elasticsearch. Index: paste short_id, title, first 200 characters of content (preview text), language, created_at, and visibility. Only index public pastes. Sync to Elasticsearch asynchronously via a queue after paste creation. Support filtering by language and sorting by recency or view count.
Abuse Detection
- Content hash blocklist: SHA-256 the raw paste content. Check against a blocklist of known-bad content hashes (CSAM hashes, leaked credential dumps). Reject on match.
- User report queue: Allow visitors to flag pastes. Flagged pastes enter a review queue. Auto-hide after N reports pending review.
- Rate limiting: Limit paste creation per IP and per user. Use Redis token bucket. Anonymous users get a lower limit than authenticated users.
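Of the three layers above, the content-hash check is the cheapest and can run synchronously on the write path. A minimal sketch, with a plain set standing in for whatever blocklist store is used:

```python
import hashlib

# Stand-in for a blocklist of known-bad SHA-256 digests.
BLOCKLIST = {hashlib.sha256(b"known bad dump").hexdigest()}

def is_blocked(content: bytes) -> bool:
    """Reject a paste if the SHA-256 of its body matches a known-bad hash."""
    return hashlib.sha256(content).hexdigest() in BLOCKLIST
```

Because the check is a single hash plus a set lookup, it adds negligible latency to paste creation even with a blocklist of millions of entries.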
Summary
Key design decisions: metadata in DB + content in S3, Base62 short IDs from auto-increment, client-side syntax highlighting with Prism.js for simplicity, Elasticsearch for full-text search over public pastes, and content hash blocklist for abuse prevention.
Frequently Asked Questions
What is a pastebin service in system design?
A pastebin service allows users to upload and share arbitrary text (code snippets, logs, notes) via a short URL. Core design components include: a short-ID generator (similar to a URL shortener), a metadata store (author, creation time, expiry, visibility, language), a content store for the paste body, and a read path that serves content quickly at scale. Pastes may be public, unlisted (reachable only via a secret URL), private (owner-only), or password-protected, and they typically support configurable TTLs after which content is deleted.
Why store paste content in object storage instead of a database?
Paste bodies are variable-length blobs that can range from a few bytes to several megabytes. Storing large blobs in a relational database inflates row sizes, degrades index performance, wastes buffer pool memory, and makes backups slow. Object storage (S3, GCS) is purpose-built for unstructured blobs: it scales to exabytes, charges only for bytes stored, and delivers content efficiently via CDN integration. The canonical pattern is to store only metadata (paste ID, owner, expiry, content type, S3 key) in the database and the raw body in object storage, fetching the blob directly or via a CDN-signed URL at read time.
How do you implement paste expiry at scale?
Two complementary approaches are used together. Lazy expiry: on every read, check the expiry timestamp in the metadata store; if expired, return 410 Gone and enqueue a deletion job rather than deleting synchronously. Active expiry: a background worker queries for pastes whose expires_at <= now() (using a database index or a Redis sorted set keyed by expiry time), deletes the object-storage blob, and removes the metadata row in batches. Redis TTL can handle expiry for the cache layer automatically. This combination avoids expensive full-table scans, keeps the hot path fast, and ensures eventual cleanup of expired data.
How do you prevent abuse and detect malicious content in a pastebin?
A layered approach is most effective:
- Rate limiting by IP and authenticated user to throttle bulk uploads.
- Content hashing: compute a SHA-256 of each paste body and check it against a blocklist of known-malicious hashes (similar to PhotoDNA or Google Safe Browsing hash lists).
- Pattern scanning: regex or ML-based classifiers to detect leaked credentials, spam, or malware URLs.
- CAPTCHA on anonymous uploads to deter bot submissions.
- Abuse reporting: a human review queue where flagged pastes are taken offline pending review.
- Link scanning: check URLs embedded in pastes against Safe Browsing APIs before serving, and optionally defang them so they are not directly clickable.
Combine synchronous checks for quick wins with async scanning pipelines for deeper analysis that does not block the write path.
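The rate-limiting layer can be sketched as a token bucket. Time is injected as a parameter here for determinism; in production the bucket state would live in Redis, keyed by IP address or user ID, with a higher capacity for authenticated users:

```python
class TokenBucket:
    """Refill `rate` tokens per second up to `capacity`; each paste costs one token."""

    def __init__(self, capacity: float, rate: float, now: float = 0.0):
        self.capacity = capacity
        self.rate = rate
        self.tokens = capacity
        self.last = now

    def allow(self, now: float) -> bool:
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

For example, a bucket with `capacity=2, rate=1.0` permits a burst of two pastes, then one paste per second thereafter.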