Video streaming is one of the most infrastructure-intensive system design problems. YouTube streams 1 billion hours of video per day. Getting the architecture right requires reasoning about upload pipelines, transcoding at scale, adaptive bitrate streaming, CDN delivery, and metadata storage — all at once. It’s a rich interview problem precisely because it touches so many layers.
Step 1: Clarify Requirements
- Upload scale: 500 hours of video uploaded per minute (YouTube’s real number).
- Streaming scale: 1B+ hours viewed per day, global audience.
- Video formats: Accept any format from uploaders; deliver multiple resolutions (360p, 720p, 1080p, 4K).
- Latency: Playback must start within 2–3 seconds. No buffering under normal conditions.
- Features: Video search, recommendations, comments, like counts, view counts.
Step 2: Back-of-Envelope
Uploads: 500 hours/min = 30,000 min of video/min
1 min of 1080p ≈ 200MB → 30,000 × 200MB = 6TB raw uploads/min
After transcoding to 5 resolutions: ~3× storage = 18TB/min
Streaming: 1B hours/day ÷ 24 ≈ 42M hours viewed per hour
→ ~42M average concurrent streams
1 stream at 720p ≈ 2.5 Mbps
42M streams × 2.5 Mbps ≈ 100 Tbps aggregate egress bandwidth
Storage: 18TB/min transcoded × 60 min × 24 hr × 365 days
≈ 9.5EB/year → exabyte-scale object storage (S3-equivalent)
This immediately tells you: storage is S3/GCS-style object storage, delivery is CDN (egress at this scale cannot come from origin servers), and transcoding is a massively parallel pipeline.
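The estimates are easy to sanity-check in a few lines. This sketch uses the per-stream bitrate and per-minute transcoded volume assumed above; the exact constants are illustrative, not measured:

```python
# Back-of-envelope sanity check for the streaming and storage estimates.

HOURS_VIEWED_PER_DAY = 1e9     # 1B hours/day (given)
STREAM_BITRATE_MBPS = 2.5      # one 720p stream (assumed above)
TRANSCODED_TB_PER_MIN = 18     # 6TB/min raw x ~3 after multi-res transcode

# Average concurrency: total hours viewed per day, spread over 24 hours.
concurrent_streams = HOURS_VIEWED_PER_DAY / 24                 # ~42M
egress_tbps = concurrent_streams * STREAM_BITRATE_MBPS / 1e6   # Mbps -> Tbps

# Storage: transcoded output per minute, accumulated over a year.
storage_eb_per_year = TRANSCODED_TB_PER_MIN * 60 * 24 * 365 / 1e6  # TB -> EB

print(f"~{concurrent_streams / 1e6:.0f}M concurrent streams")
print(f"~{egress_tbps:.0f} Tbps aggregate egress")
print(f"~{storage_eb_per_year:.1f} EB/year of stored video")
```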
Step 3: Two Separate Systems
YouTube is really two distinct systems that share a metadata store:
- Upload & Processing Pipeline — ingesting, transcoding, storing video.
- Streaming & Delivery System — serving video to viewers globally at low latency.
Design them separately, then tie them together.
Step 4: Upload Pipeline
User → Upload Service → Raw Video Storage (S3)
↓
Kafka "video_uploaded"
↓
Transcoding Workers (fleet)
├─ Segment video into chunks
├─ Transcode each chunk in parallel:
│ 360p, 480p, 720p, 1080p, 4K
├─ Generate thumbnails
└─ Write to CDN Origin Storage (S3)
↓
Metadata DB update: status = "ready"
↓
Notify user: "Your video is live"
Chunked Upload
A 2-hour 4K video is 50–100GB. Uploading as a single file is fragile — any network interruption loses the whole upload. Use resumable chunked upload:
1. Client calls POST /upload/init → gets upload_id
2. Client splits file into 5MB chunks
3. Client sends each chunk: PUT /upload/{upload_id}/chunk/{n}
4. Server stores chunks in S3, tracks which chunks arrived
5. Client calls POST /upload/{upload_id}/complete
6. Server assembles chunks → triggers transcoding pipeline
This mirrors how S3 Multipart Upload, GCS Resumable Uploads, and the YouTube Data API v3's resumable uploads work.
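The chunking and resume logic above can be sketched in a few dozen lines. The server here is an in-memory stand-in for the upload service plus S3 chunk store (names like `UploadServer` are illustrative, not a real API), but the protocol shape matches the steps listed:

```python
# Sketch of resumable chunked upload: client splits the file, server
# tracks which chunk indexes arrived, assembly happens on "complete".

CHUNK_SIZE = 5 * 1024 * 1024  # 5MB, as in the protocol above

def split_chunks(data: bytes, size: int = CHUNK_SIZE):
    """Yield (index, chunk) pairs: one PUT /upload/{id}/chunk/{n} each."""
    for n, offset in enumerate(range(0, len(data), size)):
        yield n, data[offset:offset + size]

class UploadServer:
    """In-memory stand-in for the upload service + S3 chunk storage."""
    def __init__(self):
        self.chunks = {}  # chunk index -> bytes

    def put_chunk(self, n: int, chunk: bytes):
        self.chunks[n] = chunk  # idempotent: retrying a chunk is safe

    def missing(self, total: int):
        return [n for n in range(total) if n not in self.chunks]

    def complete(self) -> bytes:
        # Reassemble in index order; would then trigger transcoding.
        return b"".join(self.chunks[n] for n in sorted(self.chunks))

# Upload with a simulated mid-upload failure, then resume.
data = bytes(range(256)) * 100_000       # ~25MB test payload
server = UploadServer()
parts = list(split_chunks(data))
for n, chunk in parts[:3]:               # connection drops after 3 chunks...
    server.put_chunk(n, chunk)
for n in server.missing(len(parts)):     # ...resume: send only what's missing
    server.put_chunk(n, parts[n][1])

assert server.complete() == data         # reassembled file is byte-identical
```

The key property is that a failed upload resumes from the last acknowledged chunk instead of restarting the whole transfer.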
Transcoding
Transcoding (converting raw video to multiple resolutions/codecs) is CPU-intensive. One hour of 4K video takes roughly 30 minutes of CPU time to transcode to all resolutions. At 500 hours of video uploaded per minute, you need a large fleet of workers.
Parallelization strategy:
- Split each video into 10-second segments.
- Each segment is transcoded independently by a separate worker.
- A coordinator reassembles segments into the final video.
- This turns a ~30-minute sequential job into ~30 seconds: a 1-hour video yields 360 ten-second segments, so 60 workers process 6 segments each.
Kafka coordinates the transcoding pipeline. Each transcoding job is a Kafka message; worker pools consume from the topic and scale horizontally.
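The fan-out/fan-in shape of the pipeline can be sketched as follows. Transcoding itself is mocked; in production each job would be a Kafka message and each worker would shell out to a real encoder such as ffmpeg:

```python
# Parallel segment transcoding sketch: split -> fan out -> reassemble.
from concurrent.futures import ThreadPoolExecutor

SEGMENT_SEC = 10

def split_into_segments(duration_sec: int):
    """A 1-hour video becomes 360 ten-second segment jobs."""
    return [(i, min(SEGMENT_SEC, duration_sec - i * SEGMENT_SEC))
            for i in range((duration_sec + SEGMENT_SEC - 1) // SEGMENT_SEC)]

def transcode_segment(job, resolutions=("360p", "720p", "1080p")):
    """Stand-in for the real per-segment transcode (e.g. an ffmpeg call)."""
    index, length_sec = job
    return {res: f"seg_{index:04d}_{res}.ts" for res in resolutions}

def transcode_video(duration_sec: int, workers: int = 60):
    jobs = split_into_segments(duration_sec)
    # Fan out: in production the queue is Kafka; here a pool stands in.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(transcode_segment, jobs))
    # Fan in: the coordinator collects ordered segment lists per resolution.
    return {res: [r[res] for r in results] for res in results[0]}

outputs = transcode_video(3600)        # one hour of video
assert len(outputs["720p"]) == 360     # 360 ten-second segments per rung
```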
Output Format: HLS / DASH
Don’t serve video as a single MP4. Use HLS (HTTP Live Streaming) or MPEG-DASH:
- Video is divided into small segments (2–10 seconds each).
- Each segment is encoded at multiple bitrates.
- A manifest file (M3U8 for HLS) lists all available segments and resolutions.
- The client requests segments sequentially via plain HTTP — CDN-cacheable.
# HLS manifest (m3u8)
#EXTM3U
#EXT-X-VERSION:3
#EXT-X-STREAM-INF:BANDWIDTH=800000,RESOLUTION=640x360
360p/playlist.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=2500000,RESOLUTION=1280x720
720p/playlist.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=8000000,RESOLUTION=1920x1080
1080p/playlist.m3u8
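On the player side, this master manifest is what drives quality selection. A minimal parser for the variant-playlist lines above (enough for this example; real M3U8 attribute lists can contain quoted commas, which this ignores):

```python
# Minimal parser for an HLS master playlist: pair each
# #EXT-X-STREAM-INF line with the playlist URI on the following line.

MANIFEST = """#EXTM3U
#EXT-X-VERSION:3
#EXT-X-STREAM-INF:BANDWIDTH=800000,RESOLUTION=640x360
360p/playlist.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=2500000,RESOLUTION=1280x720
720p/playlist.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=8000000,RESOLUTION=1920x1080
1080p/playlist.m3u8"""

def parse_master_playlist(text: str):
    variants, pending = [], None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("#EXT-X-STREAM-INF:"):
            attrs = dict(kv.split("=", 1)
                         for kv in line.split(":", 1)[1].split(","))
            pending = {"bandwidth": int(attrs["BANDWIDTH"]),
                       "resolution": attrs.get("RESOLUTION")}
        elif line and not line.startswith("#") and pending:
            pending["uri"] = line   # URI line follows its STREAM-INF tag
            variants.append(pending)
            pending = None
    return variants

variants = parse_master_playlist(MANIFEST)
assert [v["bandwidth"] for v in variants] == [800000, 2500000, 8000000]
```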
Step 5: Adaptive Bitrate Streaming (ABR)
This is how YouTube handles varying network conditions without buffering:
- Player starts at a medium quality (e.g., 480p).
- Player monitors download speed of each segment.
- If segments download faster than playback speed → player upgrades to higher quality.
- If segments download slower than playback speed → player downgrades to lower quality.
Because each segment is a separate HTTP GET request, the player can change quality between segments seamlessly. No interruption in playback.
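The decision loop above reduces to a small selection function: pick the highest bitrate that fits comfortably under measured throughput. The 0.8 safety factor and the 480p rung are illustrative assumptions; real players use more sophisticated, buffer-aware heuristics:

```python
# ABR rung selection: highest bitrate that fits within measured throughput.
LADDER_KBPS = {"360p": 800, "480p": 1200, "720p": 2500, "1080p": 8000}
SAFETY = 0.8  # headroom so one slow segment doesn't stall playback

def pick_quality(throughput_kbps: float) -> str:
    budget = throughput_kbps * SAFETY
    fitting = [q for q, kbps in LADDER_KBPS.items() if kbps <= budget]
    # Below the lowest rung, still serve lowest quality rather than stall.
    return max(fitting, key=LADDER_KBPS.get) if fitting else "360p"

# Throughput is re-measured per segment; quality can change at any boundary.
assert pick_quality(12_000) == "1080p"   # fast connection: top rung fits
assert pick_quality(3_500) == "720p"     # budget 2800 kbps: 720p fits
assert pick_quality(600) == "360p"       # below ladder: degrade, don't stop
```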
Step 6: CDN Delivery
Egress at this scale cannot come from origin servers. CDNs (Akamai, Cloudflare, AWS CloudFront, or YouTube's own edge network) cache video segments at edge nodes globally.
Viewer in Tokyo requests segment:
→ DNS resolves to nearest CDN edge node (Tokyo PoP)
→ Cache hit? Serve immediately (< 10ms latency)
→ Cache miss? Edge fetches from CDN origin (S3) → caches → serves
→ Next viewer in Tokyo: cache hit
Since video segments are static files served via HTTP, they are trivially cacheable. A popular video’s segments get cached at every CDN edge node globally after the first few viewers in each region watch it.
Long-tail videos (rarely watched) may not be cached at the edge — they fall back to origin. This is acceptable since their traffic volume is low.
Step 7: Metadata Storage
-- Videos table (relational - Postgres/MySQL)
CREATE TABLE videos (
    video_id     VARCHAR(11) PRIMARY KEY,  -- YouTube-style ID: "dQw4w9WgXcQ"
    uploader_id  BIGINT,
    title        TEXT,
    description  TEXT,
    duration_sec INT,
    status       ENUM('processing', 'ready', 'deleted'),  -- MySQL ENUM; use an enum type or CHECK in Postgres
    created_at   TIMESTAMP,
    view_count   BIGINT DEFAULT 0,
    like_count   BIGINT DEFAULT 0
);

-- Per-resolution video files (S3 path index)
CREATE TABLE video_files (
    video_id     VARCHAR(11),
    resolution   VARCHAR(10),  -- '360p', '720p', etc.
    manifest_url TEXT,         -- S3/CDN URL to HLS manifest
    PRIMARY KEY (video_id, resolution)
);
Video search uses Elasticsearch — full-text search over titles, descriptions, and transcripts (YouTube auto-generates transcripts). Recommendations are a separate ML system outside this design.
Step 8: View Count at Scale
View counts are not as simple as UPDATE videos SET view_count = view_count + 1. At 1B views/day, that's ~11,500 UPDATE queries/second in aggregate, and a viral video concentrates thousands of them on a single row. This causes lock contention and is a classic hotspot problem.
Solution — count in Redis, batch to DB:
View event → INCR video_view_count:{video_id} (Redis, atomic, fast)
↓ (background job, every 60 seconds)
UPDATE videos SET view_count = view_count + {delta}
WHERE video_id = ?
Alternatively, stream view events to Kafka → analytics consumer aggregates → periodic DB update. This decouples the view counter from the hot read path entirely.
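The counter-batching pattern can be sketched as follows. A plain `Counter` stands in for Redis `INCR`, a dict stands in for the videos table, and `flush` plays the 60-second background job:

```python
# Batched view counting: absorb increments in a fast in-memory counter,
# flush deltas to the database periodically instead of one UPDATE per view.
from collections import Counter

class ViewCounter:
    def __init__(self, db):
        self.pending = Counter()  # stands in for Redis INCR per video_id
        self.db = db              # stands in for the videos table

    def record_view(self, video_id: str):
        self.pending[video_id] += 1         # O(1), no DB lock contention

    def flush(self):
        # Background job: one UPDATE per video per interval, not per view.
        for video_id, delta in self.pending.items():
            self.db[video_id] = self.db.get(video_id, 0) + delta
        self.pending.clear()

db = {}                                     # {video_id: view_count}
counter = ViewCounter(db)
for _ in range(10_000):
    counter.record_view("dQw4w9WgXcQ")      # 10K views touch only memory
counter.flush()                             # a single write reaches the DB
assert db["dQw4w9WgXcQ"] == 10_000
```

The trade-off is bounded staleness: the displayed count lags by at most one flush interval, which is acceptable for view counters.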
High-Level Architecture Summary
Upload: User → Upload Service → S3 (raw)
↓ Kafka
Transcoding Fleet → S3/CDN (HLS segments)
↓
Metadata DB (video ready)
Stream: User → DNS → CDN Edge
Cache hit → serve segments immediately
Cache miss → CDN Origin (S3) → cache → serve
Manifest → player → ABR → request next segment
Metadata: Video page → API → Postgres (title, desc, counts)
Search → Elasticsearch
View counts → Redis → Kafka → Postgres (async)
Follow-up Questions
Q: How do you handle live streaming (different from on-demand)?
Live streaming uses a real-time ingest protocol (RTMP or WebRTC). Segments are 2s instead of 10s to reduce latency. There’s no pre-transcoding — segments are transcoded as they arrive. CDN cache TTLs are very short (2–10s). The tail of a live stream is treated as on-demand after broadcast ends.
Q: How do you recommend videos?
Recommendations are a separate ML system. Inputs: watch history, likes, search queries, video metadata, user demographics. Outputs: ranked video IDs. The recommendation service is called by the Feed Service; it’s not part of the video serving pipeline.
Q: How do you detect policy violations (copyright, inappropriate content)?
After transcoding, run video fingerprinting (Content ID) against a database of copyrighted content. Run ML classifiers for inappropriate content. Both happen asynchronously — the video may be briefly live before being flagged. For clearly illegal content, hold for manual review before publishing.
Summary
YouTube’s architecture splits cleanly into upload/transcoding and streaming/delivery. Chunked resumable uploads handle large files. Transcoding parallelizes by splitting video into segments processed by a worker fleet. HLS/DASH with adaptive bitrate gives smooth playback across network conditions. CDN edge caching absorbs virtually all egress bandwidth. View counts are handled with Redis counters flushed asynchronously to the DB. The metadata store is relational (Postgres) for structured queries and Elasticsearch for full-text search.
Related System Design Topics
- Message Queues — Kafka coordinates the transcoding pipeline; each segment transcoding job is a Kafka message.
- Caching Strategies — view counts are batched in Redis before flushing to the DB; hot video metadata is cached.
- Database Sharding — the video metadata table is sharded by video_id at YouTube’s scale.
- Load Balancing — upload and streaming APIs are behind load balancers; CDN edge nodes are themselves a form of distributed load balancing.
- SQL vs NoSQL — Postgres for structured metadata, Elasticsearch for search, S3 for object storage — a multi-database design.
- Design Twitter Feed — compare how fanout and caching work differently for video vs text content.