YouTube is the second largest search engine in the world, serving over 1 billion hours of video daily to 2+ billion monthly active users. Designing a video platform like YouTube touches nearly every system design concept: media processing, CDN distribution, recommendation engines, real-time search, and comment systems, which is why it is often treated as the ultimate system design question. This guide provides a comprehensive architecture covering upload through playback.
Video Upload and Processing Pipeline
Upload flow:
1. The client requests an upload URL from the API server. The server generates a presigned S3/GCS URL with a resumable upload session.
2. The client uploads the raw video directly to object storage using a resumable upload; if the connection drops, it resumes from the last uploaded chunk, which is critical for large files over mobile networks.
3. An S3 event notification triggers the transcoding pipeline.
4. The transcoding service splits the video into segments and encodes each in parallel across GPU workers. The output is multiple resolution/bitrate combinations (144p through 4K), each segmented for adaptive bitrate streaming. YouTube encodes each video into approximately 20+ streams.
5. Manifests (HLS .m3u8 / DASH .mpd) are generated, listing all available streams and segments.
6. Thumbnails are auto-generated: frames are extracted at multiple timestamps and an ML model selects the most visually appealing one. The creator can also upload a custom thumbnail.
7. Content moderation: ML models scan for policy violations (violence, nudity, copyright). Flagged videos are queued for human review.
8. Once processing completes, the video status changes to AVAILABLE and the video appears in search and recommendations.
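The resumable upload in step 2 can be sketched as follows. This is a minimal in-memory simulation with a hypothetical `UploadSession` API; real uploads go through GCS resumable upload sessions or S3 multipart uploads, with chunk sizes in the 8-64 MB range rather than the toy size used here.

```python
CHUNK_SIZE = 4  # tiny for illustration; real uploads use 8-64 MB chunks

class UploadSession:
    """Server side: accumulates received bytes and reports the committed offset."""
    def __init__(self):
        self.data = bytearray()

    def committed_offset(self):
        return len(self.data)

    def put_chunk(self, offset, chunk):
        if offset != len(self.data):   # reject out-of-order or duplicate chunks
            raise ValueError("offset mismatch")
        self.data.extend(chunk)

def resumable_upload(session, blob, fail_after=None):
    """Client side: ask the server where to resume, then stream chunks.
    fail_after simulates a dropped connection after N chunks."""
    sent = 0
    offset = session.committed_offset()   # resume from last committed byte
    while offset < len(blob):
        if fail_after is not None and sent >= fail_after:
            return False                  # connection dropped mid-upload
        chunk = blob[offset:offset + CHUNK_SIZE]
        session.put_chunk(offset, chunk)
        offset += len(chunk)
        sent += 1
    return True

session = UploadSession()
video = b"raw-video-bytes-0123456789"
resumable_upload(session, video, fail_after=2)  # drops partway through
resumable_upload(session, video)                # resumes and finishes
assert bytes(session.data) == video
```

The key property is that the client never re-sends bytes the server has already committed, so a multi-gigabyte upload survives flaky mobile connections.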
Video Storage and CDN
Storage: YouTube stores over 800 million videos, each with 20+ encoded versions at different resolutions. A 10-minute video across all resolutions totals approximately 1-5 GB; total storage runs to exabytes. Everything is stored in Google Cloud Storage (object storage) with multi-region replication.

CDN: Google operates a global CDN with edge caches inside ISP networks (similar to Netflix Open Connect). Popular videos are cached at the edge; long-tail videos (viewed rarely) are served from regional caches or the origin. Cache tiering: L1 (ISP edge, hot content) -> L2 (regional PoP, warm content) -> origin (cold content). Approximately 80% of views are served from L1/L2 cache.

Adaptive bitrate streaming: the player estimates bandwidth and selects the appropriate quality for each segment. If bandwidth drops, the next segment is fetched at a lower quality. This prevents rebuffering while maximizing quality. The player maintains a 30-second buffer.

Cost optimization: popular videos are encoded into all quality levels immediately. Rarely-viewed videos are initially encoded only at the most common resolutions (360p, 720p), with higher resolutions encoded on demand when requested. This saves significant transcoding compute for the long tail.
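The player's per-segment quality decision can be sketched with a simple rate-based heuristic. The bitrate ladder, safety margin, and buffer threshold below are illustrative assumptions; production players blend throughput estimates with buffer occupancy in more sophisticated ways.

```python
LADDER = [  # (label, bitrate in kbps) -- an assumed, illustrative ladder
    ("144p", 200), ("360p", 700), ("720p", 3000),
    ("1080p", 6000), ("4K", 18000),
]

def pick_quality(estimated_kbps, buffer_seconds, safety=0.8):
    """Choose the highest rendition that fits the bandwidth estimate.
    A nearly empty buffer forces a more conservative pick to avoid
    rebuffering; a healthy buffer lets the player be aggressive."""
    budget = estimated_kbps * safety
    if buffer_seconds < 10:        # low buffer: halve the budget
        budget *= 0.5
    choice = LADDER[0]             # lowest rendition is the fallback
    for label, kbps in LADDER:
        if kbps <= budget:
            choice = (label, kbps)
    return choice[0]

print(pick_quality(8000, 30))   # healthy buffer on a fast link -> 1080p
print(pick_quality(8000, 5))    # same link, low buffer -> 720p
```

Because the decision is re-made per segment, a bandwidth drop only costs one lower-quality segment rather than a stall.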
Video Search
YouTube search indexes video metadata: title, description, tags, closed captions (auto-generated speech-to-text), channel name, and category. The search system is built on a distributed inverted index (similar to Elasticsearch).

Query processing:
1. Parse and normalize the query (spell correction, synonym expansion).
2. Retrieve candidate videos matching the query terms from the inverted index.
3. Rank candidates using an ML model that considers text relevance (BM25 score on title/description), video engagement (click-through rate, watch time, likes), freshness (newer videos are boosted for trending topics), creator authority (channel subscriber count, historical performance), and personalization (the user's watch history, preferred categories, and language).
4. Diversify results (avoid showing 10 videos from the same channel).
5. Insert ads at designated positions.

The ranking model is trained on billions of query-click pairs. The primary optimization metric is user satisfaction, measured by watch time rather than clicks: a clickbait video with high CTR but low watch time is ranked lower.

Auto-suggest: as the user types, completions are suggested from popular queries using a trie or Elasticsearch's completion suggester (covered in our Search Autocomplete guide).
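The blended ranking in step 3 can be illustrated with a hand-weighted scoring function. The weights and feature names here are hypothetical stand-ins for a learned model, but the sketch shows concretely why watch time demotes clickbait.

```python
import math

def rank_score(text_relevance, ctr, avg_watch_fraction, age_days,
               subscriber_count):
    """Blend relevance with satisfaction signals (hypothetical weights).
    Engagement multiplies CTR by watch fraction, so a video people click
    but immediately abandon scores poorly despite its high CTR."""
    engagement = ctr * avg_watch_fraction        # clicks alone are not enough
    freshness = 1.0 / (1.0 + age_days / 30.0)    # decays over roughly a month
    authority = math.log10(1 + subscriber_count)
    return (2.0 * text_relevance + 5.0 * engagement
            + 0.5 * freshness + 0.3 * authority)

# Same query relevance and channel, different viewer satisfaction:
clickbait = rank_score(0.9, ctr=0.20, avg_watch_fraction=0.05,
                       age_days=2, subscriber_count=10_000)
solid = rank_score(0.9, ctr=0.08, avg_watch_fraction=0.70,
                   age_days=2, subscriber_count=10_000)
assert solid > clickbait   # watch time outweighs raw clicks
```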
Recommendation Engine
Recommendations drive 70%+ of YouTube watch time. The system has two stages.

(1) Candidate generation: from a corpus of 800M+ videos, generate a few thousand candidates. Sources include the user's watch history (videos from channels the user has watched), collaborative filtering (users with similar watch history also watched X), content-based signals (videos with similar titles, categories, or audio features), and trending/popular videos in the user's region. Deep neural networks learn user and video embeddings in the same vector space; nearest-neighbor retrieval finds candidates.

(2) Ranking: an ML model scores each candidate for the specific user. Features include video age, channel relationship (is the user subscribed?), topic match (inferred from watch history), predicted watch time, predicted engagement (like, share, comment), and negative signals (predicted dislike, predicted "not interested"). The ranker produces a score per video, and the recommendation feed is ordered by score with diversity injection (avoiding the same topic shown consecutively).

The recommendation system runs both offline (candidate generation with Spark/MapReduce) and online (ranking at request time with a serving model). Model updates are deployed multiple times per day.
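Candidate generation via shared embeddings can be sketched with toy vectors. The 3-dimensional embeddings and video names below are invented for illustration; at YouTube scale this is an approximate nearest-neighbor lookup (e.g. with an ANN index) over learned, high-dimensional embeddings.

```python
def dot(a, b):
    """Similarity in the shared user/video vector space."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical 3-d embeddings; dimensions loosely = (gaming, music, cooking)
video_embeddings = {
    "speedrun_v1": (0.9, 0.1, 0.0),
    "guitar_v2":   (0.1, 0.9, 0.0),
    "pasta_v3":    (0.0, 0.1, 0.9),
    "esports_v4":  (0.8, 0.2, 0.0),
}

def candidates(user_embedding, k=2):
    """Top-k videos closest to the user embedding (brute force here;
    real systems use approximate nearest-neighbor search)."""
    scored = sorted(video_embeddings.items(),
                    key=lambda kv: dot(user_embedding, kv[1]),
                    reverse=True)
    return [vid for vid, _ in scored[:k]]

gamer = (1.0, 0.2, 0.0)                      # user who mostly watches gaming
print(candidates(gamer))                     # ['speedrun_v1', 'esports_v4']
```

The output of this stage is only a shortlist; the separate ranking model then scores each shortlisted video with far richer per-user features.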
Comments System
YouTube comments are a high-write, high-read system: a popular video receives thousands of comments per minute.

Architecture: comments are stored in a distributed database (Spanner or Bigtable) partitioned by video_id. Each comment carries: comment_id, video_id, user_id, parent_comment_id (for replies), text, created_at, like_count.

Comment display: the default sort is "Top comments" (ranked by engagement: likes, replies, recency); "Newest first" is a simple time sort. The top-comments ranking requires a scoring model considering like count, reply count, comment age, commenter subscriber count, and sentiment (positive comments ranked higher).

Comment moderation: ML models classify comments in real time for spam, hate speech, self-promotion, and scams. Flagged comments are hidden or held for review. Creators can set moderation levels (hold all for review, allow subscribers only, block specific words).

Read path: the first 20 comments are loaded with the video page; subsequent comments are loaded on scroll (infinite pagination with a cursor keyed on the sort key). Comment counts are displayed as approximations ("1.2K comments") using a cached counter updated every few seconds.
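The cursor-based pagination on the read path can be sketched as follows, with an in-memory list standing in for a video_id partition. The cursor encodes the sort key of the last returned comment, so pages stay stable even as new comments arrive between requests (unlike offset-based paging, where inserts shift every page).

```python
comments = [  # (created_at, comment_id, text), pre-sorted newest-first
    (105, "c5", "fifth"), (104, "c4", "fourth"), (103, "c3", "third"),
    (102, "c2", "second"), (101, "c1", "first"),
]

def page(cursor=None, limit=2):
    """Return up to `limit` comments strictly after the cursor,
    plus the cursor to pass for the next page."""
    if cursor is None:
        items = comments[:limit]
    else:
        # Keep only comments sorting after the cursor's (created_at, id)
        items = [c for c in comments if (c[0], c[1]) < cursor][:limit]
    next_cursor = (items[-1][0], items[-1][1]) if items else None
    return items, next_cursor

p1, cur = page()
p2, cur = page(cur)
print([c[1] for c in p1], [c[1] for c in p2])   # ['c5', 'c4'] ['c3', 'c2']
```

For the "Top comments" sort, the same scheme works with (score, comment_id) as the cursor key instead of (created_at, comment_id).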
Live Streaming
YouTube Live adds real-time constraints to the architecture. The creator streams via RTMP (Real-Time Messaging Protocol) to a YouTube ingest server, which transcodes the stream in real time into multiple ABR quality levels and segments them for HLS/DASH delivery. Ultra-low-latency mode uses segments as short as 2 seconds (vs 6-10 seconds for normal live); the trade-off is that shorter segments mean lower latency but higher CDN load (more requests per second).

Live chat: a real-time messaging system alongside the video. Messages are broadcast to all viewers via WebSocket or server-sent events. For streams with millions of viewers, messages are sampled (not every message is shown to every viewer), and a "top chat" filter shows only highlighted/super chat messages.

DVR (rewind): viewers can rewind the live stream up to 4 hours. Segments are stored in object storage as they are created, allowing on-demand playback of past segments while the stream continues. After the stream ends, the recording is processed like a regular upload (additional transcoding, thumbnail generation, indexing for search).
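The chat-sampling idea can be sketched as a per-viewer delivery decision. The 10,000-viewer threshold and the keep fraction below are assumptions for illustration, not YouTube's actual policy; the point is that delivered message volume stays roughly constant as the audience grows.

```python
import random

def should_deliver(message, viewer_count, rng=random.random):
    """Decide, per viewer, whether to show a chat message.
    Highlighted/paid messages always go through; regular messages are
    sampled once the audience exceeds a threshold (assumed values)."""
    if message.get("super_chat"):
        return True                        # super chats are always shown
    if viewer_count <= 10_000:
        return True                        # small stream: show everything
    keep_fraction = 10_000 / viewer_count  # caps per-viewer message rate
    return rng() < keep_fraction

# Paid and small-stream messages always deliver; large streams sample.
assert should_deliver({"super_chat": True}, viewer_count=2_000_000)
assert should_deliver({"text": "hi"}, viewer_count=500)
```

With 1M viewers the keep fraction is 1%, so each viewer sees about the same chat velocity as on a 10K-viewer stream.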