YouTube on mobile is the canonical “design a video app” interview. Billions of users, infinite catalog, every device on earth. The interview tests your ability to balance playback latency, network bandwidth, decoder constraints, and the realities of someone watching a 4K video on a 5-year-old Android phone over LTE.
Functional requirements
- Browse a personalized feed of videos
- Search and play videos on demand
- Adaptive bitrate streaming based on network
- Subtitles and captions in multiple languages
- Offline downloads for Premium users
- Background audio playback
Non-functional requirements
- Time-to-first-frame under 1 second on a good network
- Smooth playback with no rebuffering events at the 99th percentile
- Reasonable cellular data usage (for example, a lower default resolution on cellular and a data-saver mode)
Architecture
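Three core client components: the player (decoder and renderer), the manifest fetcher, and the buffer manager, which tracks how much video is downloaded ahead of the playhead and decides when to fetch the next segment.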
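To make the buffer manager's role concrete, here is a minimal sketch. The class name, the watermark values, and the hysteresis policy are assumptions for illustration, not YouTube's implementation: fetching pauses once the buffer reaches a high watermark and resumes only after it drains below a low one.

```python
from dataclasses import dataclass


@dataclass
class BufferManager:
    """Hypothetical buffer manager: tracks seconds of video buffered ahead of
    the playhead and tells the fetcher whether to request another segment."""

    low_watermark_s: float = 10.0   # assumed: resume fetching below this level
    high_watermark_s: float = 30.0  # assumed: pause fetching at or above this level
    buffered_s: float = 0.0
    fetching: bool = True

    def on_segment_appended(self, duration_s: float) -> None:
        # A downloaded segment adds its duration to the buffer.
        self.buffered_s += duration_s

    def on_playback_advanced(self, elapsed_s: float) -> None:
        # Playback drains the buffer; reaching zero means the player stalls.
        self.buffered_s = max(0.0, self.buffered_s - elapsed_s)

    def should_fetch(self) -> bool:
        # Hysteresis: stop at the high watermark and resume only after
        # draining below the low watermark, so downloads happen in bursts.
        if self.buffered_s >= self.high_watermark_s:
            self.fetching = False
        elif self.buffered_s < self.low_watermark_s:
            self.fetching = True
        return self.fetching


# Usage: keep appending 4-second segments until the manager says stop.
bm = BufferManager()
while bm.should_fetch():
    bm.on_segment_appended(4.0)
print(bm.buffered_s)  # prints 32.0
```

Downloading in bursts rather than continuously is one common mobile choice, because it lets the cellular radio drop to a low-power state between bursts.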
The video manifest
YouTube uses DASH (and HLS for Apple devices). When a user taps a video:
- Client fetches a manifest (a few KB) describing all available bitrates and codecs
- Client picks the appropriate variant for the device and current network conditions
- Client requests segments (typically 2–4 seconds each) and feeds them to the decoder
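A minimal sketch of the variant-selection step, assuming a heavily trimmed, hypothetical MPD (real manifests also carry segment templates, audio adaptation sets, and DRM information). It uses Python's standard `xml.etree.ElementTree` to read each `Representation`'s `bandwidth`, `height`, and `codecs` attributes, filters by what the device can decode and display, then picks the highest bitrate that fits a bandwidth estimate. The function name and the fallback rule are illustrative.

```python
import xml.etree.ElementTree as ET

MPD_NS = {"mpd": "urn:mpeg:dash:schema:mpd:2011"}

# Hypothetical, heavily trimmed manifest: one video AdaptationSet, four variants.
SAMPLE_MPD = """<MPD xmlns="urn:mpeg:dash:schema:mpd:2011" type="static">
  <Period>
    <AdaptationSet mimeType="video/mp4">
      <Representation id="v240" bandwidth="300000" height="240" codecs="avc1.4d4015"/>
      <Representation id="v480" bandwidth="1000000" height="480" codecs="avc1.4d401e"/>
      <Representation id="v720" bandwidth="2500000" height="720" codecs="avc1.4d401f"/>
      <Representation id="v1080" bandwidth="4500000" height="1080" codecs="av01.0.08M.08"/>
    </AdaptationSet>
  </Period>
</MPD>"""


def choose_representation(mpd_xml, supported_codecs, max_height, bandwidth_bps):
    """Return the id of the best playable Representation, or None if none is playable."""
    root = ET.fromstring(mpd_xml)
    playable = []
    for rep in root.iterfind(".//mpd:Representation", MPD_NS):
        codecs = rep.get("codecs", "")
        height = int(rep.get("height", "0"))
        bandwidth = int(rep.get("bandwidth", "0"))  # DASH @bandwidth is in bits per second
        # Device filter: decoder support (codec prefix such as "avc1") and screen size.
        if codecs.startswith(tuple(supported_codecs)) and height <= max_height:
            playable.append((bandwidth, rep.get("id")))
    if not playable:
        return None
    # Network filter: highest bitrate within the estimate, else the cheapest variant.
    affordable = [p for p in playable if p[0] <= bandwidth_bps]
    return max(affordable)[1] if affordable else min(playable)[1]


print(choose_representation(SAMPLE_MPD, ["avc1"], 1080, 3_000_000))  # prints v720
```

Here the AV1 variant is skipped because the caller only declared `avc1` support, so the 2.5 Mbps H.264 rung wins.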
Adaptive bitrate (ABR)
The ABR algorithm decides which bitrate to use for the next segment based on:
- Recent download throughput (bandwidth estimation)
- Buffer occupancy (how much video is already downloaded)
- Device capabilities (screen size, decoder support)
- User preferences (data saver, “always HD”)
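Bandwidth estimation is usually smoothed so that one unusually fast or slow segment does not swing the choice. A minimal sketch, assuming an exponentially weighted moving average (one common option; harmonic means and percentile-based estimators are also used) with a conservative seed and a safety discount. The class name and all three constants are hypothetical.

```python
class ThroughputEstimator:
    """Hypothetical bandwidth estimator: an exponentially weighted moving
    average (EWMA) of per-segment throughput, seeded conservatively."""

    def __init__(self, initial_bps=500_000.0, alpha=0.3, safety=0.8):
        self.estimate_bps = initial_bps  # assumed seed used before any segment completes
        self.alpha = alpha               # assumed weight given to the newest sample
        self.safety = safety             # assumed discount applied before the ABR uses it

    def add_sample(self, bytes_downloaded, seconds):
        if seconds <= 0:
            return  # ignore degenerate timings (e.g., a response served from cache)
        sample_bps = bytes_downloaded * 8 / seconds
        self.estimate_bps = self.alpha * sample_bps + (1 - self.alpha) * self.estimate_bps

    def usable_bps(self):
        return self.estimate_bps * self.safety


est = ThroughputEstimator()
est.add_sample(1_000_000, 2.0)  # one fast segment: 4 Mbps measured
print(round(est.estimate_bps))  # rises only part of the way toward 4 Mbps
```

The low seed plus partial weighting is what produces a gradual ramp-up instead of a jump after the first segment.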
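Algorithms range from BBA (buffer-based approach, which picks the bitrate from buffer occupancy alone) to MPC (model predictive control, which optimizes choices over a short look-ahead horizon using throughput predictions). A common production approach is a hybrid that prioritizes buffer health over aggressive bitrate increases.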
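A minimal sketch of a hybrid rule in that spirit, not YouTube's actual algorithm: a BBA-style target maps buffer occupancy linearly onto the bitrate ladder between a "reservoir" and a "cushion", and the throughput estimate caps it, so the player never picks a rung that either signal rejects. The function name, parameter names, and defaults are assumptions.

```python
def choose_bitrate(ladder_bps, usable_bps, buffer_s, reservoir_s=5.0, cushion_s=20.0):
    """Hypothetical hybrid ABR rule: a BBA-style buffer target capped by throughput."""
    ladder = sorted(ladder_bps)
    if buffer_s <= reservoir_s:
        return ladder[0]  # buffer nearly empty: take the lowest rung to avoid a stall
    # Map buffer occupancy linearly onto the ladder between the reservoir and
    # reservoir + cushion; above that, the buffer alone would allow the top rung.
    fraction = min(1.0, (buffer_s - reservoir_s) / cushion_s)
    buffer_target = ladder[0] + fraction * (ladder[-1] - ladder[0])
    # Both signals must agree: pick the highest rung not above either limit.
    limit = min(buffer_target, usable_bps)
    eligible = [b for b in ladder if b <= limit]
    return eligible[-1] if eligible else ladder[0]


LADDER = [300_000, 1_000_000, 2_500_000, 4_500_000]
print(choose_bitrate(LADDER, usable_bps=10_000_000, buffer_s=15.0))  # prints 1000000
```

With the cushion only half full, the buffer caps the choice at 1 Mbps even though throughput would allow more: buffer health wins over bitrate aggression.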
Captions and subtitles
Captions arrive as a separate WebVTT or TTML file, and the renderer overlays each cue on the video between its start and end timestamps. A caption track for a given language is downloaded only when the user selects it.
Auto-generated captions are produced server-side and served on demand; the mobile app does not run automatic speech recognition (ASR) locally.
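A minimal sketch of the caption side, assuming a simple WebVTT file: it parses cue timings (hh:mm:ss.ttt or mm:ss.ttt), skips cue identifiers and cue settings, and returns the cues active at a playback time. Styling, positioning, and TTML are out of scope; the function names and the sample track are hypothetical.

```python
import re
from dataclasses import dataclass

# Matches "start --> end"; each timestamp is hh:mm:ss.ttt or mm:ss.ttt.
TIMING = re.compile(
    r"((?:\d+:)?\d{2}:\d{2}\.\d{3})\s+-->\s+((?:\d+:)?\d{2}:\d{2}\.\d{3})"
)


@dataclass
class Cue:
    start: float
    end: float
    text: str


def to_seconds(ts):
    parts = ts.split(":")
    hours = int(parts[0]) if len(parts) == 3 else 0
    return hours * 3600 + int(parts[-2]) * 60 + float(parts[-1])


def parse_webvtt(doc):
    cues = []
    # Cues are separated by blank lines; the "WEBVTT" header block has no timing line.
    for block in re.split(r"\n\s*\n", doc.strip()):
        lines = block.splitlines()
        for i, line in enumerate(lines):
            match = TIMING.search(line)
            if match:  # an optional cue identifier line before this one is skipped
                text = "\n".join(lines[i + 1:])
                cues.append(Cue(to_seconds(match.group(1)), to_seconds(match.group(2)), text))
                break
    return cues


def active_cues(cues, t):
    # Linear scan for clarity; a real renderer would index cues by start time.
    return [c.text for c in cues if c.start <= t < c.end]


SAMPLE_VTT = """WEBVTT

1
00:00:01.000 --> 00:00:03.500
Hello there.

00:00:03.000 --> 00:00:05.000 align:start
Overlapping line.
"""

print(active_cues(parse_webvtt(SAMPLE_VTT), 3.2))  # both cues overlap at 3.2 s
```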
Background audio
YouTube Premium allows audio-only playback when the screen is off. Implementation:
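- iOS: configure AVAudioSession with the playback category and declare the audio background mode (UIBackgroundModes) in the app's Info.plist
- Android: run playback in a foreground service with a media notification, and expose transport controls through the MediaSession API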
The video decoder is paused; the audio stream continues.
Offline downloads
The download manager fetches all segments at the chosen resolution, encrypts them with a per-device key, and stores them locally. Downloads carry a TTL and must be periodically re-validated with the server; if the device stays offline too long, they expire.
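A minimal sketch of the TTL re-validation rule, assuming the client records when the server last confirmed that a download is still allowed. The class, field, and function names are hypothetical, and the TTL is a parameter rather than any known YouTube setting. The per-device encryption step is omitted because Python's standard library has no block cipher.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class OfflineDownload:
    """Hypothetical record kept by a download manager for one offline video."""

    video_id: str
    last_validated: datetime  # last time the server confirmed the download is still allowed
    ttl: timedelta            # how long it stays playable without re-validation

    def is_playable(self, now):
        return now - self.last_validated <= self.ttl

    def revalidate(self, now, server_ok):
        # Run whenever the device is online; only a positive answer resets the clock.
        if server_ok:
            self.last_validated = now


def expired(downloads, now):
    """Ids the download manager should block or purge because their TTL lapsed."""
    return [d.video_id for d in downloads if not d.is_playable(now)]


t0 = datetime(2024, 1, 1, tzinfo=timezone.utc)
dl = OfflineDownload("abc", last_validated=t0, ttl=timedelta(days=30))  # illustrative TTL
print(expired([dl], t0 + timedelta(days=31)))  # prints ['abc']
```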
Battery and data
- Default to 720p on cellular for most users
- Pause prefetching when the battery is low
- HEVC and AV1 save bandwidth where supported, but on devices without a hardware decoder for them, software decoding costs significantly more CPU and battery
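A minimal sketch of how these defaults might combine into a client-side policy. The data-saver cap, the 20% low-battery cutoff, the codec preference order, and the assumption that every title has an H.264 fallback are all illustrative, not documented YouTube behavior.

```python
from dataclasses import dataclass


@dataclass
class DeviceState:
    on_cellular: bool
    battery_fraction: float  # 0.0 (empty) to 1.0 (full)
    data_saver: bool
    always_hd: bool          # user preference overriding the cellular default


def max_resolution(state):
    if state.data_saver:
        return 480   # assumed data-saver cap
    if state.on_cellular and not state.always_hd:
        return 720   # the cellular default from the list
    return 2160


def allow_prefetch(state, low_battery=0.2):  # assumed 20% cutoff
    return state.battery_fraction > low_battery and not state.data_saver


def pick_codec(available, hw_decoders):
    # Prefer the most bandwidth-efficient codec the device decodes in hardware
    # (order assumed); software decoding is where the extra CPU and battery cost appears.
    for codec in ("av1", "hevc", "vp9"):
        if codec in available and codec in hw_decoders:
            return codec
    return "h264"  # assumed universal fallback present for every title


phone = DeviceState(on_cellular=True, battery_fraction=0.15, data_saver=False, always_hd=False)
print(max_resolution(phone), allow_prefetch(phone),
      pick_codec({"av1", "vp9", "h264"}, {"vp9", "h264"}))  # prints 720 False vp9
```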
Frequently Asked Questions
Why does playback sometimes stall right after start?
The ABR may have overestimated bandwidth from the first segment and chosen a bitrate the network cannot sustain. Modern algorithms start from a conservative initial estimate and ramp up as throughput measurements accumulate.
How does YouTube reduce time-to-first-frame?
Three levers: aggressive prefetch of the first segment when a thumbnail is highly likely to be tapped, server-side transcoding and packaging for fast start (for example, fast-start MP4 with the metadata at the front of the file), and CDN edges close to the user.
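A minimal sketch of the first lever, assuming a hypothetical model that scores each visible feed item with a tap probability: prefetch only first segments, likeliest taps first, above a threshold and within a byte budget. The threshold, budget, and names are assumptions.

```python
from dataclasses import dataclass


@dataclass
class FeedItem:
    video_id: str
    tap_probability: float    # hypothetical model score for "user will tap this"
    first_segment_bytes: int


def plan_prefetch(items, threshold=0.3, budget_bytes=2_000_000):
    """Choose which first segments to prefetch: likeliest taps first, only above
    the threshold, without exceeding the byte budget (threshold and budget assumed)."""
    chosen, spent = [], 0
    for item in sorted(items, key=lambda i: i.tap_probability, reverse=True):
        if item.tap_probability < threshold:
            break  # sorted in descending order, so no later item qualifies
        if spent + item.first_segment_bytes <= budget_bytes:
            chosen.append(item.video_id)
            spent += item.first_segment_bytes
    return chosen


feed = [
    FeedItem("a", 0.6, 800_000),
    FeedItem("b", 0.4, 900_000),
    FeedItem("c", 0.35, 500_000),
    FeedItem("d", 0.1, 100_000),
]
print(plan_prefetch(feed))  # prints ['a', 'b']: "c" would exceed the budget
```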
What is the right segment size?
It is a tradeoff: smaller segments allow faster ABR adaptation but add more HTTP request overhead. 2–4 seconds is a common default for on-demand video; live streams use shorter segments (1–2 s) for lower latency.
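A back-of-the-envelope sketch of the tradeoff, assuming a hypothetical 800 bytes of HTTP header overhead per request (the real figure depends on the protocol and header compression). It counts requests and header bytes per minute of playback; per-request round-trip latency, which this ignores, also matters, especially on high-latency cellular links.

```python
def segment_tradeoff(segment_s, header_bytes_per_request=800):
    """Per minute of playback: request count and header overhead, plus how often
    the ABR gets a new decision point (once per segment)."""
    requests_per_min = 60 / segment_s
    return {
        "requests_per_min": requests_per_min,
        "header_kb_per_min": requests_per_min * header_bytes_per_request / 1000,
        "decision_interval_s": segment_s,  # ABR can switch bitrate only at segment boundaries
    }


for seconds in (1, 2, 4):
    print(seconds, segment_tradeoff(seconds))
```

Halving the segment length doubles the request rate and overhead, in exchange for twice as many chances per minute to react to a bandwidth change.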