Spotify serves 600M users, 100M tracks, and 9M daily podcast episodes. Designing a music streaming service covers audio delivery, catalog scale, real-time personalization, and offline sync — touching almost every major systems design concept. Common at Spotify (obviously), Apple, Amazon, and Google interviews.
Requirements
Functional: Search for tracks, albums, artists. Play audio without buffering. Create and follow playlists. Discover music through personalized recommendations (“Daily Mix,” “Discover Weekly”). Play offline (downloaded content). Social: see what friends are listening to.
Non-functional: Audio starts within 1 second of pressing play. 99.99% uptime — listening must work during brief backend outages. CDN-delivered audio at 320kbps (premium) or 96kbps (free). Catalog freshness: new tracks available within minutes of label upload. Scale: 600M users, 10M concurrent streams at peak.
Data Model
Track: track_id, title, artist_id[], album_id, duration_ms, release_date, explicit, audio_file_id, lyrics_id, play_count
Audio file: file_id, track_id, format (AAC/OGG/MP3), bitrate, storage_url, duration_ms, fingerprint (for dedup)
User listening history: user_id, track_id, played_at, ms_played, context (playlist_id or album_id)
Catalog data (tracks, albums, artists) lives in a relational database — highly relational, read-heavy. Listening history is append-only and high-volume — Cassandra or BigQuery partitioned by user_id and date.
Audio Storage and CDN Delivery
Audio files are stored in S3 or GCS. A single 3-minute track at 320kbps = ~7.5MB. With 100M tracks × multiple formats/bitrates × ~3 copies = ~10PB total storage. Files are transcoded on upload to multiple formats (AAC 320kbps, AAC 128kbps, OGG 96kbps) by a transcoding pipeline (AWS MediaConvert or custom FFmpeg workers on SQS).
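The fan-out from one master upload to several renditions might look like this sketch, which only builds the ffmpeg command lines; the target formats and flags are assumptions in the spirit of the pipeline above, not Spotify's actual configuration:

```python
# One uploaded master file becomes several delivery renditions.
TARGETS = [
    {"codec": "aac", "bitrate": "320k", "ext": "m4a"},
    {"codec": "aac", "bitrate": "128k", "ext": "m4a"},
    {"codec": "libvorbis", "bitrate": "96k", "ext": "ogg"},
]

def transcode_commands(master_path: str, track_id: str) -> list[list[str]]:
    """Build one ffmpeg invocation per target rendition."""
    cmds = []
    for t in TARGETS:
        out = f"{track_id}_{t['bitrate']}.{t['ext']}"
        cmds.append([
            "ffmpeg", "-i", master_path,
            "-c:a", t["codec"], "-b:a", t["bitrate"],
            "-vn", out,   # -vn: drop any embedded cover-art video stream
        ])
    return cmds
```

A worker would pop an upload notification off the queue, run these commands, and write the outputs back to object storage.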
Audio is delivered via CDN (Akamai, Cloudflare, or Fastly). The CDN caches popular tracks at edge nodes in 200+ PoPs worldwide. CDN cache hit rate is 85-95% for the most-played half of the catalog. On a cache miss, the CDN fetches from origin (S3 via a pre-signed URL). Track popularity is heavy-tailed — the top 1% of tracks account for 70% of plays.
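Origin fetches are authorized with expiring signed URLs. Below is a generic HMAC-based sketch in the spirit of an S3 pre-signed URL — the secret, hostname, and TTL are placeholders, not a real S3 signing scheme:

```python
import hmac, hashlib, time
from urllib.parse import urlencode

SECRET = b"origin-shared-secret"   # assumption: secret shared with the origin

def signed_audio_url(file_id: str, ttl_s: int = 300) -> str:
    """Expiring signed URL: the CDN can fetch this path from origin only
    until `expires`, and the signature binds the expiry to the file."""
    expires = int(time.time()) + ttl_s
    msg = f"{file_id}:{expires}".encode()
    sig = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    return (f"https://origin.example.com/audio/{file_id}?"
            + urlencode({"expires": expires, "sig": sig}))

def verify(file_id: str, expires: int, sig: str) -> bool:
    """Origin-side check: reject expired or tampered requests."""
    if time.time() > expires:
        return False
    msg = f"{file_id}:{expires}".encode()
    expected = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    return hmac.compare_digest(sig, expected)
```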
Adaptive Bitrate Streaming
Audio files are segmented into 10-second chunks. The client maintains a playback buffer of 30-60 seconds. The client dynamically selects bitrate based on measured bandwidth: good connection → 320kbps, poor connection → 96kbps. Prebuffering: when the user presses play, the client immediately requests the first 30 seconds of audio before the user hears anything — so initial playback starts within 200-500ms even on slow connections. This is why Spotify feels instant despite serving large audio files.
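The bitrate-selection logic described above can be sketched as a simple throughput-based heuristic; the thresholds and the 80% headroom factor are illustrative choices, not a documented client algorithm:

```python
BITRATES_KBPS = [96, 128, 320]   # renditions produced by the transcoder

def pick_bitrate(measured_kbps: float, buffer_s: float) -> int:
    """Take the highest rendition that fits within ~80% of measured
    bandwidth; drop to the floor when the buffer is nearly empty so the
    next chunk arrives before playback stalls."""
    if buffer_s < 5:                      # about to stall: be conservative
        return BITRATES_KBPS[0]
    usable = measured_kbps * 0.8          # headroom for bandwidth variance
    eligible = [b for b in BITRATES_KBPS if b <= usable]
    return max(eligible) if eligible else BITRATES_KBPS[0]
```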
Catalog Search
Elasticsearch powers search across 100M tracks. Documents are indexed by track title, artist name, album name, and genre tags, scored with BM25 relevance (Elasticsearch's default). Autocomplete: a separate prefix-indexed field (“search_as_you_type” mapping) supports instant suggestions as users type. Query flow: user types “billie” → Elasticsearch returns tracks matching prefix “billie” in <50ms → sorted by play_count, release recency, and personalization score → top 10 returned. Full-text relevance combines lexical matching with a popularity signal.
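A sketch of the autocomplete request body follows. The field names (title.sayt and its subfields, artist_name, play_count) are assumptions about the index mapping, not a documented schema; the bool_prefix match type is how Elasticsearch queries a search_as_you_type field:

```python
def autocomplete_query(prefix: str, size: int = 10) -> dict:
    """Build an Elasticsearch request body for prefix suggestions."""
    return {
        "size": size,
        "query": {
            "multi_match": {
                "query": prefix,
                "type": "bool_prefix",   # matches terms as the user types
                "fields": ["title.sayt", "title.sayt._2gram",
                           "title.sayt._3gram", "artist_name"],
            }
        },
        # Blend lexical relevance with popularity at sort time.
        "sort": ["_score", {"play_count": "desc"}],
    }
```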
Playlist Management
Playlists are stored with a collaborative editing model. Each playlist is a document in Cassandra: playlist_id, owner_id, name, track_id list (ordered), follower_count, is_public. Track reordering is stored as a sorted list of (track_id, position) pairs. Adding/removing tracks uses conditional updates (optimistic locking via a version column) to handle concurrent edits. Collaborative playlists use OT (Operational Transform) or CRDTs — the same technique as Google Docs — to merge concurrent edits without conflicts.
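The optimistic-locking path can be sketched with an in-memory stand-in for a conditional-update store; in production a Cassandra lightweight transaction (`IF version = ?`) would play the role of `put_if_version`:

```python
class VersionConflict(Exception):
    pass

class PlaylistStore:
    """In-memory stand-in for a store with conditional updates."""
    def __init__(self):
        self._rows = {}   # playlist_id -> (version, track_ids)

    def get(self, pid):
        return self._rows.get(pid, (0, []))

    def put_if_version(self, pid, expected_version, track_ids):
        version, _ = self.get(pid)
        if version != expected_version:
            raise VersionConflict
        self._rows[pid] = (version + 1, track_ids)

def add_track(store, pid, track_id, retries=3):
    """Read-modify-write with optimistic retry on concurrent edits."""
    for _ in range(retries):
        version, tracks = store.get(pid)
        try:
            store.put_if_version(pid, version, tracks + [track_id])
            return True
        except VersionConflict:
            continue   # someone else edited; re-read and retry
    return False
```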
Recommendations: Discover Weekly
Discover Weekly runs weekly for 600M users. The pipeline:
- Collaborative filtering: matrix factorization (ALS) on the 600M × 100M user-track play matrix. Users who played similar tracks are considered similar; their unexplored tracks become candidates. Run on Spark, takes ~6 hours on 1000 nodes.
- Content-based filtering: audio fingerprinting + NLP on track metadata extracts acoustic features (tempo, key, energy, danceability). Find tracks acoustically similar to the user's listening history.
- Re-ranking: take top-500 candidates from each source, re-rank using a gradient boosted tree with features: recency, audio similarity, diversity (no two consecutive tracks from the same artist), freshness (prefer unheard tracks).
- Delivery: pre-computed playlist written to Cassandra at midnight Sunday. 600M write operations in 6 hours = ~28K writes/second — within Cassandra cluster capacity.
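The diversity constraint in the re-ranking stage can be sketched as a greedy pass over scored candidates — a simplification of the gradient boosted re-ranker described above, keeping only the "no two consecutive tracks from the same artist" rule:

```python
def diversity_rerank(candidates, limit=30):
    """Greedy re-rank: walk candidates in score order, deferring any track
    whose artist matches the previously placed one, then backfill.
    candidates: list of (score, track_id, artist_id), highest score first."""
    playlist, deferred = [], []
    last_artist = None
    for score, track, artist in candidates:
        if artist == last_artist:
            deferred.append((score, track, artist))
            continue
        playlist.append(track)
        last_artist = artist
        if len(playlist) == limit:
            return playlist
    for _, track, _ in deferred:   # relax the constraint if we ran short
        playlist.append(track)
        if len(playlist) == limit:
            break
    return playlist
```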
Offline Sync
Premium users can download tracks for offline playback. The mobile client stores downloaded tracks in encrypted local storage (AES-256, key derived from user credentials). The sync service tracks which tracks are downloaded per device (device_id, track_id, downloaded_at, quality). On play, the client checks local storage first — if present and not expired (DRM licenses have TTL), play locally. If absent, stream from CDN. DRM (Widevine, FairPlay) ensures downloaded tracks cannot be shared — each download is encrypted with a device-specific key that expires after 30 days.
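The client-side playback decision reduces to a small state check. A minimal sketch, with illustrative field names (a real client would consult the DRM module rather than a plain expiry map):

```python
import time

def playback_source(local_files, licenses, track_id, online, now=None):
    """Decide where to play a track from.
    local_files: set of track_ids on disk.
    licenses: track_id -> license expiry (epoch seconds)."""
    now = now or time.time()
    if track_id in local_files:
        if licenses.get(track_id, 0) > now:
            return "local"
        if online:
            return "renew-license"   # fetch a fresh license, then play locally
        return "blocked"             # offline with an expired license
    return "stream" if online else "unavailable"
```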
Social: Friend Activity
The “Friend Activity” sidebar shows what friends are listening to in real time. Architecture: when a user starts playing a track, the mobile app sends a “now playing” event to a presence service. The presence service stores the event in Redis with a TTL of 60 seconds (refreshed every 30 seconds while playing). When a user opens the sidebar, the app queries the presence service for the user's follow list, batches the lookups (pipelined Redis GETs, one per friend_id), and returns the results. Real-time updates arrive via WebSocket push — the presence service fans out now-playing events to all followers who are currently online.
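The sidebar read path is a batched key lookup. A sketch using an in-memory stand-in for Redis so it runs standalone — a real deployment would use redis-py with `pipeline()` or `MGET`, and Redis itself would enforce the TTL:

```python
class FakeRedis:
    """Minimal stand-in so the sketch is self-contained."""
    def __init__(self):
        self._kv = {}
    def setex(self, key, ttl_s, value):
        self._kv[key] = value   # TTL expiry elided in this stand-in
    def mget(self, keys):
        return [self._kv.get(k) for k in keys]

def friend_activity(r, friend_ids):
    """Batch presence lookup: one round trip for all friends. A missing
    key means the friend is not currently listening (TTL expired)."""
    keys = [f"now_playing:{fid}" for fid in friend_ids]
    values = r.mget(keys)
    return {fid: v for fid, v in zip(friend_ids, values) if v is not None}
```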
Scale Numbers
- 600M users, 10M concurrent streams at peak
- Audio bandwidth: 10M streams × 320kbps average = 3.2 Tbps — served via CDN
- API requests: ~1M RPS (search, play events, playlist updates)
- Listening events: 600M users × 15 events/day = 9B events/day → Kafka topic → BigQuery for analytics
- Discover Weekly: 600M playlists computed in 6 hours every Sunday
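The figures above follow from two back-of-envelope formulas, shown here for checkability:

```python
def peak_audio_tbps(concurrent_streams: int, avg_kbps: float) -> float:
    """Aggregate egress: streams x bitrate, converted kbps -> Tbps."""
    return concurrent_streams * avg_kbps * 1e3 / 1e12

def daily_events(users: int, events_per_user_per_day: int) -> int:
    """Listening-event volume feeding the Kafka -> BigQuery pipeline."""
    return users * events_per_user_per_day
```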
Frequently Asked Questions
How does Spotify's Discover Weekly recommendation system work at scale?
Discover Weekly uses a three-stage pipeline that runs weekly for 600 million users. Stage 1: collaborative filtering via matrix factorization (ALS algorithm on Spark) processes the 600M user x 100M track play matrix. Users with similar listening patterns are identified; their unexplored tracks become candidates. Stage 2: content-based filtering uses audio fingerprinting and NLP on track metadata to extract acoustic features (tempo, energy, danceability). Tracks acoustically similar to the user's listening history are added as candidates. Stage 3: re-ranking applies a gradient boosted tree with features including audio similarity, recency, diversity (no two consecutive tracks from the same artist), and freshness (unexplored tracks score higher). The final playlist of 30 tracks is pre-computed and written to Cassandra. 600M write operations complete in 6 hours every Sunday — approximately 28,000 writes per second, within Cassandra cluster capacity.
How does audio streaming work differently from video streaming?
Audio and video streaming share the adaptive bitrate (ABR) architecture: files are segmented into chunks (10-second chunks for audio, 2-6 seconds for video), the client buffers ahead, and bitrate is adjusted dynamically based on measured bandwidth. The key differences: (1) Bandwidth: audio at 320kbps premium requires ~7.5MB per 3-minute track. Video at 4K HDR requires 15-25 Mbps — 50x more bandwidth. (2) Buffer strategy: audio clients prebuffer 30-60 seconds; video clients typically buffer 8-30 seconds. (3) Latency tolerance: audio listeners tolerate 1-2 second buffering; live sports video needs sub-5 second latency, requiring HLS-LL or DASH-CMAF. (4) DRM complexity: music has simpler DRM (per-track licenses) vs. video DRM, which must handle offline downloads of full TV seasons. (5) CDN caching: audio files are smaller and have higher reuse — the same track is played millions of times, pushing CDN cache hit rates above 90%.
How would you design offline music playback with DRM?
Offline playback requires downloading audio files and managing DRM licenses on-device. Architecture: the client stores downloaded audio in encrypted local storage (AES-256). The encryption key is derived from the user's credentials and a device-specific salt, so files cannot be moved to another device. Each downloaded track has an associated DRM license (Widevine on Android, FairPlay on iOS) specifying the license TTL (typically 30 days). On playback, the client checks local storage first: if the file exists and the license is valid, play locally. If the license expired, fetch a new license from the license server (requires internet connection). If offline and the license expired, playback is blocked. The sync service tracks which tracks are downloaded per device (device_id, track_id, download_quality, expires_at). When a user removes a download (or changes its quality), the client deletes the local file and revokes the license. The premium subscription check happens at license renewal, not at each play — allowing 30 days of offline access after a subscription lapses.