Live audio rooms (Clubhouse, Twitter Spaces, Discord Stage Channels) are an interesting mobile system design topic. They sit between video conferencing (real-time interactivity) and live streaming (large-scale fanout). The interview tests whether you understand the tradeoffs and can design at the right scale.
Functional requirements
- Host can start a room with title and topic
- Listeners can join (passive)
- Listeners can request to speak; host can promote them
- Speakers send audio in real time to all listeners
- Chat / reactions alongside
- Recording (optional)
Non-functional requirements
- Sub-300ms latency among speakers; listeners on the fanout path can tolerate a few seconds of delay
- Scale to thousands of listeners per room
- Reasonable battery for hour-long rooms
- Resilient to speaker network drops
Architecture
Two layers:
- Speaker layer: a handful of speakers send audio to a media server (SFU) over WebRTC for low latency.
- Listener layer: thousands of listeners receive a single mixed audio stream from a CDN. Higher latency, but much cheaper at scale.
Why split speaker and listener layers?
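As a rough rule of thumb, a single SFU room handles on the order of 100 WebRTC peers before per-connection overhead makes the SFU the bottleneck. With thousands of listeners, you need a different topology for the receive-only audience.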
Implementation:
- Speakers connect to SFU via WebRTC
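- A mixer alongside the SFU combines the speaker audio into one stream (a pure SFU only forwards packets; it does not mix)
- Mixed stream is sent to a transcoding/encoding service
- Encoded stream is segmented and published to a CDN as HLS/DASH
- Listeners pull from the CDN: typically 5–10s latency, scalable to millions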
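To see where the listener-side delay comes from, here is a rough latency budget for the segmented path. It is a back-of-the-envelope sketch, not a measurement: the field names and the example values (2-second segments, two buffered segments, half a second each for encoding and CDN delivery) are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HlsLatencyBudget:
    """Rough end-to-end latency model for the listener path (seconds)."""
    segment_duration: float    # length of each HLS/DASH media segment
    buffered_segments: int     # segments the player holds before starting playback
    encode_and_package: float  # mixer + encoder + packager delay (assumed)
    cdn_propagation: float     # origin -> edge -> client fetch delay (assumed)

    def total(self) -> float:
        # A segment cannot be published until it is complete, so one full
        # segment duration is paid at the packager, plus the player's buffer.
        player_buffer = self.segment_duration * self.buffered_segments
        return (self.segment_duration + player_buffer
                + self.encode_and_package + self.cdn_propagation)


budget = HlsLatencyBudget(segment_duration=2.0, buffered_segments=2,
                          encode_and_package=0.5, cdn_propagation=0.5)
print(f"estimated listener latency: {budget.total():.1f}s")  # 7.0s
```

Segment duration dominates the total, which is why shortening segments (or using a low-latency variant of the protocol) is the usual lever for reducing listener delay.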
The “raise hand to speak” flow
- Listener taps “raise hand”
- Server notifies host with the request
- Host approves; listener is promoted to speaker
- Listener’s app switches from CDN pull to a WebRTC connection and stops CDN playback, so the new speaker does not hear the room several seconds late
- Brief audio gap (1–3s) during the transition
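The flow above maps naturally onto a small client-side state machine. The sketch below is one possible shape, not a reference implementation: the state names, event strings, and the `Participant` class are all hypothetical.

```python
from enum import Enum, auto


class Role(Enum):
    LISTENER = auto()     # playing the CDN stream
    HAND_RAISED = auto()  # still on CDN, request pending with the host
    CONNECTING = auto()   # approved; WebRTC session being established
    SPEAKER = auto()      # publishing audio over WebRTC


# Allowed transitions: (current state, event) -> next state
TRANSITIONS = {
    (Role.LISTENER, "raise_hand"): Role.HAND_RAISED,
    (Role.HAND_RAISED, "lower_hand"): Role.LISTENER,
    (Role.HAND_RAISED, "host_rejected"): Role.LISTENER,
    (Role.HAND_RAISED, "host_approved"): Role.CONNECTING,
    (Role.CONNECTING, "webrtc_connected"): Role.SPEAKER,
    (Role.CONNECTING, "webrtc_failed"): Role.LISTENER,
    (Role.SPEAKER, "demoted"): Role.LISTENER,
}


class Participant:
    def __init__(self) -> None:
        self.role = Role.LISTENER

    def handle(self, event: str) -> Role:
        """Apply an event; events not valid in the current state are ignored."""
        self.role = TRANSITIONS.get((self.role, event), self.role)
        return self.role

    @property
    def transport(self) -> str:
        # Stay on CDN playback until WebRTC is actually up, to keep the gap short.
        return "webrtc" if self.role is Role.SPEAKER else "cdn"


p = Participant()
for event in ("raise_hand", "host_approved", "webrtc_connected"):
    p.handle(event)
print(p.role.name, p.transport)  # SPEAKER webrtc
```

Keeping the transitions in a table makes it easy to reject out-of-order events, such as a stale approval arriving after the user has lowered their hand.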
Audio capture and processing
- Echo cancellation, noise suppression on capture (platform APIs)
- Voice activity detection: stop transmitting audio while a speaker is silent (Opus supports this as discontinuous transmission, DTX)
- Codec: Opus, 32–64 kbps
- Mono channel; voice does not benefit from stereo
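To make the voice activity detection idea concrete, here is a deliberately simple energy-based detector with a hangover period. Production stacks use far more robust detectors; the threshold and hangover values below are illustrative assumptions, not tuned numbers, and the function names are hypothetical.

```python
import math
from typing import Iterable, Iterator, Sequence


def rms(frame: Sequence[int]) -> float:
    """Root-mean-square amplitude of one frame of 16-bit PCM samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def voice_activity(frames: Iterable[Sequence[int]],
                   threshold: float = 500.0,
                   hangover_frames: int = 10) -> Iterator[bool]:
    """Yield True for each frame that should be transmitted.

    The hangover keeps sending for a few frames after the energy drops,
    so the ends of words are not clipped.
    """
    remaining = 0
    for frame in frames:
        if rms(frame) >= threshold:
            remaining = hangover_frames
            yield True
        elif remaining > 0:
            remaining -= 1
            yield True
        else:
            yield False


loud = [4000, -4000] * 160  # 320 samples = 20 ms at 16 kHz
quiet = [10, -10] * 160
decisions = list(voice_activity([quiet, loud, quiet, quiet, quiet],
                                hangover_frames=2))
print(decisions)  # [False, True, True, True, False]
```

A fixed energy threshold breaks down in noisy environments, which is one reason to run detection after noise suppression rather than on the raw microphone signal.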
Moderation
- Host can mute speakers
- Host can remove participants
- Listeners can report
- Recording (with consent) for review
For abuse at scale: automated moderation in near real time (transcribe the audio, then classify the text), with automatic action such as muting on clear policy violations.
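The classify-then-act step might look roughly like the sketch below. Everything here is hypothetical: the `Verdict` type, the two thresholds, and `toy_classifier`, which stands in for a real speech-to-text plus classification model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Verdict:
    action: str   # "none", "flag_for_review", or "mute_speaker"
    score: float  # classifier confidence that the text violates policy, 0..1


def moderate(transcript: str,
             classify: Callable[[str], float],
             review_threshold: float = 0.5,
             mute_threshold: float = 0.9) -> Verdict:
    """Map a classifier score for one transcript window to a moderation action."""
    score = classify(transcript)
    if score >= mute_threshold:
        return Verdict("mute_speaker", score)
    if score >= review_threshold:
        return Verdict("flag_for_review", score)
    return Verdict("none", score)


def toy_classifier(text: str) -> float:
    """Placeholder standing in for a real model (assumption for this sketch)."""
    return 0.95 if "forbidden phrase" in text.lower() else 0.1


print(moderate("welcome to the room", toy_classifier).action)       # none
print(moderate("... FORBIDDEN PHRASE ...", toy_classifier).action)  # mute_speaker
```

Using two thresholds reserves automatic action for high-confidence cases and routes the ambiguous middle band to human review.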
Recording
Server-side recording of the mixed audio stream. Saved to S3 or equivalent. Optionally transcribed for searchable archives.
Battery
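- Listeners on CDN pull pay little beyond audio decoding and periodic segment fetches
- Speakers running WebRTC pay more: continuous capture, encoding, and an always-active network connection
- Background audio (the audio background mode on iOS, a foreground service on Android) keeps playback running with the screen off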
Why did Clubhouse fade?
Engineering was solid; product fit was the issue. Live audio competes with podcasts (asynchronous) and video (visual). Twitter Spaces survives because it integrates with the existing social graph.
Frequently Asked Questions
Why not use WebRTC for everyone?
SFU and TURN costs scale with peer count. WebRTC for 1000 peers in one room is expensive. CDN-pull is far cheaper for the listener layer.
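A quick calculation shows what "scales with peer count" means for the media tier. This is a sketch under stated assumptions: 32 kbps Opus, 20 ms frames (so 50 packets per second per stream), and an SFU that forwards every active speaker to every listener. The function name and its parameters are hypothetical, and no prices are modeled.

```python
def sfu_fanout(listeners: int, active_speakers: int = 1,
               bitrate_kbps: int = 32, frame_ms: int = 20) -> dict:
    """Egress load on the media tier if every listener is a WebRTC subscriber."""
    packets_per_stream = 1000 // frame_ms  # 20 ms frames -> 50 packets/s
    streams = listeners * active_speakers  # each speaker is forwarded to each listener
    return {
        "egress_mbps": streams * bitrate_kbps / 1000,
        "packets_per_second": streams * packets_per_stream,
        "connections": listeners,  # each with its own encrypted session state
    }


print(sfu_fanout(1000, active_speakers=3))
# {'egress_mbps': 96.0, 'packets_per_second': 150000, 'connections': 1000}
```

The raw bandwidth is modest for audio; the heavier cost is that every packet is encrypted and sent per connection, whereas a CDN serves the same cached segment to every listener.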
How does the “promote to speaker” feel instant?
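Pre-warm the WebRTC connection in the background while the user's hand is raised. When the host promotes them, the connection is already established, so the app only has to start publishing audio and the swap is faster.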
Can the same architecture handle live video?
Yes. Twitter Spaces with cameras and Discord Stage with video are examples. The same speaker/listener split applies, with video tracks added and correspondingly higher bitrates on both layers.