Design a Live Audio Room App: Clubhouse / Twitter Spaces

Live audio rooms (Clubhouse, Twitter Spaces, Discord Stage Channels) are an interesting mobile system design topic. They sit between video conferencing (real-time interactivity) and live streaming (large-scale fanout). The interview tests whether you understand the tradeoffs and can design at the right scale.

Functional requirements

Host can start a room with title and topic
Listeners can join (passive)
Listeners can request to speak; host can promote them
Speakers send audio in real time to all listeners
Chat / reactions alongside
Recording (optional)

Non-functional

Sub-300ms latency between speakers; passive listeners can tolerate a few seconds of delay
Scale to thousands of listeners per room
Reasonable battery for hour-long rooms
Resilient to speaker network drops

Architecture

Two components:

Speaker layer: handful of speakers send audio to a media server (SFU). WebRTC for low latency.
Listener layer: thousands of listeners receive a single mixed audio stream from a CDN fed by an origin server. Higher latency but cheaper at scale.

Why split speaker and listener layers?

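As a rough rule of thumb, a single WebRTC room handles on the order of 100 peers before the SFU becomes the bottleneck; the exact ceiling depends on the server and the bitrate. With thousands of listeners, you need a different topology.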
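A back-of-envelope calculation helps show why the all-WebRTC topology gets expensive. The sketch below assumes an SFU that forwards every speaker stream to every listener, with no mixing and no active-speaker selection. The function names, bitrates, and participant counts are illustrative assumptions, not measurements.

```python
def sfu_egress_mbps(speakers: int, listeners: int, kbps_per_stream: int) -> float:
    """Egress bandwidth if one SFU forwards every speaker stream to every listener.

    Assumption: no server-side mixing and no last-N speaker selection,
    both of which would reduce this number.
    """
    streams_out = speakers * listeners  # each listener receives one stream per speaker
    return streams_out * kbps_per_stream / 1000


def origin_egress_mbps(kbps_per_stream: int, cdn_edges: int) -> float:
    """Egress from the origin when one mixed stream is pulled by CDN edge nodes."""
    return cdn_edges * kbps_per_stream / 1000


if __name__ == "__main__":
    # 5 speakers at 32 kbps forwarded to 5000 listeners
    print(sfu_egress_mbps(speakers=5, listeners=5000, kbps_per_stream=32))  # 800.0
    # one 64 kbps mixed stream pulled by 50 CDN edges (hypothetical edge count)
    print(origin_egress_mbps(kbps_per_stream=64, cdn_edges=50))             # 3.2
```

Under these assumptions the media server's egress grows with speakers × listeners, while the origin's egress grows only with the number of CDN edges that pull from it.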

Implementation:

Speakers connect to SFU via WebRTC
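A mixer alongside the SFU combines the speaker audio into one stream (an SFU on its own only forwards streams)
Mixed stream is sent to a transcoding/encoding service
Encoded stream is packaged as HLS/DASH segments and published to the CDN
Listeners pull from the CDN: 5–10s latency, scalable to millions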
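The 5–10s figure follows from how segmented streaming works: the player buffers several whole segments before it starts playback. The sketch below is a simplified latency budget; the class name, field names, and example values are assumptions chosen for illustration, and real players and CDNs vary.

```python
from dataclasses import dataclass


@dataclass
class HlsLatencyBudget:
    segment_seconds: float    # duration of each media segment
    buffered_segments: int    # segments the player holds before starting playback
    encode_seconds: float     # mixing + encoding + packaging delay
    cdn_seconds: float        # origin-to-edge propagation and fetch time

    def total_seconds(self) -> float:
        # The player waits for N full segments, so segment duration
        # dominates the end-to-end delay.
        return (self.encode_seconds
                + self.segment_seconds * self.buffered_segments
                + self.cdn_seconds)


budget = HlsLatencyBudget(segment_seconds=2.0, buffered_segments=3,
                          encode_seconds=0.5, cdn_seconds=0.5)
print(budget.total_seconds())  # 7.0
```

Shorter segments reduce the delay at the cost of more requests per listener, which is the main knob if the listener latency needs to come down.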

The “raise hand to speak” flow

Listener taps “raise hand”
Server notifies host with the request
Host approves; listener is promoted to speaker
Listener’s app switches from CDN-pull to WebRTC speaker connection
Brief audio gap (1–3s) during the transition
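The flow above can be modeled as a small client-side state machine. This is a minimal sketch: the role and event names are hypothetical, and the "lower_hand", "host_rejects", and "demoted" events are assumptions added to make the machine complete.

```python
from enum import Enum, auto


class Role(Enum):
    LISTENER = auto()      # pulling the mixed stream from the CDN
    HAND_RAISED = auto()   # still on the CDN, waiting for the host
    SPEAKER = auto()       # connected to the SFU over WebRTC


# Allowed transitions: (current role, event) -> next role
TRANSITIONS = {
    (Role.LISTENER, "raise_hand"): Role.HAND_RAISED,
    (Role.HAND_RAISED, "lower_hand"): Role.LISTENER,
    (Role.HAND_RAISED, "host_rejects"): Role.LISTENER,
    (Role.HAND_RAISED, "host_approves"): Role.SPEAKER,
    (Role.SPEAKER, "demoted"): Role.LISTENER,
}


def transport_for(role: Role) -> str:
    """Only speakers use WebRTC; everyone else stays on CDN-pull."""
    return "webrtc" if role is Role.SPEAKER else "cdn"


class Participant:
    def __init__(self) -> None:
        self.role = Role.LISTENER

    def handle(self, event: str) -> bool:
        """Apply an event; return True if the audio transport must be swapped."""
        next_role = TRANSITIONS.get((self.role, event))
        if next_role is None:
            return False  # ignore events that are invalid in the current state
        swapped = transport_for(next_role) != transport_for(self.role)
        self.role = next_role
        return swapped
```

Returning a "swap needed" flag keeps the transport change (and its brief audio gap) confined to the two transitions that actually cross between CDN-pull and WebRTC.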

Audio capture and processing

Echo cancellation, noise suppression on capture (platform APIs)
Voice activity detection: stop sending audio frames while a speaker is silent (Opus supports this through discontinuous transmission, DTX)
Codec: Opus, 32–64 kbps
Mono channel; voice does not benefit from stereo
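To make the voice activity detection step concrete, here is a deliberately simple energy-based gate. It is only a sketch: production apps would normally rely on the platform or codec VAD, and the threshold, hangover length, and class name here are assumed values for illustration.

```python
import math
from typing import Sequence


def rms(frame: Sequence[int]) -> float:
    """Root-mean-square amplitude of one frame of 16-bit PCM samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


class EnergyVad:
    """Marks a frame as speech when its RMS reaches a threshold.

    A hangover keeps the gate open for a few frames after speech ends
    so that word endings are not clipped.
    """

    def __init__(self, threshold: float = 500.0, hangover_frames: int = 10) -> None:
        self.threshold = threshold            # assumed value; tune per device
        self.hangover_frames = hangover_frames
        self._remaining = 0

    def should_send(self, frame: Sequence[int]) -> bool:
        if rms(frame) >= self.threshold:
            self._remaining = self.hangover_frames
            return True
        if self._remaining > 0:
            self._remaining -= 1
            return True
        return False
```

Frames for which `should_send` returns False can simply be dropped before encoding, which saves both uplink bandwidth and battery while a speaker is silent.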

Moderation

Host can mute speakers
Host can remove participants
Listeners can report
Recording (with consent) for review

For abuse: real-time AI moderation (transcribe + classify); automatic action on policy violations.
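The transcribe-then-classify pipeline might look like the sketch below. Everything here is hypothetical: `transcribe` and `classify` stand in for whatever speech-to-text and policy models a team actually uses, and the thresholds and action names are assumed values.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    action: str       # "none", "flag_for_review", or "mute_speaker"
    score: float      # classifier's violation score in [0, 1]
    transcript: str


def moderate_chunk(
    audio_chunk: bytes,
    transcribe: Callable[[bytes], str],   # hypothetical speech-to-text function
    classify: Callable[[str], float],     # hypothetical policy classifier
    flag_threshold: float = 0.5,          # assumed value
    mute_threshold: float = 0.9,          # assumed value
) -> Verdict:
    """Transcribe one chunk of speaker audio and map its score to an action."""
    text = transcribe(audio_chunk)
    score = classify(text)
    if score >= mute_threshold:
        action = "mute_speaker"      # automatic action on clear violations
    elif score >= flag_threshold:
        action = "flag_for_review"   # uncertain cases go to a human
    else:
        action = "none"
    return Verdict(action, score, text)
```

Using two thresholds keeps automatic action for high-confidence cases and routes borderline ones to human review.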

Recording

Server-side recording of the mixed audio stream. Saved to S3 or equivalent. Optionally transcribed for searchable archives.

Battery

Listeners on CDN-pull have a low battery cost (periodic segment fetches plus audio decode and playback)
Speakers running WebRTC are heavier
Background mode supports listening with screen off

Why did Clubhouse fade?

Engineering was solid; product fit was the issue. Live audio competes with podcasts (asynchronous) and video (visual). Twitter Spaces survives because it integrates with the existing social graph.

Frequently Asked Questions

Why not use WebRTC for everyone?

SFU and TURN costs scale with peer count. WebRTC for 1000 peers in one room is expensive. CDN-pull is far cheaper for the listener layer.

How does the “promote to speaker” feel instant?

Pre-warm the WebRTC connection in the background while the user has their hand raised. When the host promotes them, the connection is already established, so the swap is faster.
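One way to structure the pre-warm is to start the connection as a background task when the hand goes up and await it on promotion. The sketch below uses asyncio with a simulated connection; `connect_speaker_transport` and `HandRaiseSession` are hypothetical stand-ins, not a real WebRTC API.

```python
import asyncio
from typing import Optional


async def connect_speaker_transport() -> str:
    """Stand-in for the WebRTC connection setup with the SFU (hypothetical)."""
    await asyncio.sleep(0.05)  # simulated setup delay
    return "webrtc-connection"


class HandRaiseSession:
    def __init__(self) -> None:
        self._prewarm: Optional[asyncio.Task] = None

    def raise_hand(self) -> None:
        # Start connecting immediately, without publishing any audio yet.
        if self._prewarm is None:
            self._prewarm = asyncio.create_task(connect_speaker_transport())

    async def lower_hand(self) -> None:
        # Tear down the speculative connection if the request is withdrawn.
        if self._prewarm is not None:
            self._prewarm.cancel()
            try:
                await self._prewarm
            except asyncio.CancelledError:
                pass
            self._prewarm = None

    async def on_promoted(self) -> str:
        # If pre-warming already finished, this returns without waiting.
        if self._prewarm is None:
            self.raise_hand()
        conn = await self._prewarm
        self._prewarm = None
        return conn


async def demo() -> str:
    session = HandRaiseSession()
    session.raise_hand()
    await asyncio.sleep(0.1)  # host takes a moment to approve
    return await session.on_promoted()
```

The tradeoff is that every raised hand holds a speculative connection on the media server, so a room with many pending requests may want to cap how many are pre-warmed.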

Can the same architecture handle live video?

Yes. Twitter Spaces with cameras and Discord Stage with video are examples: the same speaker/listener split applies, with video added.