Live audio platforms like Twitter Spaces and Clubhouse create real-time audio rooms where speakers broadcast to thousands of listeners. Designing a live audio system tests your understanding of real-time audio transport, large-scale fan-out (few speakers to many listeners), moderation tools, and audio recording. This guide covers the parts of the architecture that differ from video conferencing (see our Zoom guide), specifically the asymmetric speaker/listener model and the social discovery layer.
Audio Room Architecture
An audio room has three participant types: the host (created the room, full control), speakers (can talk, promoted by the host), and listeners (can only listen, can request to speak). The model is asymmetric: 3-10 speakers broadcast to 100-10,000+ listeners. This is fundamentally different from a video call, where all participants are equal.

Speaker path (WebRTC): each speaker establishes a WebRTC connection to an SFU (Selective Forwarding Unit). The media server receives audio from all speakers and can deliver it in one of two ways:
(1) Server-side mixing: the speaker streams are decoded and mixed into one stream. Strictly speaking this is an MCU-style function, since a pure SFU only forwards packets. Each listener receives a single stream, which reduces client bandwidth and CPU.
(2) Client-side mixing: the SFU forwards individual speaker streams and each listener mixes locally. This allows per-speaker volume control but requires more bandwidth.
For rooms with 3-5 speakers, client-side mixing is feasible. For 10+ speakers, server-side mixing is preferred.

Listener path: listeners do not need a full bidirectional WebRTC session. They receive the audio over a one-way, low-latency delivery path. Options:
(1) WebSocket-based audio streaming (sub-second latency).
(2) HLS with very short segments (1-2 seconds; higher latency but simpler).
(3) WebRTC in receive-only mode (lowest latency but more complex at scale).
The choice depends on latency requirements. Twitter Spaces uses WebRTC for speakers and a low-latency streaming protocol for listeners.
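The server-side mixing step described above can be illustrated with a minimal sketch. It assumes each speaker's audio has already been decoded to equal-length frames of 16-bit PCM samples; a real pipeline would decode Opus before mixing and re-encode afterwards. The names `mix_frames` and `choose_mixing`, and the cutoff of 5 speakers, are illustrative assumptions rather than part of any particular media server.

```python
from typing import Sequence

INT16_MIN, INT16_MAX = -32768, 32767


def mix_frames(frames: Sequence[Sequence[int]]) -> list[int]:
    """Mix equal-length 16-bit PCM frames by summing samples and clipping."""
    if not frames:
        return []
    length = len(frames[0])
    if any(len(frame) != length for frame in frames):
        raise ValueError("all frames must have the same number of samples")
    mixed = []
    for samples in zip(*frames):
        total = sum(samples)
        # Clip to the int16 range so loud overlapping speech does not wrap around.
        mixed.append(max(INT16_MIN, min(INT16_MAX, total)))
    return mixed


def choose_mixing(speaker_count: int, client_side_max: int = 5) -> str:
    """Pick a mixing strategy; the cutoff of 5 speakers is an assumed threshold."""
    return "client" if speaker_count <= client_side_max else "server"


if __name__ == "__main__":
    frame = mix_frames([[1000, -2000], [500, 30000], [100, 5000]])
    print(frame)              # second sample is clipped at 32767
    print(choose_mixing(12))  # server
```

Plain summation with clipping is the simplest possible mixer. Production mixers usually add gain control or only mix the few loudest speakers, which this sketch leaves out.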
Scaling to Thousands of Listeners
A popular Space can have 50,000+ listeners, and a single SFU cannot realistically hold 50,000 individual connections. Scaling layers:
(1) SFU tier: handles speakers (10 connections max) and produces the mixed audio stream.
(2) Distribution tier: a tree of relay servers that fans out the audio stream. The SFU sends to 10 relay servers. Each of those sends to 10 more, giving 100 leaf relays. Each leaf relay serves 500 listeners. Total: 10 * 10 * 500 = 50,000 listeners with 3 levels of fan-out.
(3) CDN fallback: for very large rooms (100K+ listeners), use a CDN with HLS/DASH. Latency increases to 3-5 seconds, but capacity is then limited mainly by the CDN rather than by your own servers. Interactive listeners (who might be promoted to speaker) stay on the low-latency WebRTC/WebSocket path. Passive listeners use the CDN path.

Adaptive: start with direct SFU connections for small rooms (< 100 listeners), switch to the relay tree as the room grows beyond that, and move passive listeners to the CDN once the room outgrows the relay tree (the 100K+ case above). The system adapts dynamically as listener count changes.

Audio quality: Opus codec at 32-64 kbps per speaker (speech-optimized, much lower bandwidth than music). The mixed stream at 64 kbps serves all listeners. Total egress: 50,000 listeners * 64 kbps = 3.2 Gbps. Spread across the leaf relays, each one handles 500 listeners * 64 kbps = 32 Mbps (trivial for modern servers).
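The capacity and bandwidth arithmetic for a relay tree is easy to check in code. The sketch below assumes the capacity of a tree is the product of its per-level fan-outs (for instance 10 x 10 x 500 = 50,000) and that every listener receives one mixed stream at a fixed bitrate. The function names are hypothetical helpers for this guide.

```python
import math


def tree_capacity(fanouts: list[int]) -> int:
    """Listeners reachable when each level multiplies the previous one."""
    return math.prod(fanouts)


def leaf_relays_needed(listeners: int, listeners_per_relay: int = 500) -> int:
    """Number of leaf relays required for a given audience size."""
    return math.ceil(listeners / listeners_per_relay)


def egress_mbps(listeners: int, bitrate_kbps: int = 64) -> float:
    """Outbound bandwidth in Mbps for one mixed stream per listener."""
    return listeners * bitrate_kbps / 1000


if __name__ == "__main__":
    print(tree_capacity([10, 10, 500]))  # total listeners for the whole tree
    print(leaf_relays_needed(50_000))    # leaf relays at 500 listeners each
    print(egress_mbps(50_000))           # whole room, in Mbps
    print(egress_mbps(500))              # one leaf relay, in Mbps
```

Keeping the fan-outs as a list makes it simple to test other shapes, such as a deeper tree with smaller fan-out per relay.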
Room Management and Moderation
Room lifecycle:
(1) The host creates a room (title, topic, scheduled start time or start now). Room record: room_id, host_id, title, topic, status (scheduled/live/ended), speaker_ids, listener_count, created_at, started_at, ended_at.
(2) The room goes live. The host opens their microphone.
(3) Listeners join. The room appears in discovery (follower notifications, topic feeds).
(4) Hand raise: a listener requests to speak. The host sees the request and can approve it (promoting the listener to speaker) or ignore it. Promoted listeners establish a WebRTC connection and can unmute.
(5) The host can mute or remove speakers, end the room, or pin/unpin the room on their profile.

Moderation tools:
(1) Mute speaker: the host or a co-host can mute any speaker. This is enforced server-side by no longer forwarding that speaker's audio stream.
(2) Remove participant: the participant is kicked from the room and blocked from rejoining.
(3) Report: listeners and speakers can report the room or specific participants for policy violations.
(4) Automated moderation: speech-to-text (ASR) runs on the mixed audio stream in real time. An NLP classifier checks the transcript for policy violations (hate speech, threats, misinformation). Flagged segments are surfaced to human moderators.

Latency: ASR plus classification takes 2-5 seconds. By the time content is flagged, listeners have already heard it, so moderation is reactive, not preventive. This is a fundamental challenge for live audio.
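The lifecycle above maps naturally onto a small state machine. The following is a minimal in-memory sketch, not a production design: it keeps a subset of the room record fields, tracks listeners as a set (the record in the text stores only `listener_count`), and restricts moderation to the host. Method names such as `raise_hand` and `approve` are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class RoomStatus(Enum):
    SCHEDULED = "scheduled"
    LIVE = "live"
    ENDED = "ended"


@dataclass
class Room:
    room_id: str
    host_id: str
    title: str
    status: RoomStatus = RoomStatus.SCHEDULED
    speaker_ids: set = field(default_factory=set)
    listener_ids: set = field(default_factory=set)  # assumption: kept in memory
    hand_raises: list = field(default_factory=list)
    blocked_ids: set = field(default_factory=set)

    @property
    def listener_count(self) -> int:
        return len(self.listener_ids)

    def _require_host(self, actor_id: str) -> None:
        if actor_id != self.host_id:
            raise PermissionError("only the host can do this")

    def go_live(self, actor_id: str) -> None:
        self._require_host(actor_id)
        if self.status is not RoomStatus.SCHEDULED:
            raise ValueError("room is not in the scheduled state")
        self.status = RoomStatus.LIVE
        self.speaker_ids.add(self.host_id)

    def join(self, user_id: str) -> None:
        if self.status is not RoomStatus.LIVE:
            raise ValueError("room is not live")
        if user_id in self.blocked_ids:
            raise PermissionError("user was removed from this room")
        self.listener_ids.add(user_id)

    def raise_hand(self, user_id: str) -> None:
        # Only current listeners can request to speak; duplicates are ignored.
        if user_id in self.listener_ids and user_id not in self.hand_raises:
            self.hand_raises.append(user_id)

    def approve(self, actor_id: str, user_id: str) -> None:
        self._require_host(actor_id)
        if user_id not in self.hand_raises:
            raise ValueError("no pending hand raise for this user")
        self.hand_raises.remove(user_id)
        self.listener_ids.discard(user_id)
        self.speaker_ids.add(user_id)

    def remove(self, actor_id: str, user_id: str) -> None:
        self._require_host(actor_id)
        self.speaker_ids.discard(user_id)
        self.listener_ids.discard(user_id)
        if user_id in self.hand_raises:
            self.hand_raises.remove(user_id)
        self.blocked_ids.add(user_id)  # blocked from rejoining

    def end(self, actor_id: str) -> None:
        self._require_host(actor_id)
        self.status = RoomStatus.ENDED
```

A typical flow is `go_live`, then `join`, `raise_hand`, and `approve`. After `remove`, a further `join` by the same user raises `PermissionError`, which models the "blocked from rejoining" rule.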
Recording and Replay
Recording a live audio room: the SFU sends the mixed audio stream to a recording service, which encodes it as an audio file (AAC/MP3) and stores it in S3. After the room ends, the recording is available for replay. The host can choose to publish or delete it.

Replay architecture: the recorded audio file is served via CDN like a podcast. It is seekable (standard audio streaming with byte-range requests). The recording includes metadata: participant list, timestamps for speaker changes, and chapter markers (if the host set them).

Transcript: the ASR output (generated in real time for moderation) is cleaned up and attached to the recording. This enables:
(1) Search: find Spaces discussing a specific topic by searching transcripts.
(2) Accessibility: deaf and hard-of-hearing users can read the transcript.
(3) Highlights: extract short audio clips by selecting text in the transcript and cutting the matching time range.

Live captions: the real-time ASR output is displayed as live captions during the room, sent to listeners via a text channel alongside the audio stream. Latency: 2-3 seconds behind the audio. Accuracy: 85-95% for English (lower for other languages, accented speech, and domain-specific terminology).
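Transcript search and highlights both depend on the transcript carrying timestamps. As a rough sketch, assume the cleaned-up ASR output is stored as a list of timed segments. The `Segment` type, the `search` and `clip_range` helpers, and the half-second padding are all hypothetical choices for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    start_s: float   # offset from the start of the recording, in seconds
    end_s: float
    speaker_id: str
    text: str


def search(segments: list[Segment], query: str) -> list[int]:
    """Return indexes of segments whose text contains the query (case-insensitive)."""
    needle = query.lower()
    return [i for i, seg in enumerate(segments) if needle in seg.text.lower()]


def clip_range(segments: list[Segment], first: int, last: int,
               padding_s: float = 0.5) -> tuple[float, float]:
    """Map a selected run of transcript segments to an audio time range."""
    if not 0 <= first <= last < len(segments):
        raise IndexError("selection is outside the transcript")
    start = max(0.0, segments[first].start_s - padding_s)
    end = segments[last].end_s + padding_s
    return start, end


if __name__ == "__main__":
    transcript = [
        Segment(0.0, 4.0, "host", "Welcome everyone"),
        Segment(4.0, 9.5, "guest", "Let's talk about WebRTC scaling"),
        Segment(9.5, 15.0, "host", "Relay trees help a lot"),
    ]
    hits = search(transcript, "webrtc")
    print(hits)
    print(clip_range(transcript, hits[0], 2))
```

The returned time range can then be handed to whatever tool cuts the audio. The cutting step itself is outside the scope of this sketch.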
Discovery and Social Graph
How users find live rooms:
(1) Following: when someone you follow starts or joins a Space, you see a notification and a purple ring around their avatar. This is the primary discovery mechanism.
(2) Topic-based: rooms tagged with topics (Technology, Music, Sports) appear in topic feeds. Users follow topics to see relevant rooms.
(3) Algorithmic recommendation: “Spaces you might like” based on topics of interest (inferred from follow graph and engagement), rooms joined by similar users (collaborative filtering), and room popularity (listener count, engagement signals).
(4) Scheduled rooms: hosts can schedule rooms in advance. Followers see scheduled rooms and can set reminders. The scheduled room appears in a calendar-like discovery UI.

Notification strategy: be selective to avoid notification fatigue. Do not notify for every Space from every followed account. Prioritize close connections (frequent interactions), popular rooms (high listener count), and topics the user has engaged with before.

The discovery feed ranks live rooms by listener count, speaker notability (verified accounts, high follower count), topic relevance, and network signals (how many of your connections are in the room).
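One simple way to combine the ranking signals above is a weighted sum. The sketch below is only an illustration: the weights, the log scaling of listener count, and the `RoomSignals` fields are assumptions, and a real feed would more likely use a learned model.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class RoomSignals:
    room_id: str
    listener_count: int
    notable_speakers: int      # e.g. verified or high-follower speakers
    topic_relevance: float     # 0.0 to 1.0 for the viewing user
    connections_in_room: int   # how many of the user's connections are present


# Assumed weights, chosen only to make the example concrete.
WEIGHTS = {"listeners": 1.0, "notability": 0.5, "topic": 2.0, "network": 1.5}


def score(room: RoomSignals) -> float:
    """Weighted sum of the four ranking signals."""
    return (
        # Log scaling keeps very large rooms from drowning out everything else.
        WEIGHTS["listeners"] * math.log10(1 + room.listener_count)
        + WEIGHTS["notability"] * room.notable_speakers
        + WEIGHTS["topic"] * room.topic_relevance
        + WEIGHTS["network"] * room.connections_in_room
    )


def rank(rooms: list[RoomSignals]) -> list[RoomSignals]:
    """Order rooms from highest to lowest score."""
    return sorted(rooms, key=score, reverse=True)


if __name__ == "__main__":
    rooms = [
        RoomSignals("big", 10_000, 0, 0.1, 0),
        RoomSignals("friends", 50, 0, 0.9, 3),
    ]
    print([room.room_id for room in rank(rooms)])
```

With these assumed weights, a small room full of the user's connections outranks a large but unrelated one, which matches the emphasis on following as the primary discovery mechanism.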