Live audio platforms like Twitter Spaces and Clubhouse create real-time audio rooms where speakers broadcast to thousands of listeners. Designing a live audio system tests your understanding of real-time audio transport, large-scale fan-out (few speakers to many listeners), moderation tools, and audio recording. This guide covers where the architecture differs from video conferencing (see our Zoom guide): specifically the asymmetric speaker/listener model and the social discovery layer.
Audio Room Architecture
An audio room has three participant types: the host (created the room, full control), speakers (can talk; promoted by the host), and listeners (can only listen; may request to speak). The model is asymmetric: 3-10 speakers broadcast to 100-10,000+ listeners. This is fundamentally different from a video call, where all participants are equal.

Speaker path (WebRTC): each speaker establishes a WebRTC connection to an SFU (Selective Forwarding Unit), which receives audio from all speakers. Mixing can happen server-side (the SFU mixes all speaker streams into one; each listener receives a single stream, reducing client bandwidth and CPU) or client-side (the SFU forwards individual speaker streams and each listener mixes locally, which allows per-speaker volume control but requires more bandwidth). For rooms with 3-5 speakers, client-side mixing is feasible; for 10+ speakers, server-side mixing is preferred.

Listener path: listeners do NOT use WebRTC. They receive the mixed audio stream via a low-latency streaming protocol: (1) WebSocket-based audio streaming (sub-second latency). (2) HLS with very short segments (1-2 seconds; higher latency but simpler). (3) WebRTC in receive-only mode (lowest latency but more complex at scale). The choice depends on latency requirements. Twitter Spaces uses WebRTC for speakers and a low-latency streaming protocol for listeners.
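At its core, server-side mixing is summing time-aligned PCM frames from each speaker and clipping to the sample range. A minimal sketch in Python, assuming 16-bit signed PCM samples already decoded from Opus (the function name and frame representation are illustrative, not a real SFU API):

```python
def mix_frames(speaker_frames):
    """Mix equal-length 16-bit PCM frames (one list of samples per speaker)
    by summing sample-wise and clipping to the signed 16-bit range."""
    mixed = []
    for samples in zip(*speaker_frames):
        total = sum(samples)
        # Clip to avoid integer overflow artifacts when speakers overlap loudly.
        mixed.append(max(-32768, min(32767, total)))
    return mixed
```

In production the SFU would do this per 20 ms Opus frame after jitter-buffer alignment, and real mixers apply gain normalization rather than hard clipping.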
Scaling to Thousands of Listeners
A popular Space can have 50,000+ listeners, and the SFU cannot maintain 50,000 individual connections. Scaling layers: (1) SFU tier: handles speakers (10 connections max) and produces the mixed audio stream. (2) Distribution tier: a tree of relay servers that fans out the audio stream. The SFU sends to 10 relay servers, each of which sends to 10 leaf relays, and each leaf relay serves 500 listeners: 10 × 10 × 500 = 50,000 listeners with three levels of fan-out. (3) CDN fallback: for very large rooms (100K+ listeners), use a CDN with HLS/DASH. Latency increases to 3-5 seconds but capacity is effectively unbounded. Interactive listeners (who might be promoted to speaker) stay on the low-latency WebRTC/WebSocket path; passive listeners use the CDN path.

Adaptive scaling: start with direct SFU connections for small rooms (< 100 listeners), move to the relay tree at 100-1,000, and move to the CDN beyond that. The system adapts dynamically as the listener count changes.

Audio quality: Opus codec at 32-64 kbps per speaker (speech-optimized, much lower bandwidth than music). The mixed stream at 64 kbps serves all listeners. Total bandwidth: 50,000 listeners × 64 kbps = 3.2 Gbps, distributed across relay servers so that each handles 500 listeners × 64 kbps = 32 Mbps (trivial for modern servers).
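The fan-out and bandwidth arithmetic above can be checked with a short sketch (the tree shape and bitrate are the figures from this section, not a prescribed configuration):

```python
def tree_capacity(branching: int, relay_levels: int, listeners_per_leaf: int) -> int:
    """Listeners served by a relay tree: each relay level multiplies fan-out
    by `branching`, and each leaf relay serves `listeners_per_leaf` listeners."""
    return branching ** relay_levels * listeners_per_leaf

def leaf_bandwidth_mbps(listeners_per_leaf: int, stream_kbps: int) -> float:
    """Egress per leaf relay, all listeners receiving the same mixed stream."""
    return listeners_per_leaf * stream_kbps / 1000

# SFU -> 10 relays -> 10 leaves each -> 500 listeners per leaf
capacity = tree_capacity(10, 2, 500)       # 50,000 listeners
per_leaf = leaf_bandwidth_mbps(500, 64)    # 32 Mbps per leaf relay
total_gbps = capacity * 64 / 1_000_000     # 3.2 Gbps aggregate
```

Growing the room one level deeper (three relay levels) would multiply capacity by another factor of 10, which is why the tree hands off to a CDN rather than deepening indefinitely.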
Room Management and Moderation
Room lifecycle: (1) The host creates a room (title, topic, scheduled start time or start now). Room record: room_id, host_id, title, topic, status (scheduled/live/ended), speaker_ids, listener_count, created_at, started_at, ended_at. (2) The room goes live and the host opens their microphone. (3) Listeners join. The room appears in discovery (follower notifications, topic feeds). (4) Hand raise: a listener requests to speak. The host sees the request and can approve it (promote to speaker) or ignore it. Promoted listeners establish a WebRTC connection and can unmute. (5) The host can mute/remove speakers, end the room, or pin/unpin the room on their profile.

Moderation tools: (1) Mute speaker: the host or a co-host can mute any speaker; server-side, the SFU stops forwarding that speaker's audio stream. (2) Remove participant: kick from the room, blocked from rejoining. (3) Report: listeners and speakers can report the room or specific participants for policy violations. (4) Automated moderation: speech-to-text (ASR) runs on the mixed audio stream in real time, and NLP classifies the transcript for policy violations (hate speech, threats, misinformation). Flagged segments are surfaced to human moderators. Latency: ASR plus classification takes 2-5 seconds, so by the time content is flagged it has already been heard by listeners. Moderation is reactive, not preventive, which is a fundamental challenge for live audio.
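The lifecycle and moderation rules above amount to a small state machine over participant roles. A hedged sketch (the class and method names are invented for illustration; a real service would persist this state and authorize co-hosts too):

```python
class AudioRoom:
    """Tracks participant roles and enforces host-only moderation actions."""

    def __init__(self, host_id):
        self.host_id = host_id
        self.speakers = {host_id}   # the host is always a speaker
        self.raised_hands = set()
        self.removed = set()        # removed users are blocked from rejoining

    def raise_hand(self, user_id):
        if user_id not in self.speakers and user_id not in self.removed:
            self.raised_hands.add(user_id)

    def approve_speaker(self, actor_id, user_id):
        # Only the host may promote; the listener then opens a WebRTC connection.
        if actor_id == self.host_id and user_id in self.raised_hands:
            self.raised_hands.discard(user_id)
            self.speakers.add(user_id)

    def remove_participant(self, actor_id, user_id):
        if actor_id == self.host_id:
            self.speakers.discard(user_id)
            self.raised_hands.discard(user_id)
            self.removed.add(user_id)
```

Keeping the role checks server-side matters: a mute or removal must take effect at the SFU, not merely in the client UI.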
Recording and Replay
Recording a live audio room: the SFU sends the mixed audio stream to a recording service, which encodes it as an audio file (AAC/MP3) and stores it in S3. After the room ends, the recording is available for replay; the host can choose to publish or delete it.

Replay architecture: the recorded audio file is served via CDN like a podcast and is seekable (standard audio streaming with byte-range requests). The recording includes metadata: the participant list, timestamps for speaker changes, and chapter markers (if the host set them).

Transcript: the ASR output (generated in real time for moderation) is cleaned up and attached to the recording. This enables: (1) Search: find Spaces discussing a specific topic by searching transcripts. (2) Accessibility: deaf and hard-of-hearing users can read the transcript. (3) Highlights: extract short clips by selecting text in the transcript (text-selected audio segments).

Live captions: the real-time ASR output is displayed as live captions during the room, sent to listeners via a text channel alongside the audio stream. Latency: 2-3 seconds behind the audio. Accuracy: 85-95% for English (lower for other languages, accented speech, and domain-specific terminology).
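Transcript-backed search and highlights reduce to a lookup over timestamped ASR segments: match text, return time ranges, then cut the audio at those offsets. A minimal sketch (the segment tuple shape is an assumption, not a stated schema):

```python
def find_clips(segments, query):
    """segments: list of (start_sec, end_sec, text) from the ASR transcript.
    Returns (start, end) time ranges whose text mentions the query,
    which a clip service can then extract from the recorded audio file."""
    q = query.lower()
    return [(start, end) for start, end, text in segments if q in text.lower()]
```

The same segment index serves all three features: full-text search over rooms, caption display during replay, and highlight extraction.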
Discovery and Social Graph
How users find live rooms: (1) Following: when someone you follow starts or joins a Space, you see a notification and a purple ring around their avatar. This is the primary discovery mechanism. (2) Topic-based: rooms tagged with topics (Technology, Music, Sports) appear in topic feeds, and users follow topics to see relevant rooms. (3) Algorithmic recommendation: "Spaces you might like" based on topics of interest (inferred from the follow graph and engagement), rooms joined by similar users (collaborative filtering), and room popularity (listener count, engagement signals). (4) Scheduled rooms: hosts can schedule rooms in advance. Followers see scheduled rooms and can set reminders, and the scheduled room appears in a calendar-like discovery UI.

Notification strategy: be selective to avoid notification fatigue. Do not notify for every Space from every followed account. Prioritize close connections (frequent interactions), popular rooms (high listener count), and topics the user has engaged with before. The discovery feed ranks live rooms by listener count, speaker notability (verified accounts, high follower count), topic relevance, and network signals (how many of your connections are in the room).
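The ranking signals above can be combined into a simple weighted score. The weights and field names below are illustrative assumptions, not the product's actual formula:

```python
import math

def rank_rooms(rooms):
    """Order live rooms for the discovery feed by a weighted blend of the
    signals above; log-damp listener count so huge rooms don't dominate."""
    def score(room):
        return (0.4 * math.log1p(room["listener_count"])
                + 0.3 * room["speaker_notability"]    # e.g. 0-1 from verification/followers
                + 0.2 * room["topic_relevance"]       # e.g. 0-1 match to the user's topics
                + 0.1 * room["connections_in_room"])  # network signal
    return sorted(rooms, key=score, reverse=True)
```

In practice such weights would be learned from engagement data rather than hand-tuned, but the structure (popularity damped by a log, blended with personalization signals) is a common starting point.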
{"@context":"https://schema.org","@type":"FAQPage","mainEntity":[{"@type":"Question","name":"How do live audio rooms scale from 10 to 50,000 listeners?","acceptedAnswer":{"@type":"Answer","text":"Adaptive scaling with three tiers: Small rooms (<100 listeners): SFU handles speakers via WebRTC. Listeners connect directly to the SFU receiving the mixed audio stream. Medium rooms (100-1,000): relay tree. SFU sends mixed audio to 10 relay servers. Each relay serves 100 listeners. Listeners connect via WebSocket or low-latency streaming. Large rooms (1,000-50,000+): CDN distribution. The mixed audio is segmented into HLS/DASH and served via CDN edge servers. Latency increases to 3-5 seconds but scales infinitely. Interactive listeners (potential speakers) stay on WebRTC; passive listeners use CDN. The system adapts dynamically as listener count changes. Bandwidth: 50K listeners * 64 kbps Opus = 3.2 Gbps total, distributed across relay servers at 32 Mbps each (trivial). Speakers always use WebRTC to the SFU (10 connections max)."}},{"@type":"Question","name":"How does real-time audio moderation work in live rooms?","acceptedAnswer":{"@type":"Answer","text":"Speech-to-text (ASR) runs on the mixed audio stream in real-time, producing a transcript with 2-5 second delay. NLP classifies the transcript for policy violations: hate speech, threats, harassment, misinformation. Flagged segments are surfaced to human moderators. Fundamental challenge: by the time content is flagged (2-5 seconds), listeners have already heard it. Moderation is reactive, not preventive, for live audio. Mitigation: host moderation tools (mute/remove speakers instantly), co-host roles for additional moderation capacity, and automated muting if the confidence of violation is very high. The ASR transcript is also cleaned up and attached to the recording for post-hoc review, enabling search across past rooms and accessibility via captions."}}]}