Voice assistants (Siri, Google Assistant, Alexa) are a deceptively complex mobile system design topic. Always-on listening for a wake word, on-device automatic speech recognition for low-latency commands, intent parsing, and action routing, all on a battery-constrained mobile device. The 2026 reality has shifted significantly toward on-device processing, for both privacy and latency.
Functional requirements
- Detect wake word (“Hey Siri”, “OK Google”) in low-power mode
- Transcribe spoken command (ASR)
- Parse intent (what the user wants)
- Execute the action (set timer, send message, query knowledge)
- Speak response (TTS) where appropriate
The always-listening problem
The phone listens 24/7 for the wake word. Concerns:
- Battery drain
- Privacy (false triggers can send audio to the cloud)
Mitigations:
- A dedicated low-power audio chip listens for the wake word (Apple's A-series and Google's Tensor SoCs include one)
- The always-on listening path draws <1mA
- Wake-word detection runs entirely on-device
- The main application processor wakes only after a wake-word match
Wake word detection
A tiny ML model trained specifically for the wake phrase. Input: a short audio buffer. Output: a match probability.
Threshold tuning: too low → false triggers; too high → the user has to repeat themselves. Manufacturers tune this aggressively, as sketched below.
Personalization: the model can be fine-tuned on the user's voice so it is less likely to accept the phrase from other speakers.
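To make the threshold discussion concrete, here is a minimal Swift sketch of the trigger logic. The probabilities stand in for the output of a hypothetical per-frame wake-word classifier, and `WakeWordGate`, its threshold, and the frame count are illustrative, not any vendor's actual tuning. Requiring several consecutive frames above the threshold buys false-trigger resistance at the cost of a few tens of milliseconds of latency.

```swift
import Foundation

/// Minimal wake-word gate: fires only after several consecutive frames
/// clear the threshold, trading a little latency for fewer false triggers.
struct WakeWordGate {
    let threshold: Double   // tuned per device; lower = more false accepts
    let requiredFrames: Int // consecutive frames above threshold to fire
    var streak = 0

    /// Feed one frame's match probability (e.g. from a tiny on-device
    /// classifier scoring ~20 ms audio buffers). Returns true on trigger.
    mutating func ingest(probability: Double) -> Bool {
        streak = probability >= threshold ? streak + 1 : 0
        if streak >= requiredFrames {
            streak = 0
            return true
        }
        return false
    }
}

// Scores below stand in for outputs of a hypothetical wake-word model.
var gate = WakeWordGate(threshold: 0.85, requiredFrames: 3)
for p in [0.2, 0.9, 0.92, 0.6, 0.88, 0.91, 0.95] {
    if gate.ingest(probability: p) {
        print("Wake word detected, waking the main processor")
    }
}
```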
On-device ASR
Once the wake word matches, the next ~10 seconds of audio are processed:
- Modern phones run Whisper-derived or proprietary ASR models on-device
- Latency: sub-second for short commands
- Privacy: audio never leaves the device for most commands
Cloud fallback: for long-form queries or low-confidence transcriptions, ASR can run in the cloud.
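On iOS, Apple's Speech framework exposes this pipeline directly; a sketch follows, with authorization prompts and error handling trimmed for brevity. Setting `requiresOnDeviceRecognition` keeps audio local when the hardware supports it; leaving it false permits exactly the cloud fallback described above.

```swift
import Speech
import AVFoundation

/// Stream microphone audio into the system recognizer, preferring
/// on-device transcription when the device supports it.
func startTranscribing() throws {
    guard let recognizer = SFSpeechRecognizer(locale: Locale(identifier: "en-US")),
          recognizer.isAvailable else { return }

    let request = SFSpeechAudioBufferRecognitionRequest()
    // Keep audio on-device whenever the hardware/model allows it.
    request.requiresOnDeviceRecognition = recognizer.supportsOnDeviceRecognition

    let engine = AVAudioEngine()
    let input = engine.inputNode
    let format = input.outputFormat(forBus: 0)
    input.installTap(onBus: 0, bufferSize: 1024, format: format) { buffer, _ in
        request.append(buffer)   // feed mic buffers to the recognizer
    }
    engine.prepare()
    try engine.start()

    // Real code retains the task so the session can be cancelled.
    _ = recognizer.recognitionTask(with: request) { result, error in
        if let result {
            print("Transcript:", result.bestTranscription.formattedString)
        }
        if result?.isFinal == true || error != nil {
            engine.stop()
            input.removeTap(onBus: 0)
        }
    }
}
```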
Intent parsing
Transcribed text → structured intent. Modern systems use LLM-based intent classification:
- “Set a timer for 5 minutes” → SetTimer(duration=5min)
- “What is the weather” → GetWeather(location=current)
- “Tell me a joke” → TellJoke
For complex queries, the assistant routes the transcript to a general-purpose LLM for a free-form response.
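The output shape matters more than the classifier itself. Here is a toy rule-based version in Swift to show that shape; production assistants use learned, increasingly LLM-based, classifiers, and the `Intent` cases and string patterns here are illustrative only.

```swift
import Foundation

/// Structured intents the assistant knows how to execute.
enum Intent {
    case setTimer(minutes: Int)
    case getWeather
    case tellJoke
    case fallbackLLM(query: String)  // anything we can't classify locally
}

/// Toy rule-based classifier; a real system swaps in a learned model
/// but produces the same structured output.
func parseIntent(_ transcript: String) -> Intent {
    let text = transcript.lowercased()
    if text.contains("timer"),
       let match = text.range(of: #"\d+"#, options: .regularExpression),
       let minutes = Int(text[match]) {
        return .setTimer(minutes: minutes)
    }
    if text.contains("weather") { return .getWeather }
    if text.contains("joke") { return .tellJoke }
    return .fallbackLLM(query: transcript)
}

print(parseIntent("Set a timer for 5 minutes"))  // setTimer(minutes: 5)
print(parseIntent("What is the weather"))        // getWeather
```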
Action execution
Each intent maps to an action:
- System actions (set timer, alarm, calendar)
- App actions (play a song in Spotify, send a message in WhatsApp)
- Knowledge queries (search the web, summarize)
App developers register actions via platform APIs (App Intents on iOS, App Actions on Android).
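On iOS that registration looks roughly like the App Intents sketch below. `TimerService` is a hypothetical app-side helper, and the real integration also involves app shortcuts and parameter resolution, omitted here.

```swift
import AppIntents

/// Hypothetical app-side timer backend.
final class TimerService {
    static let shared = TimerService()
    func start(minutes: Int) { /* schedule the timer / notification */ }
}

/// Exposes "set a timer" to Siri and system search via App Intents.
struct SetTimerIntent: AppIntent {
    static var title: LocalizedStringResource = "Set a Timer"

    @Parameter(title: "Minutes")
    var minutes: Int

    func perform() async throws -> some IntentResult & ProvidesDialog {
        TimerService.shared.start(minutes: minutes)
        return .result(dialog: "Timer set for \(minutes) minutes.")
    }
}
```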
Text-to-speech (TTS)
Modern TTS produces natural-sounding voices on-device. Latency: 100–300ms for typical sentences.
Voice cloning concerns: most platforms restrict to predefined voices to prevent misuse.
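On iOS, speaking a response with one of those predefined system voices is a few lines with `AVSpeechSynthesizer`; synthesis happens locally.

```swift
import AVFoundation

// Keep the synthesizer alive for the duration of playback.
let synthesizer = AVSpeechSynthesizer()

func speak(_ text: String) {
    let utterance = AVSpeechUtterance(string: text)
    utterance.voice = AVSpeechSynthesisVoice(language: "en-US") // predefined voice
    utterance.rate = AVSpeechUtteranceDefaultSpeechRate
    synthesizer.speak(utterance)  // audio is generated on-device
}

speak("Timer set for 5 minutes.")
```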
Privacy
The strongest argument for on-device processing:
- Audio never leaves the device for most commands
- No cloud has a transcript of your private conversations
- Users can opt out of cloud processing entirely (with reduced functionality)
Battery
- Wake word: continuous, <1mA
- Active session: bursty, ~50–200mA for a few seconds
- Aggregate impact: typically <2% per day
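A rough sanity check, assuming a ~4,000mAh battery: the wake-word path at 1mA for 24 hours costs 24mAh (about 0.6%), and even fifty 3-second active sessions at 200mA add only ~8mAh more, so the <2% figure is plausible.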
Frequently Asked Questions
Why is on-device ASR better than cloud?
Lower latency, better privacy, works offline. Cloud was needed for accuracy a few years ago; on-device models have closed most of the gap.
Can I make my own voice assistant for an app?
Possible, but usually not recommended. Use the platform APIs (Siri/Google Assistant integration) for system-level voice control; a custom voice pipeline makes sense only for in-app commands.
How does the wake word handle background noise?
The model is trained on noisy data. Modern phones have multi-microphone arrays for beamforming and noise rejection.
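For intuition, here is a toy delay-and-sum beamformer in Swift: align each microphone's samples by a steering delay toward the talker, then average, so on-axis speech adds coherently while off-axis noise partially cancels. Real devices do this in the frequency domain with adaptive weights; the function and integer-sample delays here are illustrative only.

```swift
import Foundation

/// Toy delay-and-sum beamformer over multi-microphone audio.
/// `delays[m]` is mic m's steering delay, in samples, toward the talker.
func delayAndSum(channels: [[Float]], delays: [Int]) -> [Float] {
    precondition(!channels.isEmpty && channels.count == delays.count)
    let length = channels[0].count
    var output = [Float](repeating: 0, count: length)
    for (signal, delay) in zip(channels, delays) {
        for i in 0..<length where i - delay >= 0 && i - delay < signal.count {
            output[i] += signal[i - delay]  // coherent sum of aligned samples
        }
    }
    return output.map { $0 / Float(channels.count) }  // average across mics
}
```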