Voice assistants (Siri, Google Assistant, Alexa) are a deceptively complex mobile system design topic. Always-on listening for a wake word, on-device automatic speech recognition for low-latency commands, intent parsing, and action routing, all on a battery-constrained mobile device. The 2026 reality has shifted significantly toward on-device processing, for both privacy and latency.
Functional requirements
- Detect wake word (“Hey Siri”, “OK Google”) in low-power mode
- Transcribe spoken command (ASR)
- Parse intent (what the user wants)
- Execute the action (set timer, send message, query knowledge)
- Speak response (TTS) where appropriate
The always-listening problem
The phone listens 24/7 for the wake word. Concerns:
- Battery drain
- Privacy (false triggers can send audio to the cloud)
Mitigations:
- A dedicated low-power audio chip listens for the wake word (Apple's A-series and Google's Tensor SoCs include one)
- The always-on listening path draws <1mA
- Wake-word detection runs entirely on-device
- The main application processor wakes only after a wake-word match
Wake word detection
A tiny ML model trained specifically for the wake phrase. Input: a short audio buffer. Output: a match probability.
Threshold tuning: too low → false triggers; too high → the user has to repeat themselves. Manufacturers tune this aggressively, as sketched below.
Personalization: the model can be fine-tuned on the user's voice so it is less likely to accept the phrase from other speakers.
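To make the threshold discussion concrete, here is a minimal Swift sketch of the trigger logic. The probabilities stand in for the output of a hypothetical per-frame wake-word classifier, and `WakeWordGate`, its threshold, and the frame count are illustrative, not any vendor's actual tuning. Requiring several consecutive frames above the threshold buys false-trigger resistance at the cost of a few tens of milliseconds of latency.

```swift
import Foundation

/// Minimal wake-word gate: fires only after several consecutive frames
/// clear the threshold, trading a little latency for fewer false triggers.
struct WakeWordGate {
    let threshold: Double   // tuned per device; lower = more false accepts
    let requiredFrames: Int // consecutive frames above threshold to fire
    var streak = 0

    /// Feed one frame's match probability (e.g. from a tiny on-device
    /// classifier scoring ~20 ms audio buffers). Returns true on trigger.
    mutating func ingest(probability: Double) -> Bool {
        streak = probability >= threshold ? streak + 1 : 0
        if streak >= requiredFrames {
            streak = 0
            return true
        }
        return false
    }
}

// Scores below stand in for outputs of a hypothetical wake-word model.
var gate = WakeWordGate(threshold: 0.85, requiredFrames: 3)
for p in [0.2, 0.9, 0.92, 0.6, 0.88, 0.91, 0.95] {
    if gate.ingest(probability: p) {
        print("Wake word detected, waking the main processor")
    }
}
```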
On-device ASR
Once the wake word matches, the next ~10 seconds of audio are processed:
- Modern phones run Whisper-derived or proprietary ASR models on-device
- Latency: sub-second for short commands
- Privacy: audio never leaves the device for most commands
Cloud fallback: for long-form queries or low-confidence transcriptions, ASR can run in the cloud.
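On iOS, Apple's Speech framework exposes this pipeline directly; a sketch follows, with authorization prompts and error handling trimmed for brevity. Setting `requiresOnDeviceRecognition` keeps audio local when the hardware supports it; leaving it false permits exactly the cloud fallback described above.

```swift
import Speech
import AVFoundation

/// Stream microphone audio into the system recognizer, preferring
/// on-device transcription when the device supports it.
func startTranscribing() throws {
    guard let recognizer = SFSpeechRecognizer(locale: Locale(identifier: "en-US")),
          recognizer.isAvailable else { return }

    let request = SFSpeechAudioBufferRecognitionRequest()
    // Keep audio on-device whenever the hardware/model allows it.
    request.requiresOnDeviceRecognition = recognizer.supportsOnDeviceRecognition

    let engine = AVAudioEngine()
    let input = engine.inputNode
    let format = input.outputFormat(forBus: 0)
    input.installTap(onBus: 0, bufferSize: 1024, format: format) { buffer, _ in
        request.append(buffer)   // feed mic buffers to the recognizer
    }
    engine.prepare()
    try engine.start()

    // Real code retains the task so the session can be cancelled.
    _ = recognizer.recognitionTask(with: request) { result, error in
        if let result {
            print("Transcript:", result.bestTranscription.formattedString)
        }
        if result?.isFinal == true || error != nil {
            engine.stop()
            input.removeTap(onBus: 0)
        }
    }
}
```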
Intent parsing
Transcribed text → structured intent. Modern systems use LLM-based intent classification:
- “Set a timer for 5 minutes” → SetTimer(duration=5min)
- “What is the weather” → GetWeather(location=current)
- “Tell me a joke” → TellJoke
For complex queries, the assistant routes the transcript to a general-purpose LLM for a free-form response.
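The output shape matters more than the classifier itself. Here is a toy rule-based version in Swift to show that shape; production assistants use learned, increasingly LLM-based, classifiers, and the `Intent` cases and string patterns here are illustrative only.

```swift
import Foundation

/// Structured intents the assistant knows how to execute.
enum Intent {
    case setTimer(minutes: Int)
    case getWeather
    case tellJoke
    case fallbackLLM(query: String)  // anything we can't classify locally
}

/// Toy rule-based classifier; a real system swaps in a learned model
/// but produces the same structured output.
func parseIntent(_ transcript: String) -> Intent {
    let text = transcript.lowercased()
    if text.contains("timer"),
       let match = text.range(of: #"\d+"#, options: .regularExpression),
       let minutes = Int(text[match]) {
        return .setTimer(minutes: minutes)
    }
    if text.contains("weather") { return .getWeather }
    if text.contains("joke") { return .tellJoke }
    return .fallbackLLM(query: transcript)
}

print(parseIntent("Set a timer for 5 minutes"))  // setTimer(minutes: 5)
print(parseIntent("What is the weather"))        // getWeather
```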
Action execution
Each intent maps to an action:
- System actions (set timer, alarm, calendar)
- App actions (play a song in Spotify, send a message in WhatsApp)
- Knowledge queries (search the web, summarize)
App developers register actions via platform APIs (App Intents on iOS, App Actions on Android).
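On iOS that registration looks roughly like the App Intents sketch below. `TimerService` is a hypothetical app-side helper, and the real integration also involves app shortcuts and parameter resolution, omitted here.

```swift
import AppIntents

/// Hypothetical app-side timer backend.
final class TimerService {
    static let shared = TimerService()
    func start(minutes: Int) { /* schedule the timer / notification */ }
}

/// Exposes "set a timer" to Siri and system search via App Intents.
struct SetTimerIntent: AppIntent {
    static var title: LocalizedStringResource = "Set a Timer"

    @Parameter(title: "Minutes")
    var minutes: Int

    func perform() async throws -> some IntentResult & ProvidesDialog {
        TimerService.shared.start(minutes: minutes)
        return .result(dialog: "Timer set for \(minutes) minutes.")
    }
}
```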
Text-to-speech (TTS)
Modern TTS produces natural-sounding voices on-device. Latency: 100–300ms for typical sentences.
Voice cloning concerns: most platforms restrict to predefined voices to prevent misuse.
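On iOS, speaking a response with one of those predefined system voices is a few lines with `AVSpeechSynthesizer`; synthesis happens locally.

```swift
import AVFoundation

// Keep the synthesizer alive for the duration of playback.
let synthesizer = AVSpeechSynthesizer()

func speak(_ text: String) {
    let utterance = AVSpeechUtterance(string: text)
    utterance.voice = AVSpeechSynthesisVoice(language: "en-US") // predefined voice
    utterance.rate = AVSpeechUtteranceDefaultSpeechRate
    synthesizer.speak(utterance)  // audio is generated on-device
}

speak("Timer set for 5 minutes.")
```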
Privacy
The strongest argument for on-device processing:
- Audio never leaves the device for most commands
- No cloud has a transcript of your private conversations
- Users can opt out of cloud processing entirely (with reduced functionality)
Battery
- Wake word: continuous, <1mA
- Active session: bursty, ~50–200mA for a few seconds
- Aggregate impact: typically <2% per day
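A rough sanity check, assuming a ~4,000mAh battery: the wake-word path at 1mA for 24 hours costs 24mAh (about 0.6%), and even fifty 3-second active sessions at 200mA add only ~8mAh more, so the <2% figure is plausible.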
Frequently Asked Questions
Why is on-device ASR better than cloud?
Lower latency, better privacy, works offline. Cloud was needed for accuracy a few years ago; on-device models have closed most of the gap.
Can I make my own voice assistant for an app?
Possible, but usually not recommended. Use the platform APIs (Siri/Google Assistant integration) for system-level voice control; a custom voice pipeline makes sense only for in-app commands.
How does the wake word handle background noise?
The model is trained on noisy data. Modern phones have multi-microphone arrays for beamforming and noise rejection.
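For intuition, here is a toy delay-and-sum beamformer in Swift: align each microphone's samples by a steering delay toward the talker, then average, so on-axis speech adds coherently while off-axis noise partially cancels. Real devices do this in the frequency domain with adaptive weights; the function and integer-sample delays here are illustrative only.

```swift
import Foundation

/// Toy delay-and-sum beamformer over multi-microphone audio.
/// `delays[m]` is mic m's steering delay, in samples, toward the talker.
func delayAndSum(channels: [[Float]], delays: [Int]) -> [Float] {
    precondition(!channels.isEmpty && channels.count == delays.count)
    let length = channels[0].count
    var output = [Float](repeating: 0, count: length)
    for (signal, delay) in zip(channels, delays) {
        for i in 0..<length where i - delay >= 0 && i - delay < signal.count {
            output[i] += signal[i - delay]  // coherent sum of aligned samples
        }
    }
    return output.map { $0 / Float(channels.count) }  // average across mics
}
```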