Design a Mobile Voice Assistant: Wake Word and On-Device ASR

Voice assistants (Siri, Google Assistant, Alexa) are a deceptively complex mobile system design topic. Always-on listening for a wake word, on-device automatic speech recognition (ASR) for low-latency commands, intent parsing, and action routing — all on a battery-constrained mobile device. By 2026 the reality has shifted significantly toward on-device processing, for both privacy and latency.

Functional requirements

  • Detect wake word (“Hey Siri”, “OK Google”) in low-power mode
  • Transcribe spoken command (ASR)
  • Parse intent (what the user wants)
  • Execute the action (set timer, send message, query knowledge)
  • Speak response (TTS) where appropriate

The always-listening problem

The phone listens 24/7 for the wake word. Concerns:

  • Battery drain
  • Privacy (false triggers send audio to cloud)

Mitigations:

  • A dedicated low-power audio chip listens for the wake word (Apple's A-series and Google's Tensor chips include one)
  • Continuous listening draws under 1 mA
  • Wake-word detection runs entirely on-device
  • Only after a wake-word match does the main processor wake up

Wake word detection

A tiny ML model trained specifically for the wake phrase. Input: a short audio buffer. Output: the probability of a match.

Threshold tuning: too low → false triggers; too high → the user has to repeat themselves. Manufacturers tune this aggressively.

Personalization: the model can be fine-tuned on the user's voice so it is less likely to false-accept when other people say the phrase.
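As a rough illustration, here is what the gating logic might look like in Swift. `WakeWordModel` and its `score` function are hypothetical stand-ins for the tiny network; real detectors run on the dedicated low-power audio processor, not in application code.

```swift
// Hypothetical sketch of wake-word gating. WakeWordModel stands in
// for the tiny on-device network described above.
struct WakeWordModel {
    /// Returns P(wake phrase) for a short window of audio samples.
    func score(_ window: [Float]) -> Float {
        // Placeholder for the tiny neural network's forward pass.
        return 0.0
    }
}

final class WakeWordGate {
    private let model = WakeWordModel()
    // Too low -> false triggers; too high -> users repeat themselves.
    private let threshold: Float = 0.85

    func process(window: [Float], onWake: () -> Void) {
        if model.score(window) >= threshold {
            onWake()  // only now does the main processor spin up
        }
    }
}
```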

On-device ASR

Once the wake word matches, the next ~10 seconds of audio are processed:

  • Modern phones run Whisper-derived or proprietary ASR models on-device
  • Latency: sub-second for short commands
  • Privacy: audio never leaves the device for most commands

Cloud fallback: for long-form queries or low-confidence transcriptions, ASR can run in the cloud.
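On iOS, Apple's Speech framework exposes this directly: `SFSpeechRecognizer` can run on-device, and setting `requiresOnDeviceRecognition` guarantees the audio never leaves the phone (at the cost of the cloud fallback). A minimal sketch, assuming microphone and speech-recognition permissions are already granted:

```swift
import AVFoundation
import Speech

// Minimal sketch: transcribe a short command entirely on-device.
func transcribeCommand(audioEngine: AVAudioEngine) throws {
    guard let recognizer = SFSpeechRecognizer(locale: Locale(identifier: "en-US")),
          recognizer.supportsOnDeviceRecognition else { return }

    let request = SFSpeechAudioBufferRecognitionRequest()
    request.requiresOnDeviceRecognition = true  // audio never leaves the device

    let input = audioEngine.inputNode
    input.installTap(onBus: 0, bufferSize: 1024,
                     format: input.outputFormat(forBus: 0)) { buffer, _ in
        request.append(buffer)
    }
    audioEngine.prepare()
    try audioEngine.start()

    _ = recognizer.recognitionTask(with: request) { result, _ in
        if let result, result.isFinal {
            print("Transcript:", result.bestTranscription.formattedString)
        }
    }
}
```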

Intent parsing

Transcribed text → structured intent. Modern systems use LLM-based intent classification:

  • “Set a timer for 5 minutes” → SetTimer(duration=5min)
  • “What is the weather” → GetWeather(location=current)
  • “Tell me a joke” → TellJoke

For complex queries that don't map to a predefined intent, route to an LLM for a general-purpose response.
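A toy version of that mapping, with hand-written rules standing in for the learned classifier or LLM (the intent names below are illustrative):

```swift
// Hypothetical rule-based parser; production systems use learned
// classifiers or LLMs rather than hand-written rules.
enum Intent {
    case setTimer(minutes: Int)
    case getWeather
    case tellJoke
    case fallbackLLM(query: String)  // complex queries go to the LLM
}

func parseIntent(_ text: String) -> Intent {
    let t = text.lowercased()
    if t.contains("timer"),
       let match = t.firstMatch(of: /(\d+)\s*min/),
       let minutes = Int(match.1) {
        return .setTimer(minutes: minutes)
    }
    if t.contains("weather") { return .getWeather }
    if t.contains("joke") { return .tellJoke }
    return .fallbackLLM(query: text)
}
```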

Action execution

Each intent maps to an action:

  • System actions (set timer, alarm, calendar)
  • App actions (play song in Spotify, send message in WhatsApp)
  • Knowledge queries (search the web, summarize)

App developers register actions via platform APIs (App Intents on iOS, App Actions on Android).
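On iOS this looks roughly like the declaration below. `AppIntent`, `@Parameter`, and `ProvidesDialog` are the real framework types; the intent itself and `TimerStore` are illustrative.

```swift
import AppIntents

// Hypothetical in-app timer service.
final class TimerStore {
    static let shared = TimerStore()
    func start(minutes: Int) { /* schedule the timer */ }
}

// Exposes "start a timer" to the system assistant via App Intents.
struct StartTimerIntent: AppIntent {
    static var title: LocalizedStringResource = "Start Timer"

    @Parameter(title: "Minutes")
    var minutes: Int

    func perform() async throws -> some IntentResult & ProvidesDialog {
        TimerStore.shared.start(minutes: minutes)
        return .result(dialog: "Timer set for \(minutes) minutes")
    }
}
```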

Text-to-speech (TTS)

Modern TTS produces natural-sounding voices on-device. Latency: 100–300 ms for a typical sentence.

Voice cloning concerns: most platforms restrict output to predefined voices to prevent misuse.
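On iOS, on-device synthesis with a predefined system voice is a few lines with `AVSpeechSynthesizer`:

```swift
import AVFoundation

// On-device TTS with a predefined system voice.
let synthesizer = AVSpeechSynthesizer()

func speak(_ text: String) {
    let utterance = AVSpeechUtterance(string: text)
    utterance.voice = AVSpeechSynthesisVoice(language: "en-US")  // predefined voice only
    utterance.rate = AVSpeechUtteranceDefaultSpeechRate
    synthesizer.speak(utterance)
}
```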

Privacy

The strongest argument for on-device processing:

  • Audio never leaves the device for most commands
  • No cloud has a transcript of your private conversations
  • Users can opt out of cloud processing entirely (with reduced functionality)

Battery

  • Wake word: continuous, <1 mA
  • Active session: bursty, ~50–200 mA for a few seconds
  • Aggregate impact: typically <2% of battery per day (back-of-envelope check below)
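A quick sanity check of that aggregate figure; the battery capacity and session counts here are assumptions, not measurements:

```swift
import Foundation

// Back-of-envelope battery math; capacity and usage numbers are assumptions.
let batteryCapacitymAh = 4000.0                            // typical phone battery
let wakeWordmAh = 1.0 * 24.0                               // 1 mA, 24 h continuous
let sessionsPerDay = 50.0                                  // assumed activations
let sessionmAh = sessionsPerDay * 150.0 * (5.0 / 3600.0)   // ~150 mA for ~5 s each
let dailyPercent = (wakeWordmAh + sessionmAh) / batteryCapacitymAh * 100.0
print(String(format: "~%.1f%% of battery per day", dailyPercent))  // ~0.9%
```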

Frequently Asked Questions

Why is on-device ASR better than cloud?

Lower latency, better privacy, and offline operation. Cloud was needed for accuracy a few years ago; on-device models have since closed most of the gap.

Can I make my own voice assistant for an app?

Possible but usually not recommended. Use the platform APIs (Siri and Google Assistant integration) for system-level voice control, and reserve a custom voice pipeline for in-app commands.

How does the wake word handle background noise?

The model is trained on noisy data. Modern phones have multi-microphone arrays for beamforming and noise rejection.
