Design a Mobile Voice Memos / Audio Recording App

“Design a voice memos app” looks like a simple recorder until you list the requirements: high-quality audio capture, on-device transcription, cloud sync, sharing, and the polish bar set by iOS Voice Memos that everything is benchmarked against. Otter.ai, Apple Voice Memos, and Google Recorder (Pixel’s voice recorder) are the reference products. The interview tests audio engineering, ML transcription, and the product polish that audio uniquely demands.

Clarify scope

  • Quick voice notes or full transcription product?
  • Single-speaker, meeting (multi-speaker), or both?
  • Cloud sync across devices?
  • Sharing with non-users?
  • Live transcription or post-recording?

Audio capture

  • iOS: AVAudioRecorder or AVAudioEngine for more control
  • Android: MediaRecorder or AudioRecord
  • Format: AAC (small, common) or Apple Lossless (high quality)
  • Sample rate: 16 kHz for speech, 44.1 kHz for general
  • Mono is fine for speech
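The format/sample-rate trade-off above can be sketched as a small settings chooser. This is a minimal illustration, not a platform API; the type and function names are invented for the example:

```python
from dataclasses import dataclass

@dataclass
class CaptureSettings:
    codec: str
    sample_rate_hz: int
    channels: int
    bitrate_bps: int

def settings_for(use_case: str) -> CaptureSettings:
    if use_case == "speech":
        # Speech is intelligible at 16 kHz mono; a low bitrate keeps files small.
        return CaptureSettings("aac", 16_000, 1, 32_000)
    # General-purpose capture: 44.1 kHz and a higher bitrate.
    return CaptureSettings("aac", 44_100, 1, 96_000)
```

On iOS these values would map onto `AVAudioRecorder` settings keys; on Android, onto `MediaRecorder` configuration.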

Background recording

  • iOS: enable “Audio” background mode; the app continues recording when locked
  • Android: foreground service with notification; required for background-while-locked
  • User must see clear indication of active recording (red dot, persistent banner)

Audio levels

  • Real-time audio meter during recording
  • Visualization: VU meter, waveform, frequency spectrum
  • Useful for confirming microphone is working and levels are good
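The real-time meter is typically an RMS level of each audio buffer, displayed in dBFS (0 dBFS = full scale). A minimal sketch, assuming normalized float samples in −1.0..1.0:

```python
import math

def level_dbfs(samples: list[float]) -> float:
    """RMS level of one audio buffer, in dBFS (0 = full scale, -inf = silence)."""
    if not samples:
        return -math.inf
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return -math.inf if rms == 0 else 20 * math.log10(rms)
```

In production you would run this per callback buffer (e.g. every 1024 samples) and smooth the value before drawing the meter.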

Transcription

On-device

  • iOS: Speech framework (SFSpeechRecognizer) — supports many languages, on-device since iOS 13
  • Android: SpeechRecognizer with on-device variant
  • Or bundle Whisper.cpp for higher quality
  • Pros: privacy, no network
  • Cons: model size, slower than cloud

Cloud

  • Whisper API, Deepgram, AssemblyAI, Google Cloud Speech-to-Text
  • Higher quality, especially for multilingual / accented audio
  • Cost-per-minute billing
  • Latency: streaming for live, batch for post-recording

Hybrid

Most modern apps (as of 2026) default to on-device transcription, with cloud transcription offered as an opt-in “high-quality” option.
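The routing decision is simple enough to sketch. Function and parameter names here are illustrative, not from any SDK:

```python
def choose_engine(cloud_opted_in: bool, online: bool, on_device_available: bool) -> str:
    """Pick a transcription path: cloud only when opted in and online;
    otherwise on-device; otherwise keep the audio and transcribe later."""
    if cloud_opted_in and online:
        return "cloud"
    if on_device_available:
        return "on-device"
    return "defer"
```

The "defer" branch matters: the recording is always saved, and transcription is retried when an engine becomes available.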

Speaker diarization

Multi-speaker transcription requires separating who said what:

  • Pyannote-style on-device models
  • Cloud APIs (Deepgram, AssemblyAI) include diarization
  • UX: identify “Speaker 1”, “Speaker 2”; user can rename
  • Otter.ai is the reference for this UX

Search

  • Once transcribed, the audio becomes searchable text
  • Index transcripts for full-text search
  • Search jumps to the timestamp in the audio
  • Highlights matched phrases in the transcript
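Jump-to-timestamp search falls out naturally if the transcript is stored at word granularity with start times and speaker labels. A minimal sketch over an assumed `(start_seconds, speaker, word)` representation:

```python
def search(transcript, phrase):
    """Find the first occurrence of `phrase` in a word-level transcript.

    `transcript` is a list of (start_seconds, speaker, word) tuples.
    Returns (timestamp, speaker) of the match, or None.
    """
    words = [w.lower() for _, _, w in transcript]
    target = phrase.lower().split()
    for i in range(len(words) - len(target) + 1):
        if words[i:i + len(target)] == target:
            start, speaker, _ = transcript[i]
            return start, speaker
    return None
```

A real app would back this with a full-text index (e.g. SQLite FTS) rather than a linear scan, but the timestamp mapping is the same idea.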

Cloud sync

  • Audio file + transcript synced
  • Audio uploaded once; transcript can be regenerated cloud-side
  • Conflicts are rare: recordings are append-only and single-author, so last-writer-wins on metadata is sufficient
  • iCloud / Google Drive / app-specific cloud
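Because the library is append-only and single-author, the merge is just a union of recordings by ID, with last-writer-wins on metadata. A sketch, assuming each recording carries a `modified` timestamp:

```python
def merge_libraries(local: dict, remote: dict) -> dict:
    """Union of recording dicts keyed by ID; newer `modified` wins per recording."""
    merged = dict(remote)
    for rec_id, rec in local.items():
        if rec_id not in merged or rec["modified"] > merged[rec_id]["modified"]:
            merged[rec_id] = rec
    return merged
```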

Sharing

  • Share audio file via system share sheet
  • Or share a link to a transcript page
  • Time-stamped quote: “Listen at 12:45”
  • Privacy: link with optional password / expiration

Editing

  • Trim start / end
  • Cut middle sections
  • Edit transcript and re-export
  • Apple Voice Memos: simple trim-and-replace; advanced editing not supported
  • Otter / Descript: word-level edits affect audio (cuts the corresponding audio chunks)
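The Descript-style editing model maps deleted transcript words back to time ranges to cut from the audio, merging adjacent deletions into one cut. A sketch, assuming each word carries `(start, end)` times in seconds:

```python
def cut_ranges(words, deleted_indices):
    """Convert deleted word indices into merged (start, end) audio cut ranges.

    `words` is a list of (start_seconds, end_seconds) per word.
    Adjacent or overlapping deletions collapse into a single cut.
    """
    ranges = []
    for i in sorted(deleted_indices):
        start, end = words[i]
        if ranges and start <= ranges[-1][1]:
            ranges[-1] = (ranges[-1][0], max(ranges[-1][1], end))
        else:
            ranges.append((start, end))
    return ranges
```

The audio engine then renders the file minus these ranges (ideally with short crossfades at each cut to avoid clicks).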

Storage and management

  • Per-recording metadata (date, location optionally, duration, title)
  • Folder organization
  • Search across recordings
  • Delete: confirm; cleared from cloud too
  • Storage quota indicator

Privacy

  • Microphone permission with clear rationale
  • Recording indicator visible at all times
  • Privacy nutrition labels: what is uploaded, what stays on device
  • Some jurisdictions require all-party consent for recordings

Performance considerations

  • Audio file size: 1 hour at 32 kbps AAC ~14 MB; manageable
  • Transcript generation: 1 minute audio → 5–30 sec on-device, faster on cloud
  • Battery: continuous recording is moderate drain; transcription on-device is heavier
  • Memory: stream-decode long recordings; do not load whole file
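The file-size figure above is simple arithmetic worth being able to do live: bitrate × duration ÷ 8 gives bytes. A one-liner (container overhead ignored):

```python
def recording_size_mb(duration_seconds: float, bitrate_bps: int) -> float:
    """Approximate compressed audio size in MB; ignores container overhead."""
    return duration_seconds * bitrate_bps / 8 / 1_000_000
```

One hour at 32 kbps AAC is 3600 × 32000 / 8 bytes ≈ 14.4 MB, matching the estimate above.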

Watch / iOS Live Activity

  • Watch app: start/stop recording (Apple Voice Memos has this)
  • Live Activity: visible recording indicator on lock screen
  • Background continues seamlessly

Edge cases interviewers love

  • Phone call interrupts recording — pause cleanly, resume after
  • App killed during long recording — restore from where it left off
  • Storage runs out — graceful “stop and save what you have”
  • Transcription fails (model timeout) — keep audio, retry transcription
  • Two-party consent state — surface a clear notice if location requires
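The interruption case is worth sketching as an explicit state machine, since "pause cleanly, resume after" is exactly what platform interruption notifications drive. This is an illustrative model, not platform code; the flush/finalize steps are left as comments:

```python
class RecordingSession:
    """Tiny state machine: interruptions pause cleanly, then resume or stop."""

    def __init__(self):
        self.state = "idle"

    def start(self):
        assert self.state == "idle"
        self.state = "recording"

    def interruption_began(self):
        # e.g. an incoming phone call takes the audio session
        if self.state == "recording":
            self.state = "paused"       # flush buffered audio to disk here

    def interruption_ended(self, should_resume: bool):
        if self.state == "paused":
            self.state = "recording" if should_resume else "stopped"

    def stop(self):
        self.state = "stopped"          # finalize the file; keep partial audio
```

Writing buffers to disk on every pause also covers the "app killed mid-recording" case: the partial file is always recoverable on next launch.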

What separates senior from staff

Senior candidates handle audio capture and basic transcription. Staff candidates discuss on-device vs cloud transcription, diarization, the editing pipeline (waveform-aware), and privacy/consent. Principal candidates raise the live-vs-batch transcription architecture and the multi-device sync conflict model.

Frequently Asked Questions

Should I use Whisper directly?

For on-device, Whisper.cpp or Distil-Whisper. For cloud, the Whisper API (or Deepgram, which is faster and supports diarization). Whisper has the best multilingual quality.

How do I handle very long recordings (8+ hours)?

Stream the audio to disk in chunks. Transcribe in batches, not the whole file at once. Most apps cap at 4–8 hours per recording for product simplicity.
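Batch transcription of a long file means splitting it into overlapping spans; the overlap lets the stitcher reconcile words cut at a chunk boundary. A sketch with illustrative defaults (10-minute chunks, 5-second overlap):

```python
def chunk_spans(total_seconds: float,
                chunk_seconds: float = 600.0,
                overlap_seconds: float = 5.0):
    """Split a recording into overlapping (start, end) spans for batch transcription."""
    spans, start = [], 0.0
    while start < total_seconds:
        end = min(start + chunk_seconds, total_seconds)
        spans.append((start, end))
        if end >= total_seconds:
            break
        start = end - overlap_seconds   # overlap so boundary words appear in both chunks
    return spans
```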

What about LLM-powered summaries?

2025+ trend: send transcript to an LLM for summary, action items, key topics. Otter, Granola, Fireflies all do this. Strong product feature but costly per-recording.
