Design a Mobile Voice Memos / Audio Recording App

“Design a voice memos app” looks like a simple recorder until you list the requirements: high-quality audio capture, on-device transcription, cloud sync, sharing, and the polish bar set by iOS Voice Memos that everything is benchmarked against. Otter.ai, Apple Voice Memos, and Google Recorder (Pixel’s voice recorder) are the reference products. The interview tests audio engineering, ML transcription, and the product polish that audio uniquely demands.

Clarify scope

  • Quick voice notes or full transcription product?
  • Single-speaker, meeting (multi-speaker), or both?
  • Cloud sync across devices?
  • Sharing with non-users?
  • Live transcription or post-recording?

Audio capture

  • iOS: AVAudioRecorder or AVAudioEngine for more control
  • Android: MediaRecorder or AudioRecord
  • Format: AAC (small, common) or Apple Lossless (high quality)
  • Sample rate: 16 kHz for speech, 44.1 kHz for general
  • Mono is fine for speech
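The format/sample-rate trade-off above can be sketched as a small settings chooser. This is a minimal illustration, not a platform API; the type and function names are invented for the example:

```python
from dataclasses import dataclass

@dataclass
class CaptureSettings:
    codec: str
    sample_rate_hz: int
    channels: int
    bitrate_bps: int

def settings_for(use_case: str) -> CaptureSettings:
    if use_case == "speech":
        # Speech is intelligible at 16 kHz mono; a low bitrate keeps files small.
        return CaptureSettings("aac", 16_000, 1, 32_000)
    # General-purpose capture: 44.1 kHz and a higher bitrate.
    return CaptureSettings("aac", 44_100, 1, 96_000)
```

On iOS these values would map onto `AVAudioRecorder` settings keys; on Android, onto `MediaRecorder` configuration.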

Background recording

  • iOS: enable “Audio” background mode; the app continues recording when locked
  • Android: foreground service with notification; required for background-while-locked
  • User must see clear indication of active recording (red dot, persistent banner)

Audio levels

  • Real-time audio meter during recording
  • Visualization: VU meter, waveform, frequency spectrum
  • Useful for confirming microphone is working and levels are good
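The real-time meter is typically an RMS level of each audio buffer, displayed in dBFS (0 dBFS = full scale). A minimal sketch, assuming normalized float samples in −1.0..1.0:

```python
import math

def level_dbfs(samples: list[float]) -> float:
    """RMS level of one audio buffer, in dBFS (0 = full scale, -inf = silence)."""
    if not samples:
        return -math.inf
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return -math.inf if rms == 0 else 20 * math.log10(rms)
```

In production you would run this per callback buffer (e.g. every 1024 samples) and smooth the value before drawing the meter.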

Transcription

On-device

  • iOS: Speech framework (SFSpeechRecognizer) — supports many languages, on-device since iOS 13
  • Android: SpeechRecognizer with on-device variant
  • Or bundle Whisper.cpp for higher quality
  • Pros: privacy, no network
  • Cons: model size, slower than cloud

Cloud

  • Whisper API, Deepgram, AssemblyAI, Google Cloud Speech-to-Text
  • Higher quality, especially for multilingual / accented audio
  • Cost-per-minute billing
  • Latency: streaming for live, batch for post-recording

Hybrid

Most modern apps (as of 2026) default to on-device transcription, with cloud transcription offered as an opt-in “high-quality” option.
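The routing decision is simple enough to sketch. Function and parameter names here are illustrative, not from any SDK:

```python
def choose_engine(cloud_opted_in: bool, online: bool, on_device_available: bool) -> str:
    """Pick a transcription path: cloud only when opted in and online;
    otherwise on-device; otherwise keep the audio and transcribe later."""
    if cloud_opted_in and online:
        return "cloud"
    if on_device_available:
        return "on-device"
    return "defer"
```

The "defer" branch matters: the recording is always saved, and transcription is retried when an engine becomes available.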

Speaker diarization

Multi-speaker transcription requires separating who said what:

  • Pyannote-style on-device models
  • Cloud APIs (Deepgram, AssemblyAI) include diarization
  • UX: identify “Speaker 1”, “Speaker 2”; user can rename
  • Otter.ai is the reference for this UX

Search

  • Once transcribed, the audio becomes searchable text
  • Index transcripts for full-text search
  • Search jumps to the timestamp in the audio
  • Highlights matched phrases in the transcript
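Jump-to-timestamp search falls out naturally if the transcript is stored at word granularity with start times and speaker labels. A minimal sketch over an assumed `(start_seconds, speaker, word)` representation:

```python
def search(transcript, phrase):
    """Find the first occurrence of `phrase` in a word-level transcript.

    `transcript` is a list of (start_seconds, speaker, word) tuples.
    Returns (timestamp, speaker) of the match, or None.
    """
    words = [w.lower() for _, _, w in transcript]
    target = phrase.lower().split()
    for i in range(len(words) - len(target) + 1):
        if words[i:i + len(target)] == target:
            start, speaker, _ = transcript[i]
            return start, speaker
    return None
```

A real app would back this with a full-text index (e.g. SQLite FTS) rather than a linear scan, but the timestamp mapping is the same idea.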

Cloud sync

  • Audio file + transcript synced
  • Audio uploaded once; transcript can be regenerated cloud-side
  • Conflicts are rare: recordings are append-only and single-author, so last-writer-wins on metadata is sufficient
  • iCloud / Google Drive / app-specific cloud
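Because the library is append-only and single-author, the merge is just a union of recordings by ID, with last-writer-wins on metadata. A sketch, assuming each recording carries a `modified` timestamp:

```python
def merge_libraries(local: dict, remote: dict) -> dict:
    """Union of recording dicts keyed by ID; newer `modified` wins per recording."""
    merged = dict(remote)
    for rec_id, rec in local.items():
        if rec_id not in merged or rec["modified"] > merged[rec_id]["modified"]:
            merged[rec_id] = rec
    return merged
```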

Sharing

  • Share audio file via system share sheet
  • Or share a link to a transcript page
  • Time-stamped quote: “Listen at 12:45”
  • Privacy: link with optional password / expiration

Editing

  • Trim start / end
  • Cut middle sections
  • Edit transcript and re-export
  • Apple Voice Memos: simple trim-and-replace; advanced editing not supported
  • Otter / Descript: word-level edits affect audio (cuts the corresponding audio chunks)
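The Descript-style editing model maps deleted transcript words back to time ranges to cut from the audio, merging adjacent deletions into one cut. A sketch, assuming each word carries `(start, end)` times in seconds:

```python
def cut_ranges(words, deleted_indices):
    """Convert deleted word indices into merged (start, end) audio cut ranges.

    `words` is a list of (start_seconds, end_seconds) per word.
    Adjacent or overlapping deletions collapse into a single cut.
    """
    ranges = []
    for i in sorted(deleted_indices):
        start, end = words[i]
        if ranges and start <= ranges[-1][1]:
            ranges[-1] = (ranges[-1][0], max(ranges[-1][1], end))
        else:
            ranges.append((start, end))
    return ranges
```

The audio engine then renders the file minus these ranges (ideally with short crossfades at each cut to avoid clicks).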

Storage and management

  • Per-recording metadata (date, location optionally, duration, title)
  • Folder organization
  • Search across recordings
  • Delete: confirm; cleared from cloud too
  • Storage quota indicator

Privacy

  • Microphone permission with clear rationale
  • Recording indicator visible at all times
  • Privacy nutrition labels: what is uploaded, what stays on device
  • Some jurisdictions require all-party consent for recordings

Performance considerations

  • Audio file size: 1 hour at 32 kbps AAC ~14 MB; manageable
  • Transcript generation: 1 minute audio → 5–30 sec on-device, faster on cloud
  • Battery: continuous recording is moderate drain; transcription on-device is heavier
  • Memory: stream-decode long recordings; do not load whole file
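The file-size figure above is simple arithmetic worth being able to do live: bitrate × duration ÷ 8 gives bytes. A one-liner (container overhead ignored):

```python
def recording_size_mb(duration_seconds: float, bitrate_bps: int) -> float:
    """Approximate compressed audio size in MB; ignores container overhead."""
    return duration_seconds * bitrate_bps / 8 / 1_000_000
```

One hour at 32 kbps AAC is 3600 × 32000 / 8 bytes ≈ 14.4 MB, matching the estimate above.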

Watch / iOS Live Activity

  • Watch app: start/stop recording (Apple Voice Memos has this)
  • Live Activity: visible recording indicator on lock screen
  • Background continues seamlessly

Edge cases interviewers love

  • Phone call interrupts recording — pause cleanly, resume after
  • App killed during long recording — restore from where it left off
  • Storage runs out — graceful “stop and save what you have”
  • Transcription fails (model timeout) — keep audio, retry transcription
  • Two-party consent state — surface a clear notice if location requires
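The interruption case is worth sketching as an explicit state machine, since "pause cleanly, resume after" is exactly what platform interruption notifications drive. This is an illustrative model, not platform code; the flush/finalize steps are left as comments:

```python
class RecordingSession:
    """Tiny state machine: interruptions pause cleanly, then resume or stop."""

    def __init__(self):
        self.state = "idle"

    def start(self):
        assert self.state == "idle"
        self.state = "recording"

    def interruption_began(self):
        # e.g. an incoming phone call takes the audio session
        if self.state == "recording":
            self.state = "paused"       # flush buffered audio to disk here

    def interruption_ended(self, should_resume: bool):
        if self.state == "paused":
            self.state = "recording" if should_resume else "stopped"

    def stop(self):
        self.state = "stopped"          # finalize the file; keep partial audio
```

Writing buffers to disk on every pause also covers the "app killed mid-recording" case: the partial file is always recoverable on next launch.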

What separates senior from staff

Senior candidates handle audio capture and basic transcription. Staff candidates discuss on-device vs cloud transcription, diarization, the editing pipeline (waveform-aware), and privacy/consent. Principal candidates raise the live-vs-batch transcription architecture and the multi-device sync conflict model.

Frequently Asked Questions

Should I use Whisper directly?

For on-device, Whisper.cpp or Distil-Whisper. For cloud, the Whisper API (or Deepgram, which is faster and supports diarization). Whisper has the best multilingual quality.

How do I handle very long recordings (8+ hours)?

Stream the audio to disk in chunks. Transcribe in batches, not the whole file at once. Most apps cap at 4–8 hours per recording for product simplicity.
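Batch transcription of a long file means splitting it into overlapping spans; the overlap lets the stitcher reconcile words cut at a chunk boundary. A sketch with illustrative defaults (10-minute chunks, 5-second overlap):

```python
def chunk_spans(total_seconds: float,
                chunk_seconds: float = 600.0,
                overlap_seconds: float = 5.0):
    """Split a recording into overlapping (start, end) spans for batch transcription."""
    spans, start = [], 0.0
    while start < total_seconds:
        end = min(start + chunk_seconds, total_seconds)
        spans.append((start, end))
        if end >= total_seconds:
            break
        start = end - overlap_seconds   # overlap so boundary words appear in both chunks
    return spans
```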

What about LLM-powered summaries?

2025+ trend: send transcript to an LLM for summary, action items, key topics. Otter, Granola, Fireflies all do this. Strong product feature but costly per-recording.
