“Design a voice memos app” looks like a simple recorder until you list the requirements: high-quality audio capture, on-device transcription, cloud sync, sharing, and the polish bar that iOS Voice Memos sets for everyone. Otter.ai, Apple Voice Memos, and Google’s Pixel Recorder are the usual references. The interview tests audio engineering, ML transcription, and the product polish that audio uniquely demands.
Clarify scope
- Quick voice notes or full transcription product?
- Single-speaker, meeting (multi-speaker), or both?
- Cloud sync across devices?
- Sharing with non-users?
- Live transcription or post-recording?
Audio capture
- iOS: AVAudioRecorder or AVAudioEngine for more control
- Android: MediaRecorder or AudioRecord
- Format: AAC (small, common) or Apple Lossless (high quality)
- Sample rate: 16 kHz for speech, 44.1 kHz for general
- Mono is fine for speech
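The format and sample-rate choices above translate directly into storage cost. A quick back-of-the-envelope sketch (the bitrates are illustrative, not a recommendation):

```python
def estimated_size_mb(duration_s: float, bitrate_kbps: float) -> float:
    """Rough on-disk size of a compressed recording, in megabytes."""
    return duration_s * bitrate_kbps * 1000 / 8 / 1_000_000

# 1 hour of mono speech at 32 kbps AAC:
print(round(estimated_size_mb(3600, 32), 1))  # 14.4
```

The same arithmetic is why lossless formats are a deliberate opt-in: uncompressed 44.1 kHz 16-bit stereo runs about two orders of magnitude larger.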
Background recording
- iOS: enable “Audio” background mode; the app continues recording when locked
- Android: foreground service with a persistent notification; required to keep recording while the app is backgrounded or the screen is locked
- User must see clear indication of active recording (red dot, persistent banner)
Audio levels
- Real-time audio meter during recording
- Visualization: VU meter, waveform, frequency spectrum
- Useful for confirming microphone is working and levels are good
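A live meter is usually just the RMS level of each incoming PCM buffer converted to dB full scale. A minimal sketch, assuming float samples in [-1.0, 1.0]:

```python
import math

def level_dbfs(samples: list[float]) -> float:
    """RMS level of one PCM buffer, in dB relative to full scale (0 dBFS)."""
    if not samples:
        return float("-inf")
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20 * math.log10(rms) if rms > 0 else float("-inf")

# A full-scale sine wave measures about -3 dBFS:
sine = [math.sin(2 * math.pi * i / 64) for i in range(64)]
print(round(level_dbfs(sine), 1))  # -3.0
```

Feed this per-buffer value into the UI meter; silence reads near -inf, which is exactly the “is the mic working?” signal users need.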
Transcription
On-device
- iOS: Speech framework (SFSpeechRecognizer) — supports many languages, on-device since iOS 13
- Android: SpeechRecognizer with on-device variant
- Or bundle Whisper.cpp for higher quality
- Pros: privacy, no network
- Cons: model size, slower than cloud
Cloud
- Whisper API, Deepgram, AssemblyAI, Google Cloud Speech-to-Text
- Higher quality, especially for multilingual / accented audio
- Cost-per-minute billing
- Latency: streaming for live, batch for post-recording
Hybrid
Most modern apps (as of 2026) default to on-device transcription, with a cloud “high-quality” option as an explicit opt-in.
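The hybrid policy is a small routing decision per recording. A sketch of one plausible policy (the thresholds are assumptions for illustration):

```python
def pick_engine(duration_s: float, online: bool, cloud_opt_in: bool) -> str:
    """Route a finished recording to a transcription engine.

    Policy: on-device by default; cloud only when the user opted in,
    the network is up, and the clip is long enough that cloud
    quality/latency pays off.
    """
    if cloud_opt_in and online and duration_s > 60:
        return "cloud"
    return "on-device"

print(pick_engine(300, online=True, cloud_opt_in=True))   # cloud
print(pick_engine(300, online=False, cloud_opt_in=True))  # on-device
```

Keeping the decision in one function makes it easy to add later inputs such as battery level or per-minute cloud budget.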
Speaker diarization
Multi-speaker transcription needs separating voices:
- Pyannote-style on-device models
- Cloud APIs (Deepgram, AssemblyAI) include diarization
- UX: identify “Speaker 1”, “Speaker 2”; user can rename
- Otter.ai is the reference for this UX
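Diarization output is typically a list of time-stamped speaker turns; attaching labels to transcript words is an interval lookup, with a user-editable rename map on top. A simplified sketch (the data shapes are assumptions):

```python
def label_words(words, turns, names=None):
    """Attach a speaker label to each (word, time) pair.

    `turns` is a list of (start_s, end_s, "Speaker N") from diarization;
    `names` maps generic labels to user-supplied names.
    """
    names = names or {}
    out = []
    for word, t in words:
        label = next((spk for s, e, spk in turns if s <= t < e), "Unknown")
        out.append((word, names.get(label, label)))
    return out

words = [("hello", 0.5), ("hi", 2.1)]
turns = [(0.0, 1.5, "Speaker 1"), (1.5, 3.0, "Speaker 2")]
print(label_words(words, turns, {"Speaker 1": "Alice"}))
# [('hello', 'Alice'), ('hi', 'Speaker 2')]
```

Storing the rename map separately from the diarization output means a rename is instant and never requires re-running the model.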
Search
- Once transcribed, the audio becomes searchable text
- Index transcripts for full-text search
- Search jumps to the timestamp in the audio
- Highlights matched phrases in the transcript
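Jump-to-timestamp search falls out naturally if the index stores where each word was spoken, not just which recording contains it. A minimal inverted-index sketch:

```python
from collections import defaultdict

def build_index(transcripts):
    """Map each word to the (recording_id, timestamp) places it was spoken.

    `transcripts` is {recording_id: [(word, start_s), ...]}.
    """
    index = defaultdict(list)
    for rec_id, words in transcripts.items():
        for word, t in words:
            index[word.lower()].append((rec_id, t))
    return index

index = build_index({"memo-1": [("Budget", 12.4), ("review", 13.0)],
                     "memo-2": [("budget", 71.2)]})
print(index["budget"])  # [('memo-1', 12.4), ('memo-2', 71.2)]
```

In production this lives in SQLite FTS or an equivalent, but the shape is the same: a hit carries a timestamp, so tapping a result seeks the player directly.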
Cloud sync
- Audio file + transcript synced
- Audio uploaded once; transcript can be regenerated cloud-side
- Conflicts are rare: recordings are append-only and single-author, so last-writer-wins on metadata is enough
- iCloud / Google Drive / app-specific cloud
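Because the audio itself never changes after recording, only metadata (title, folder) can diverge across devices, and a timestamp-based merge suffices. A sketch, assuming an `edited_at` field per record:

```python
def merge_metadata(local: dict, remote: dict) -> dict:
    """Last-writer-wins merge for per-recording metadata.

    Audio is append-only and single-author, so only fields like the
    title or folder can conflict; the copy edited later wins.
    """
    return local if local["edited_at"] >= remote["edited_at"] else remote

local  = {"title": "Standup", "edited_at": 1_700_000_200}
remote = {"title": "Standup notes", "edited_at": 1_700_000_900}
print(merge_metadata(local, remote)["title"])  # Standup notes
```

Contrast this with collaborative documents, where you would need operational transforms or CRDTs; the single-author model is the reason voice-memo sync stays simple.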
Sharing
- Share audio file via system share sheet
- Or share a link to a transcript page
- Time-stamped quote: “Listen at 12:45”
- Privacy: link with optional password / expiration
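The time-stamped quote is just a seconds offset carried in the share URL plus a human-readable label. A sketch (the URL scheme and `?t=` parameter are assumptions, not a real service):

```python
def quote_link(base_url: str, seconds: int) -> tuple[str, str]:
    """Build a time-stamped share link and its human-readable label."""
    m, s = divmod(seconds, 60)
    return f"{base_url}?t={seconds}", f"Listen at {m}:{s:02d}"

url, label = quote_link("https://example.com/m/abc123", 765)
print(label)  # Listen at 12:45
print(url)    # https://example.com/m/abc123?t=765
```

The transcript page reads the offset and seeks the embedded player on load, the same pattern YouTube uses for `?t=` links.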
Editing
- Trim start / end
- Cut middle sections
- Edit transcript and re-export
- Apple Voice Memos: simple trim-and-replace; advanced editing not supported
- Otter / Descript: word-level edits affect audio (cuts the corresponding audio chunks)
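Word-level audio editing works because transcription yields per-word timestamps: deleting words from the transcript becomes computing which audio spans to keep. A simplified sketch of that mapping:

```python
def keep_ranges(words, deleted_indices):
    """Turn word-level deletions into audio spans to keep.

    `words` is [(word, start_s, end_s), ...] in order; deleting a word
    removes its audio span, and adjacent kept spans are merged.
    """
    deleted = set(deleted_indices)
    ranges = []
    for i, (_, start, end) in enumerate(words):
        if i in deleted:
            continue
        if ranges and abs(ranges[-1][1] - start) < 1e-6:
            ranges[-1][1] = end          # extend the previous span
        else:
            ranges.append([start, end])  # start a new span
    return [tuple(r) for r in ranges]

words = [("um", 0.0, 0.4), ("so", 0.4, 0.7), ("hello", 0.7, 1.2)]
print(keep_ranges(words, {0}))  # [(0.4, 1.2)]
```

The export step then concatenates those spans from the original file; real editors also add short crossfades at each cut to avoid clicks.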
Storage and management
- Per-recording metadata (date, location optionally, duration, title)
- Folder organization
- Search across recordings
- Delete: confirm; cleared from cloud too
- Storage quota indicator
Privacy
- Microphone permission with clear rationale
- Recording indicator visible at all times
- Privacy nutrition labels: what is uploaded, what stays on device
- Some jurisdictions require all-party consent for recordings
Performance considerations
- Audio file size: 1 hour at 32 kbps AAC ~14 MB; manageable
- Transcript generation: 1 minute audio → 5–30 sec on-device, faster on cloud
- Battery: continuous recording is moderate drain; transcription on-device is heavier
- Memory: stream-decode long recordings; do not load whole file
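Stream-decoding boils down to reading and processing fixed-size chunks rather than pulling the whole file into memory. A minimal sketch of the pattern:

```python
import os
import tempfile

def process_in_chunks(path, handle_chunk, chunk_bytes=1 << 20):
    """Stream a recording through `handle_chunk` without loading it whole."""
    with open(path, "rb") as f:
        while chunk := f.read(chunk_bytes):
            handle_chunk(chunk)

# Demo: walk a 10,000-byte "recording" 4 KiB at a time.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\x00" * 10_000)
seen = []
process_in_chunks(tmp.name, seen.append, chunk_bytes=4096)
os.unlink(tmp.name)
print(len(seen), sum(map(len, seen)))  # 3 10000
```

The same shape applies to decoding compressed audio: hand each chunk to the decoder and render waveform tiles incrementally, so an 8-hour file costs one buffer of memory, not gigabytes.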
Watch / iOS Live Activity
- Watch app: start/stop recording (Apple Voice Memos has this)
- Live Activity: visible recording indicator on lock screen
- Background continues seamlessly
Edge cases interviewers love
- Phone call interrupts recording — pause cleanly, resume after
- App killed during long recording — restore from where it left off
- Storage runs out — graceful “stop and save what you have”
- Transcription fails (model timeout) — keep audio, retry transcription
- Two-party consent state — surface a clear notice if location requires
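Several of these edge cases reduce to one small state machine around the recorder, with audio flushed to disk as recoverable segments at every transition. A sketch of that model (the states and segment counting are assumptions for illustration):

```python
class Recorder:
    """Minimal interruption-handling state machine.

    A phone call maps to interrupt()/resume(); storage-full and user
    stop both map to stop(), which always saves what was captured.
    """
    def __init__(self):
        self.state = "idle"
        self.segments = 0   # audio flushed to disk so far, recoverable on crash

    def start(self):
        assert self.state == "idle"
        self.state = "recording"

    def interrupt(self):            # e.g. incoming phone call
        if self.state == "recording":
            self.segments += 1      # flush the buffer before pausing
            self.state = "paused"

    def resume(self):
        if self.state == "paused":
            self.state = "recording"

    def stop(self):                 # user stop OR storage exhausted
        if self.state == "recording":
            self.segments += 1
        self.state = "stopped"

r = Recorder()
r.start(); r.interrupt(); r.resume(); r.stop()
print(r.state, r.segments)  # stopped 2
```

Because each transition flushes a segment, an app kill mid-recording loses at most the unflushed buffer: on next launch, the segments on disk are stitched back into one recording.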
What separates senior from staff
Senior candidates handle audio capture and basic transcription. Staff candidates discuss on-device vs cloud transcription, diarization, the editing pipeline (waveform-aware), and privacy/consent. Principal candidates raise the live-vs-batch transcription architecture and the multi-device sync conflict model.
Frequently Asked Questions
Should I use Whisper directly?
For on-device, Whisper.cpp or Distil-Whisper. For cloud, the Whisper API (or Deepgram, which is faster and supports diarization). Whisper has the best multilingual quality.
How do I handle very long recordings (8+ hours)?
Stream the audio to disk in chunks. Transcribe in batches, not the whole file at once. Most apps cap at 4–8 hours per recording for product simplicity.
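Batching means cutting the recording into transcription windows, with a small overlap so the stitcher can reconcile words split at a boundary. A sketch with assumed window and overlap sizes:

```python
def batch_windows(total_s: float, window_s: float = 600, overlap_s: float = 5):
    """Split a long recording into overlapping transcription windows.

    The overlap lets the stitching step reconcile words that fall
    on a window boundary.
    """
    windows, start = [], 0.0
    while start < total_s:
        windows.append((start, min(start + window_s, total_s)))
        start += window_s - overlap_s
    return windows

# An 8-hour recording in 10-minute windows with 5 s overlap:
w = batch_windows(8 * 3600)
print(len(w))  # 49
```

Each window is an independent job, so transcription parallelizes and a single failed window can be retried without redoing the rest.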
What about LLM-powered summaries?
2025+ trend: send transcript to an LLM for summary, action items, key topics. Otter, Granola, Fireflies all do this. Strong product feature but costly per-recording.