“Design a mobile video editor” is one of the harder mobile-system-design questions because the stack involves GPU rendering, a non-trivial document model, and the unforgiving UX of timeline editing on a small screen. CapCut, InShot, iMovie, VN, and Premiere Rush are the reference apps. The interview tests whether you can articulate the rendering pipeline and the data structures behind a timeline.
Clarify scope
- Linear cuts only, or also effects, transitions, text, and stickers?
- 4K and HDR support?
- Audio editing — multi-track, effects, ducking?
- AI effects (background removal, captions)?
- Cloud sync for projects?
The document model
The timeline is a shallow tree: a list of tracks, each an ordered sequence of clips. A clip references an asset (video file, audio, text, sticker) and carries a start time on the timeline plus in/out trim points into the source. Effects attach to individual clips or to the entire timeline.
{
  duration: 95.4,
  tracks: [
    { id: "v1", type: "video", clips: [
      { id: "c1", asset: "asset-uuid", start: 0, in: 2.0, out: 12.5, effects: [...] },
      ...
    ]},
    { id: "a1", type: "audio", clips: [...] }
  ]
}
The document is the source of truth. Edits produce a new document state; the renderer derives frames from the state.
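A minimal Swift sketch of this model (names are illustrative, not from any real app). Value types make the "edits produce a new state" rule natural: every edit is a pure function from document to document.
import Foundation
// Hypothetical value-type document model; structs give cheap
// snapshots, so every edit yields a new state.
struct Clip {
    var id: String
    var assetID: String   // reference into the photo library / sandbox
    var start: Double     // position on the timeline, in seconds
    var inPoint: Double   // trim-in point within the source asset
    var outPoint: Double  // trim-out point within the source asset
    var effects: [String] // effect identifiers, applied in order
}
struct Track {
    enum Kind { case video, audio, text, sticker }
    var id: String
    var kind: Kind
    var clips: [Clip]     // ordered by start time
}
struct Document {
    var duration: Double
    var tracks: [Track]
}
// Example edit: trimming a clip returns a new Document;
// the renderer only ever reads the latest state.
func trim(_ doc: Document, trackID: String, clipID: String, newOut: Double) -> Document {
    var next = doc
    for t in next.tracks.indices where next.tracks[t].id == trackID {
        for c in next.tracks[t].clips.indices where next.tracks[t].clips[c].id == clipID {
            next.tracks[t].clips[c].outPoint = newOut
        }
    }
    return next
}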
Asset storage
- Original media stays in the photo library (or app sandbox) — never copy on import
- References include the asset identifier plus a version, in case the user edits the original (resolved as sketched after this list)
- Thumbnails and waveforms cached separately for timeline display
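On iOS, resolving a stored reference back to the photo library might look like this. A sketch using PhotoKit; treating modificationDate as the version stamp is an assumption, not the only option.
import Photos
// Resolve a stored reference to a PHAsset at project-load time.
func resolveAsset(identifier: String, expectedVersion: Date?) -> PHAsset? {
    let result = PHAsset.fetchAssets(withLocalIdentifiers: [identifier], options: nil)
    guard let asset = result.firstObject else { return nil }  // deleted from the library
    if let expected = expectedVersion, asset.modificationDate != expected {
        print("asset \(identifier) changed since import; refresh thumbnails/waveforms")
    }
    return asset
}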
The render pipeline
- Decoder pulls frames from the source video at the requested time
- Color conversion (YUV → RGB) on GPU
- Effects shaders applied per clip (blur, color correction, transition)
- Composition (overlay tracks, alpha blending)
- Output: preview to screen, export to file
iOS: AVFoundation + Core Image / Metal. Android: MediaCodec + OpenGL ES / Vulkan. Both platforms have battle-tested paths for video pipelines, but each has sharp edges: codec availability, color-space handling, and device-specific quirks.
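As a concrete iOS example, the per-clip effects stage can be hooked in with Core Image on top of AVFoundation. A sketch for the single-asset case; the Gaussian blur stands in for any clip effect, and sourceURL is a hypothetical input.
import AVFoundation
import CoreImage
let sourceURL = URL(fileURLWithPath: "/path/to/clip.mov")  // hypothetical input
let asset = AVURLAsset(url: sourceURL)
// AVFoundation decodes each frame and hands it to Core Image,
// which runs the filter chain on the GPU.
let effected = AVMutableVideoComposition(asset: asset) { request in
    let frame = request.sourceImage                     // decoded frame, already RGB
    let blurred = frame.applyingGaussianBlur(sigma: 6)  // stand-in for a per-clip effect
        .cropped(to: frame.extent)                      // blur grows the extent; clamp it
    request.finish(with: blurred, context: nil)         // nil = default GPU-backed context
}
// Attach `effected` to an AVPlayerItem for preview or an export session for output.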
Preview vs export
- Preview is interactive — must run at 30+ fps even with multiple effects; uses lower-resolution proxies and aggressive caching
- Export is offline — full quality, slower; can take longer than the clip duration on heavy edits
- Show progress on export and allow cancel; resume is not typically supported, since each export starts from scratch (see the sketch after this list)
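A sketch of the iOS export path with progress and cancellation; the preset and file type are illustrative choices.
import AVFoundation
func export(asset: AVAsset, videoComposition: AVVideoComposition?, to url: URL) {
    guard let session = AVAssetExportSession(asset: asset,
                                             presetName: AVAssetExportPresetHighestQuality)
    else { return }
    session.outputURL = url
    session.outputFileType = .mp4
    session.videoComposition = videoComposition
    session.exportAsynchronously {
        switch session.status {
        case .completed: print("exported to \(url)")
        case .cancelled: print("cancelled")  // triggered by session.cancelExport()
        default: print("failed: \(String(describing: session.error))")
        }
    }
    // Poll session.progress (0.0...1.0) on a timer to drive the progress bar.
}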
Timeline UI
- Horizontal scrolling timeline with clip thumbnails
- Pinch to zoom (frame-level vs second-level)
- Drag to reorder, trim handles to adjust in/out
- Snap to playhead, snap to other clips
- Multi-track stacking with insert/append behaviors
Timeline interactions are surprisingly complex. Use a transactional model: each gesture produces a transaction; commit on gesture end; undo/redo navigate transactions.
Undo/redo
Same pattern as document editors: each user action is a transaction with forward and inverse operations. History is a pair of stacks; clear it on save, or persist it for unlimited undo across sessions.
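A sketch of the transaction-plus-stacks pattern, reusing the Document type from the model sketch above; names are illustrative.
// One transaction per gesture; undo/redo move it between two stacks.
struct Transaction {
    let apply: (Document) -> Document   // forward edit
    let revert: (Document) -> Document  // inverse edit, captured at commit time
}
final class History {
    private(set) var document: Document
    private var undoStack: [Transaction] = []
    private var redoStack: [Transaction] = []
    init(document: Document) { self.document = document }
    func commit(_ tx: Transaction) {    // called on gesture end
        document = tx.apply(document)
        undoStack.append(tx)
        redoStack.removeAll()           // a new edit invalidates the redo branch
    }
    func undo() {
        guard let tx = undoStack.popLast() else { return }
        document = tx.revert(document)
        redoStack.append(tx)
    }
    func redo() {
        guard let tx = redoStack.popLast() else { return }
        document = tx.apply(document)
        undoStack.append(tx)
    }
}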
Audio
- Multi-track mix with per-track volume
- Ducking: lower the music when voice is detected (sketched after this list)
- Beat detection for music-driven cuts
- Voice-to-text for auto-captions (on-device ML)
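Ducking on iOS can be expressed as volume ramps on the music track's audio mix. A sketch, assuming a separate speech-detection pass has already produced the voice time ranges:
import AVFoundation
func duckingMix(musicTrack: AVAssetTrack, voiceRanges: [CMTimeRange]) -> AVAudioMix {
    let params = AVMutableAudioMixInputParameters(track: musicTrack)
    let fade = CMTime(seconds: 0.3, preferredTimescale: 600)
    for range in voiceRanges {
        // Ramp the music down just before speech, back up just after.
        params.setVolumeRamp(fromStartVolume: 1.0, toEndVolume: 0.3,
                             timeRange: CMTimeRange(start: range.start - fade, duration: fade))
        params.setVolumeRamp(fromStartVolume: 0.3, toEndVolume: 1.0,
                             timeRange: CMTimeRange(start: range.end, duration: fade))
    }
    let mix = AVMutableAudioMix()
    mix.inputParameters = [params]
    return mix
}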
On-device ML effects
- Background removal (segmentation): runs at preview time at lower resolution, full quality at export (see the Vision sketch below)
- Subject tracking for “follow the face” cropping
- Auto-captions from speech recognition
- Style-transfer filters (less common; expensive)
Use Core ML on iOS, ML Kit / TFLite on Android. Run on the Neural Engine / NPU when available; fall back to the CPU otherwise.
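For background removal on iOS, Vision's person-segmentation request exposes exactly this preview/export quality split. A sketch:
import Vision
func personMatte(for pixelBuffer: CVPixelBuffer, forExport: Bool) throws -> CVPixelBuffer? {
    let request = VNGeneratePersonSegmentationRequest()
    request.qualityLevel = forExport ? .accurate : .fast  // .fast keeps preview interactive
    let handler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer, options: [:])
    try handler.perform([request])
    return request.results?.first?.pixelBuffer           // single-channel alpha matte
}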
Performance considerations
- Frame caching at preview resolution
- Effect-shader compilation on demand; warm popular shaders at app start
- Memory budget: video frames are heavy; release behind the playhead
- Thermal throttling: detect it and reduce preview quality automatically (sketched after this list)
- Battery: avoid running heavy ML continuously; only on-demand for effect application
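On iOS, the thermal signal comes from ProcessInfo. A sketch of a monitor the preview renderer could subscribe to; the reaction policy in the comment is an assumption, not a platform rule.
import Foundation
final class ThermalMonitor {
    var onChange: ((ProcessInfo.ThermalState) -> Void)?
    private var token: NSObjectProtocol?
    init() {
        token = NotificationCenter.default.addObserver(
            forName: ProcessInfo.thermalStateDidChangeNotification,
            object: nil, queue: .main
        ) { [weak self] _ in
            // On .serious/.critical: halve the proxy resolution, pause live ML effects.
            self?.onChange?(ProcessInfo.processInfo.thermalState)
        }
    }
}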
Project sync (optional)
- Document tree is small; sync via JSON to cloud
- Asset references are device-local; cloud projects need cloud-asset upload
- Conflict resolution when the user edits on two devices: latest-write-wins or prompt the user (sketched below)
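A minimal latest-write-wins sketch; field names are illustrative, and a production app would at least surface the conflict to the user.
import Foundation
struct ProjectEnvelope: Codable {
    let projectID: String
    let updatedAt: Date  // stamped by the editing device on every save
    let document: Data   // the serialized JSON document tree
}
func resolve(local: ProjectEnvelope, remote: ProjectEnvelope) -> ProjectEnvelope {
    // Latest-write-wins silently keeps the newer copy; a gentler UX
    // prompts the user or keeps both versions.
    return local.updatedAt >= remote.updatedAt ? local : remote
}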
Export formats
- H.264 / H.265 with selectable resolution and bitrate
- Aspect ratios: 16:9, 9:16 (vertical for TikTok/Reels), 1:1, 4:5 (see the sketch after this list)
- Watermarking: removable on paid tier
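On iOS, aspect ratio reduces to the render size of the video composition, and the codec to the export preset. A sketch:
import AVFoundation
let vertical = AVMutableVideoComposition()
vertical.renderSize = CGSize(width: 1080, height: 1920)   // 9:16 for TikTok/Reels
vertical.frameDuration = CMTime(value: 1, timescale: 30)  // 30 fps output
// For H.265, create the export session with AVAssetExportPresetHEVCHighestQuality.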
What separates senior from staff
Senior candidates draw the document model and the render pipeline cleanly. Staff candidates discuss the proxy strategy (lower-resolution previews), the GPU memory budget, and the ML-effect quality-vs-cost tradeoff. Principal candidates touch on collaborative editing (CRDT for the document) and the cloud-render fallback for heavy projects.
Frequently Asked Questions
Should I use FFmpeg?
For native iOS/Android apps, prefer the platform pipelines (AVFoundation, MediaCodec). FFmpeg is heavy, and its licensing needs attention (LGPL by default, GPL when built with components like x264). For cross-platform exports it is sometimes the right call.
How do I handle very long videos (hours)?
Stream-decode rather than loading whole files; use I-frame-only previews at coarse zoom levels; encourage the user to split very long footage into multiple projects.
What about cloud rendering?
For heavy effects or low-end devices, offload export to a cloud renderer. Latency must be acceptable to the user (minutes, not hours). Most consumer apps stay on-device.