Design a Mobile Video Editor: CapCut and iMovie Style

“Design a mobile video editor” is one of the harder mobile-system-design questions because the stack involves GPU rendering, a non-trivial document model, and the unforgiving UX of timeline editing on a small screen. CapCut, InShot, iMovie, VN, and Premiere Rush are the reference apps. The interview tests whether you can articulate the rendering pipeline and the data structures behind a timeline.

Clarify scope

  • Linear cuts only, or effects, transitions, text, and stickers too?
  • 4K and HDR support?
  • Audio editing — multi-track, effects, ducking?
  • AI effects (background removal, captions)?
  • Cloud sync for projects?

The document model

The timeline is a tree of tracks; each track is a sequence of clips with start time and duration. Clips reference assets (video file, audio, text, sticker). Effects attach to clips or to the entire timeline.

{
  duration: 95.4,
  tracks: [
    { id: "v1", type: "video", clips: [
      { id: "c1", asset: "asset-uuid", start: 0, in: 2.0, out: 12.5, effects: [...] },
      ...
    ]},
    { id: "a1", type: "audio", clips: [...] }
  ]
}

The document is the source of truth. Edits produce a new document state; the renderer derives frames from the state.
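A minimal sketch of this model in TypeScript (field names follow the JSON above; `applyTrim` is a hypothetical edit chosen to illustrate that edits return a new document rather than mutating the old one):

```typescript
// Timeline document model: tracks contain clips; clips reference assets.
interface Clip {
  id: string;
  asset: string;   // asset identifier, resolved against the photo library
  start: number;   // position on the timeline, in seconds
  in: number;      // trim-in point within the source asset
  out: number;     // trim-out point within the source asset
}

interface Track { id: string; type: "video" | "audio"; clips: Clip[]; }
interface Doc { duration: number; tracks: Track[]; }

// Edits are pure: each returns a new document, leaving the old state
// intact for undo. Here, trimming a clip's out point.
function applyTrim(doc: Doc, clipId: string, newOut: number): Doc {
  return {
    ...doc,
    tracks: doc.tracks.map(t => ({
      ...t,
      clips: t.clips.map(c => (c.id === clipId ? { ...c, out: newOut } : c)),
    })),
  };
}
```

Because edits never mutate in place, the renderer can safely keep reading the previous state mid-edit, and undo is just holding on to old documents or inverse operations.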

Asset storage

  • Original media stays in the photo library (or app sandbox) — never copy on import
  • References include the asset identifier plus a version (in case the user edits the original)
  • Thumbnails and waveforms cached separately for timeline display
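The reference-plus-version scheme can be as small as this sketch (field names are hypothetical; the identifier would be a PHAsset ID on iOS or a MediaStore URI on Android):

```typescript
// An asset reference: the media stays in the photo library; the project
// stores only an identifier plus the version it was imported against.
interface AssetRef {
  localIdentifier: string; // e.g. photo-library asset ID
  version: number;         // bumped when the user edits the original
}

// Detect that the referenced original changed since import, so the app
// can regenerate thumbnails/waveforms or warn the user.
function isStale(ref: AssetRef, currentVersion: number): boolean {
  return currentVersion !== ref.version;
}
```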

The render pipeline

  1. Decoder pulls frames from the source video at the requested time
  2. Color conversion (YUV → RGB) on GPU
  3. Effects shaders applied per clip (blur, color correction, transition)
  4. Composition (overlay tracks, alpha blending)
  5. Output: preview to screen, export to file
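Step 4's per-pixel work is standard source-over alpha blending; a scalar sketch of the math (a real pipeline does this in a fragment shader, not on the CPU):

```typescript
// Source-over ("normal") alpha blending for one RGBA pixel, components in [0, 1].
// Overlay tracks composite top-down: result = foreground over background.
type RGBA = { r: number; g: number; b: number; a: number };

function over(fg: RGBA, bg: RGBA): RGBA {
  const a = fg.a + bg.a * (1 - fg.a);
  if (a === 0) return { r: 0, g: 0, b: 0, a: 0 };
  const mix = (f: number, b: number) => (f * fg.a + b * bg.a * (1 - fg.a)) / a;
  return { r: mix(fg.r, bg.r), g: mix(fg.g, bg.g), b: mix(fg.b, bg.b), a };
}
```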

iOS: AVFoundation + Core Image / Metal. Android: MediaCodec + OpenGL ES / Vulkan. Both provide battle-tested video pipelines, but each requires care around hardware-decoder limits and color-format handling.

Preview vs export

  • Preview is interactive — must run at 30+ fps even with multiple effects; uses lower-resolution proxies and aggressive caching
  • Export is offline — full quality, slower; can take longer than the clip duration on heavy edits
  • Show progress on export; allow cancel; resume not typically supported (each export is from scratch)
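One way to keep preview interactive is to drop proxy resolution as effect load rises; a heuristic sketch (the thresholds here are invented for illustration, real apps tune them per device class):

```typescript
// Choose a preview (proxy) vertical resolution from the number of active
// effects on the visible clips. Export always renders at full resolution.
function previewHeight(fullHeight: number, activeEffects: number): number {
  if (activeEffects >= 4) return Math.min(fullHeight, 360);
  if (activeEffects >= 2) return Math.min(fullHeight, 540);
  return Math.min(fullHeight, 720); // never preview above 720p
}
```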

Timeline UI

  • Horizontal scrolling timeline with clip thumbnails
  • Pinch to zoom (frame-level vs second-level)
  • Drag to reorder, trim handles to adjust in/out
  • Snap to playhead, snap to other clips
  • Multi-track stacking with insert/append behaviors
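Snapping reduces to finding the nearest snap target within a tolerance; a sketch (tolerance is in seconds and would scale with the zoom level in a real UI):

```typescript
// Snap a dragged clip edge to the nearest target (playhead, other clip
// boundaries) if it falls within the tolerance; otherwise keep the raw time.
function snap(time: number, targets: number[], tolerance: number): number {
  let best = time;
  let bestDist = tolerance;
  for (const t of targets) {
    const d = Math.abs(t - time);
    if (d <= bestDist) { bestDist = d; best = t; }
  }
  return best;
}
```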

Timeline interactions are surprisingly complex. Use a transactional model: each gesture produces a transaction; commit on gesture end; undo/redo navigate transactions.

Undo/redo

Same pattern as document editors: each user action is a transaction with forward and inverse operations. Stack-based history; clear it on save, or persist it to offer unlimited undo across sessions.
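A minimal transaction history in this style (each transaction carries its forward and inverse operation; `Doc` stands in for the timeline document):

```typescript
// Stack-based undo/redo over transactions. Each transaction knows how to
// apply itself and how to invert itself; redo replays the forward op.
interface Transaction<Doc> {
  apply(doc: Doc): Doc;
  invert(doc: Doc): Doc;
}

class History<Doc> {
  private undoStack: Transaction<Doc>[] = [];
  private redoStack: Transaction<Doc>[] = [];

  commit(doc: Doc, tx: Transaction<Doc>): Doc {
    this.redoStack = [];        // a new edit clears the redo branch
    this.undoStack.push(tx);
    return tx.apply(doc);
  }
  undo(doc: Doc): Doc {
    const tx = this.undoStack.pop();
    if (!tx) return doc;
    this.redoStack.push(tx);
    return tx.invert(doc);
  }
  redo(doc: Doc): Doc {
    const tx = this.redoStack.pop();
    if (!tx) return doc;
    this.undoStack.push(tx);
    return tx.apply(doc);
  }
}
```

In the gesture model above, the transaction is built incrementally while the gesture runs and only reaches `commit` when the gesture ends, so a cancelled drag never enters the history.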

Audio

  • Multi-track mix with per-track volume
  • Ducking: lower music when voice is detected
  • Beat detection for music-driven cuts
  • Voice-to-text for auto-captions (on-device ML)
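Ducking is just a gain envelope on the music track driven by voice activity; a simplified per-frame sketch (the gain levels and smoothing rates are invented for illustration):

```typescript
// Per-audio-frame music gain with smoothed ducking: move toward the
// ducked level while voice is detected, and recover when it stops.
const DUCKED = 0.25;  // music gain while voice is active (~ -12 dB)
const FULL = 1.0;
const ATTACK = 0.5;   // fraction of the gap closed per frame when ducking
const RELEASE = 0.05; // slower recovery to avoid audible "pumping"

function nextGain(gain: number, voiceActive: boolean): number {
  const target = voiceActive ? DUCKED : FULL;
  const rate = voiceActive ? ATTACK : RELEASE;
  return gain + (target - gain) * rate;
}
```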

On-device ML effects

  • Background removal (segmentation): runs at preview time at lower resolution; full quality at export
  • Subject tracking for “follow the face” cropping
  • Auto-captions from speech recognition
  • Style-transfer filters (less common; expensive)

Use Core ML on iOS, ML Kit / TFLite on Android. Run on the Neural Engine / NPU when available; fall back to CPU otherwise.

Performance considerations

  • Frame caching at preview resolution
  • Effect-shader compilation on demand; warm popular shaders at app start
  • Memory budget: video frames are heavy; release behind the playhead
  • Thermal throttling: detect and reduce preview quality automatically
  • Battery: avoid running heavy ML continuously; only on-demand for effect application
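The "release behind the playhead" rule can be a simple sweep over a timestamp-keyed cache; a sketch (the keep-behind window is an invented constant):

```typescript
// Decoded-frame cache keyed by presentation timestamp (seconds). Frames
// well behind the playhead are evicted; frames ahead are kept for playback.
class FrameCache<Frame> {
  private frames = new Map<number, Frame>();
  constructor(private keepBehindSec = 0.5) {}

  put(pts: number, frame: Frame) { this.frames.set(pts, frame); }
  get(pts: number): Frame | undefined { return this.frames.get(pts); }

  // Called as the playhead advances: drop everything older than the window.
  evictBehind(playhead: number) {
    for (const pts of this.frames.keys()) {
      if (pts < playhead - this.keepBehindSec) this.frames.delete(pts);
    }
  }
  get size() { return this.frames.size; }
}
```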

Project sync (optional)

  • Document tree is small; sync via JSON to cloud
  • Asset references are device-local; cloud projects need cloud-asset upload
  • Conflict resolution if user edits on two devices: latest-write-wins or prompt
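Latest-write-wins is trivial once each saved document carries a timestamp (a revision counter is more robust against clock skew); a sketch:

```typescript
// Last-write-wins conflict resolution between a local and a remote copy
// of the same project document. `updatedAt` is epoch milliseconds.
interface ProjectDoc { id: string; updatedAt: number; body: string; }

function resolve(local: ProjectDoc, remote: ProjectDoc): ProjectDoc {
  // Ties go to remote so all devices converge on the same winner.
  return local.updatedAt > remote.updatedAt ? local : remote;
}
```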

Export formats

  • H.264 / H.265 with selectable resolution and bitrate
  • Aspect ratios: 16:9, 9:16 (vertical for TikTok/Reels), 1:1, 4:5
  • Watermarking: removable on paid tier
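Output dimensions follow from the aspect ratio and a target short edge; a sketch that also rounds to even dimensions, since video encoders generally require even width and height:

```typescript
// Compute export dimensions from an aspect ratio string ("16:9", "9:16",
// "1:1", "4:5") and the desired length of the shorter edge in pixels.
function exportSize(aspect: string, shortEdge: number): { width: number; height: number } {
  const [w, h] = aspect.split(":").map(Number);
  const even = (n: number) => Math.round(n / 2) * 2; // encoders want even dims
  if (w >= h) {
    return { width: even(shortEdge * (w / h)), height: even(shortEdge) };
  }
  return { width: even(shortEdge), height: even(shortEdge * (h / w)) };
}
```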

What separates senior from staff

Senior candidates draw the document model and the render pipeline cleanly. Staff candidates discuss the proxy strategy (lower-resolution previews), the GPU memory budget, and the ML-effect quality-vs-cost tradeoff. Principal candidates touch on collaborative editing (CRDT for the document) and the cloud-render fallback for heavy projects.

Frequently Asked Questions

Should I use FFmpeg?

For native iOS/Android apps, prefer the platform pipelines (AVFoundation, MediaCodec). FFmpeg adds substantial binary size and carries LGPL (or GPL, depending on the build) licensing obligations. For cross-platform consistency or unusual codecs, it is sometimes the right call.

How do I handle very long videos (hours)?

Stream-decode rather than loading whole files; use I-frame-only previews at coarse zoom levels; and encourage the user to split very long footage into separate projects.
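I-frame-only scrubbing comes down to a lookup: given the sorted keyframe timestamps from the container index, a binary search finds the latest keyframe at or before the requested time (a sketch):

```typescript
// Find the latest I-frame timestamp <= t, so coarse-zoom scrubbing can
// decode a single keyframe instead of a whole GOP. `iframes` is sorted
// ascending; times before the first keyframe fall back to it.
function keyframeBefore(iframes: number[], t: number): number {
  let lo = 0, hi = iframes.length - 1, best = iframes[0];
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    if (iframes[mid] <= t) { best = iframes[mid]; lo = mid + 1; }
    else hi = mid - 1;
  }
  return best;
}
```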

What about cloud rendering?

For heavy effects or low-end devices, offload export to a cloud renderer. Latency must be acceptable to the user (minutes, not hours). Most consumer apps stay on-device.
