Design a Mobile Video Editor: CapCut and iMovie Style

“Design a mobile video editor” is one of the harder mobile-system-design questions because the stack involves GPU rendering, a non-trivial document model, and the unforgiving UX of timeline editing on a small screen. CapCut, InShot, iMovie, VN, and Premiere Rush are the reference apps. The interview tests whether you can articulate the rendering pipeline and the data structures behind a timeline.

Clarify scope

  • Linear cuts only, or effects, transitions, text, and stickers too?
  • 4K and HDR support?
  • Audio editing — multi-track, effects, ducking?
  • AI effects (background removal, captions)?
  • Cloud sync for projects?

The document model

The timeline is a tree of tracks; each track is a sequence of clips with start time and duration. Clips reference assets (video file, audio, text, sticker). Effects attach to clips or to the entire timeline.

{
  duration: 95.4,
  tracks: [
    { id: "v1", type: "video", clips: [
      { id: "c1", asset: "asset-uuid", start: 0, in: 2.0, out: 12.5, effects: [...] },
      ...
    ]},
    { id: "a1", type: "audio", clips: [...] }
  ]
}

The document is the source of truth. Edits produce a new document state; the renderer derives frames from the state.
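A minimal sketch of this model in TypeScript (field names follow the JSON above; `applyTrim` is a hypothetical edit chosen to illustrate that edits return a new document rather than mutating the old one):

```typescript
// Timeline document model: tracks contain clips; clips reference assets.
interface Clip {
  id: string;
  asset: string;   // asset identifier, resolved against the photo library
  start: number;   // position on the timeline, in seconds
  in: number;      // trim-in point within the source asset
  out: number;     // trim-out point within the source asset
}

interface Track { id: string; type: "video" | "audio"; clips: Clip[]; }
interface Doc { duration: number; tracks: Track[]; }

// Edits are pure: each returns a new document, leaving the old state
// intact for undo. Here, trimming a clip's out point.
function applyTrim(doc: Doc, clipId: string, newOut: number): Doc {
  return {
    ...doc,
    tracks: doc.tracks.map(t => ({
      ...t,
      clips: t.clips.map(c => (c.id === clipId ? { ...c, out: newOut } : c)),
    })),
  };
}
```

Because edits never mutate in place, the renderer can safely keep reading the previous state mid-edit, and undo is just holding on to old documents or inverse operations.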

Asset storage

  • Original media stays in the photo library (or app sandbox) — never copy on import
  • References include the asset identifier plus a version (in case the user edits the original)
  • Thumbnails and waveforms cached separately for timeline display
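The reference-plus-version scheme can be as small as this sketch (field names are hypothetical; the identifier would be a PHAsset ID on iOS or a MediaStore URI on Android):

```typescript
// An asset reference: the media stays in the photo library; the project
// stores only an identifier plus the version it was imported against.
interface AssetRef {
  localIdentifier: string; // e.g. photo-library asset ID
  version: number;         // bumped when the user edits the original
}

// Detect that the referenced original changed since import, so the app
// can regenerate thumbnails/waveforms or warn the user.
function isStale(ref: AssetRef, currentVersion: number): boolean {
  return currentVersion !== ref.version;
}
```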

The render pipeline

  1. Decoder pulls frames from the source video at the requested time
  2. Color conversion (YUV → RGB) on GPU
  3. Effects shaders applied per clip (blur, color correction, transition)
  4. Composition (overlay tracks, alpha blending)
  5. Output: preview to screen, export to file
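Step 4's per-pixel work is standard source-over alpha blending; a scalar sketch of the math (a real pipeline does this in a fragment shader, not on the CPU):

```typescript
// Source-over ("normal") alpha blending for one RGBA pixel, components in [0, 1].
// Overlay tracks composite top-down: result = foreground over background.
type RGBA = { r: number; g: number; b: number; a: number };

function over(fg: RGBA, bg: RGBA): RGBA {
  const a = fg.a + bg.a * (1 - fg.a);
  if (a === 0) return { r: 0, g: 0, b: 0, a: 0 };
  const mix = (f: number, b: number) => (f * fg.a + b * bg.a * (1 - fg.a)) / a;
  return { r: mix(fg.r, bg.r), g: mix(fg.g, bg.g), b: mix(fg.b, bg.b), a };
}
```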

iOS: AVFoundation + Core Image / Metal. Android: MediaCodec + OpenGL ES / Vulkan. Both provide battle-tested video pipelines, but each requires care around hardware-decoder limits and color-format handling.

Preview vs export

  • Preview is interactive — must run at 30+ fps even with multiple effects; uses lower-resolution proxies and aggressive caching
  • Export is offline — full quality, slower; can take longer than the clip duration on heavy edits
  • Show progress on export; allow cancel; resume not typically supported (each export is from scratch)
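One way to keep preview interactive is to drop proxy resolution as effect load rises; a heuristic sketch (the thresholds here are invented for illustration, real apps tune them per device class):

```typescript
// Choose a preview (proxy) vertical resolution from the number of active
// effects on the visible clips. Export always renders at full resolution.
function previewHeight(fullHeight: number, activeEffects: number): number {
  if (activeEffects >= 4) return Math.min(fullHeight, 360);
  if (activeEffects >= 2) return Math.min(fullHeight, 540);
  return Math.min(fullHeight, 720); // never preview above 720p
}
```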

Timeline UI

  • Horizontal scrolling timeline with clip thumbnails
  • Pinch to zoom (frame-level vs second-level)
  • Drag to reorder, trim handles to adjust in/out
  • Snap to playhead, snap to other clips
  • Multi-track stacking with insert/append behaviors
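Snapping reduces to finding the nearest snap target within a tolerance; a sketch (tolerance is in seconds and would scale with the zoom level in a real UI):

```typescript
// Snap a dragged clip edge to the nearest target (playhead, other clip
// boundaries) if it falls within the tolerance; otherwise keep the raw time.
function snap(time: number, targets: number[], tolerance: number): number {
  let best = time;
  let bestDist = tolerance;
  for (const t of targets) {
    const d = Math.abs(t - time);
    if (d <= bestDist) { bestDist = d; best = t; }
  }
  return best;
}
```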

Timeline interactions are surprisingly complex. Use a transactional model: each gesture produces a transaction; commit on gesture end; undo/redo navigate transactions.

Undo/redo

Same pattern as document editors: each user action is a transaction with forward and inverse operations. Stack-based history; clear it on save, or persist it to offer unlimited undo across sessions.
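A minimal transaction history in this style (each transaction carries its forward and inverse operation; `Doc` stands in for the timeline document):

```typescript
// Stack-based undo/redo over transactions. Each transaction knows how to
// apply itself and how to invert itself; redo replays the forward op.
interface Transaction<Doc> {
  apply(doc: Doc): Doc;
  invert(doc: Doc): Doc;
}

class History<Doc> {
  private undoStack: Transaction<Doc>[] = [];
  private redoStack: Transaction<Doc>[] = [];

  commit(doc: Doc, tx: Transaction<Doc>): Doc {
    this.redoStack = [];        // a new edit clears the redo branch
    this.undoStack.push(tx);
    return tx.apply(doc);
  }
  undo(doc: Doc): Doc {
    const tx = this.undoStack.pop();
    if (!tx) return doc;
    this.redoStack.push(tx);
    return tx.invert(doc);
  }
  redo(doc: Doc): Doc {
    const tx = this.redoStack.pop();
    if (!tx) return doc;
    this.undoStack.push(tx);
    return tx.apply(doc);
  }
}
```

In the gesture model above, the transaction is built incrementally while the gesture runs and only reaches `commit` when the gesture ends, so a cancelled drag never enters the history.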

Audio

  • Multi-track mix with per-track volume
  • Ducking: lower music when voice is detected
  • Beat detection for music-driven cuts
  • Voice-to-text for auto-captions (on-device ML)
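Ducking is just a gain envelope on the music track driven by voice activity; a simplified per-frame sketch (the gain levels and smoothing rates are invented for illustration):

```typescript
// Per-audio-frame music gain with smoothed ducking: move toward the
// ducked level while voice is detected, and recover when it stops.
const DUCKED = 0.25;  // music gain while voice is active (~ -12 dB)
const FULL = 1.0;
const ATTACK = 0.5;   // fraction of the gap closed per frame when ducking
const RELEASE = 0.05; // slower recovery to avoid audible "pumping"

function nextGain(gain: number, voiceActive: boolean): number {
  const target = voiceActive ? DUCKED : FULL;
  const rate = voiceActive ? ATTACK : RELEASE;
  return gain + (target - gain) * rate;
}
```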

On-device ML effects

  • Background removal (segmentation): runs at preview time at lower resolution; full quality at export
  • Subject tracking for “follow the face” cropping
  • Auto-captions from speech recognition
  • Style-transfer filters (less common; expensive)

Use Core ML on iOS, ML Kit / TFLite on Android. Run on the Neural Engine / NPU when available; fall back to CPU otherwise.

Performance considerations

  • Frame caching at preview resolution
  • Effect-shader compilation on demand; warm popular shaders at app start
  • Memory budget: video frames are heavy; release behind the playhead
  • Thermal throttling: detect and reduce preview quality automatically
  • Battery: avoid running heavy ML continuously; only on-demand for effect application
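The "release behind the playhead" rule can be a simple sweep over a timestamp-keyed cache; a sketch (the keep-behind window is an invented constant):

```typescript
// Decoded-frame cache keyed by presentation timestamp (seconds). Frames
// well behind the playhead are evicted; frames ahead are kept for playback.
class FrameCache<Frame> {
  private frames = new Map<number, Frame>();
  constructor(private keepBehindSec = 0.5) {}

  put(pts: number, frame: Frame) { this.frames.set(pts, frame); }
  get(pts: number): Frame | undefined { return this.frames.get(pts); }

  // Called as the playhead advances: drop everything older than the window.
  evictBehind(playhead: number) {
    for (const pts of this.frames.keys()) {
      if (pts < playhead - this.keepBehindSec) this.frames.delete(pts);
    }
  }
  get size() { return this.frames.size; }
}
```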

Project sync (optional)

  • Document tree is small; sync via JSON to cloud
  • Asset references are device-local; cloud projects need cloud-asset upload
  • Conflict resolution if user edits on two devices: latest-write-wins or prompt
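Latest-write-wins is trivial once each saved document carries a timestamp (a revision counter is more robust against clock skew); a sketch:

```typescript
// Last-write-wins conflict resolution between a local and a remote copy
// of the same project document. `updatedAt` is epoch milliseconds.
interface ProjectDoc { id: string; updatedAt: number; body: string; }

function resolve(local: ProjectDoc, remote: ProjectDoc): ProjectDoc {
  // Ties go to remote so all devices converge on the same winner.
  return local.updatedAt > remote.updatedAt ? local : remote;
}
```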

Export formats

  • H.264 / H.265 with selectable resolution and bitrate
  • Aspect ratios: 16:9, 9:16 (vertical for TikTok/Reels), 1:1, 4:5
  • Watermarking: removable on paid tier
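Output dimensions follow from the aspect ratio and a target short edge; a sketch that also rounds to even dimensions, since video encoders generally require even width and height:

```typescript
// Compute export dimensions from an aspect ratio string ("16:9", "9:16",
// "1:1", "4:5") and the desired length of the shorter edge in pixels.
function exportSize(aspect: string, shortEdge: number): { width: number; height: number } {
  const [w, h] = aspect.split(":").map(Number);
  const even = (n: number) => Math.round(n / 2) * 2; // encoders want even dims
  if (w >= h) {
    return { width: even(shortEdge * (w / h)), height: even(shortEdge) };
  }
  return { width: even(shortEdge), height: even(shortEdge * (h / w)) };
}
```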

What separates senior from staff

Senior candidates draw the document model and the render pipeline cleanly. Staff candidates discuss the proxy strategy (lower-resolution previews), the GPU memory budget, and the ML-effect quality-vs-cost tradeoff. Principal candidates touch on collaborative editing (CRDT for the document) and the cloud-render fallback for heavy projects.

Frequently Asked Questions

Should I use FFmpeg?

For native iOS/Android apps, prefer the platform pipelines (AVFoundation, MediaCodec). FFmpeg adds substantial binary size and carries LGPL (or GPL, depending on the build) licensing obligations. For cross-platform consistency or unusual codecs, it is sometimes the right call.

How do I handle very long videos (hours)?

Stream-decode rather than loading whole files; use I-frame-only previews at coarse zoom levels; and encourage the user to split very long footage into separate projects.
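I-frame-only scrubbing comes down to a lookup: given the sorted keyframe timestamps from the container index, a binary search finds the latest keyframe at or before the requested time (a sketch):

```typescript
// Find the latest I-frame timestamp <= t, so coarse-zoom scrubbing can
// decode a single keyframe instead of a whole GOP. `iframes` is sorted
// ascending; times before the first keyframe fall back to it.
function keyframeBefore(iframes: number[], t: number): number {
  let lo = 0, hi = iframes.length - 1, best = iframes[0];
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    if (iframes[mid] <= t) { best = iframes[mid]; lo = mid + 1; }
    else hi = mid - 1;
  }
  return best;
}
```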

What about cloud rendering?

For heavy effects or low-end devices, offload export to a cloud renderer. Latency must be acceptable to the user (minutes, not hours). Most consumer apps stay on-device.
