Design a Mobile Translation Camera: Google Lens-Style Real-Time Translation

“Design a translation camera” is a vision-and-AR mobile system design prompt — Google Lens / Translate, Apple’s Live Text + Translate, Microsoft Translator. The interview probes whether you understand on-device ML, the AR overlay for live text replacement, and the unique latency and privacy tradeoffs of real-time camera intelligence.

Clarify scope

  • Real-time camera overlay or capture-and-translate?
  • How many language pairs?
  • Offline or always-online?
  • Voice translation in scope?
  • Conversation mode (back-and-forth speech)?

The pipeline

  1. Camera frame captured (~30 fps preview)
  2. OCR detects text regions and reads characters
  3. Text grouped into translatable segments
  4. Translation model produces target-language text
  5. AR overlay replaces source text with translated text on the live preview
  6. Audio output if voice mode
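
The numbered flow maps naturally onto a per-frame function. Below is a minimal Swift skeleton of that loop; `runOCR`, `translate`, and `renderOverlay` are hypothetical stand-ins (stubbed so the sketch compiles), with realistic versions sketched in later sections.

```swift
import CoreGraphics
import CoreVideo

struct TextSegment {
    let sourceText: String
    let boundingBox: CGRect   // normalized [0, 1] coordinates within the frame
}

// Hypothetical stage stubs so the skeleton compiles on its own.
func runOCR(on frame: CVPixelBuffer) -> [TextSegment] { [] }
func translate(_ text: String, from src: String, to dst: String) -> String { text }
func renderOverlay(_ results: [(TextSegment, String)]) {}

func processFrame(_ frame: CVPixelBuffer) {
    let segments = runOCR(on: frame)          // steps 2–3: detect, read, group
    let translated = segments.map { segment in
        (segment, translate(segment.sourceText, from: "de", to: "en"))  // step 4
    }
    renderOverlay(translated)                 // step 5: draw over the live preview
}
```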

On-device vs cloud

The latency and privacy story drives this:

  • On-device: private, works offline, no per-query cost, lower quality on rare languages
  • Cloud: higher quality, all languages, requires connectivity, privacy concerns with sending camera frames
  • Hybrid: on-device for common languages and OCR; cloud fallback for rare languages or when the user explicitly taps for a high-quality re-translation

Modern apps lean on-device for live preview and cloud for post-capture detail.
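
A hedged sketch of that routing decision in Swift; every name here (`EngineChoice`, `chooseEngine`, the flags) is illustrative for this example, not a platform API:

```swift
// Illustrative routing sketch, not a real API.
enum EngineChoice { case onDevice, cloud }

func chooseEngine(packInstalled: Bool,
                  isOnline: Bool,
                  wantsHighQuality: Bool) -> EngineChoice {
    // Live preview prefers on-device when a language pack exists:
    // private, offline-capable, and free per query.
    if packInstalled && !wantsHighQuality { return .onDevice }
    // Rare pairs or an explicit high-quality request go to the cloud.
    if isOnline { return .cloud }
    // Offline with no better option: fall back to whatever runs locally.
    return .onDevice
}
```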

OCR engineering

  • Detection: localize text regions in the frame (ML model)
  • Recognition: read characters within each region
  • Frame-to-frame stability: smooth detected boxes; do not flicker
  • Text grouping: combine line-level reads into translatable segments
  • iOS: Vision framework (VNRecognizeTextRequest); Android: ML Kit Text Recognition
  • Custom models for specialized scripts (Asian languages, Devanagari, Arabic)
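
On iOS, the detection and recognition steps collapse into a single Vision request. A minimal sketch using the VNRecognizeTextRequest API named above, with error handling trimmed for brevity:

```swift
import Vision
import CoreVideo

// One-frame OCR with Apple's Vision framework.
func recognizeText(in frame: CVPixelBuffer,
                   completion: @escaping ([(String, CGRect)]) -> Void) {
    let request = VNRecognizeTextRequest { request, _ in
        let observations = request.results as? [VNRecognizedTextObservation] ?? []
        // Each observation is a detected line; boundingBox is normalized.
        let lines = observations.compactMap { obs -> (String, CGRect)? in
            guard let best = obs.topCandidates(1).first else { return nil }
            return (best.string, obs.boundingBox)
        }
        completion(lines)
    }
    request.recognitionLevel = .fast        // use .accurate for post-capture mode
    request.usesLanguageCorrection = false  // keep latency low on live frames
    let handler = VNImageRequestHandler(cvPixelBuffer: frame, options: [:])
    try? handler.perform([request])
}
```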

Translation

  • Per-pair models (Marian / Helsinki-NLP OPUS-MT) or one multilingual model (NLLB), small enough to run on-device
  • Cloud fallback (Google, DeepL, Microsoft Translator APIs)
  • Cache common phrases to reduce repeated calls (a sketch follows this list)
  • Quality varies wildly by pair and domain (signage vs prose vs idiom)
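
The phrase cache can be as simple as an NSCache keyed by language pair plus source text. A sketch, assuming a `fetch` closure that wraps whichever engine (on-device model or cloud API) the router picked:

```swift
import Foundation

// Phrase cache in front of the translator. NSCache is a real API;
// the `fetch` closure is a hypothetical stand-in for the engine call.
final class TranslationCache {
    private let cache = NSCache<NSString, NSString>()

    func translate(_ text: String, pair: String,
                   fetch: (String) -> String) -> String {
        let key = "\(pair)|\(text)" as NSString
        if let hit = cache.object(forKey: key) { return hit as String }
        let result = fetch(text)               // on-device model or cloud call
        cache.setObject(result as NSString, forKey: key)
        return result
    }
}
```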

The AR overlay

The killer feature. Steps:

  1. Detect the text region with bounding polygon (not just rectangle)
  2. Sample background color around the text
  3. Render translated text on a colored rectangle that masks the source
  4. Match font size, color, orientation as best as possible
  5. Track the region frame-to-frame so the overlay sticks to moving text
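
A sketch of steps 3–4 using Core Animation layers; sampling the background color (step 2) and converting the normalized box into view coordinates are assumed to have happened upstream:

```swift
import UIKit

// Mask the source region with a background-colored layer, then draw
// the translation on top of it.
func makeOverlayLayer(for rectInView: CGRect,
                      translatedText: String,
                      sampledBackground: CGColor) -> CALayer {
    let mask = CALayer()
    mask.frame = rectInView
    mask.backgroundColor = sampledBackground   // hides the source text

    let label = CATextLayer()
    label.frame = mask.bounds
    label.string = translatedText
    label.fontSize = rectInView.height * 0.8   // rough size match to the region
    label.alignmentMode = .center
    label.contentsScale = UIScreen.main.scale  // crisp rendering on Retina
    mask.addSublayer(label)
    return mask
}
```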

Frame-to-frame tracking

  • Run OCR only on every Nth frame (e.g., every 3rd) to bound compute cost
  • Track detected boxes between OCR runs using optical flow
  • Avoid re-translating identical text — cache by region content
  • If text leaves frame and reappears, the cache may still hit
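
The list above says optical flow; on iOS, Vision's built-in object tracker (VNTrackObjectRequest driven by a VNSequenceRequestHandler, both real APIs) is one practical stand-in. A sketch of the every-Nth-frame interleave, with the OCR re-seeding and cache wiring elided:

```swift
import Vision
import CoreVideo

// Full OCR every Nth frame; cheap tracking on the frames in between.
final class RegionTracker {
    private let sequenceHandler = VNSequenceRequestHandler()
    private var trackers: [VNTrackObjectRequest] = []
    private var frameCount = 0
    private let ocrInterval = 3                 // full OCR every 3rd frame

    func process(_ frame: CVPixelBuffer) {
        frameCount += 1
        if frameCount % ocrInterval == 0 {
            // Full OCR pass: detect fresh regions and rebuild `trackers`
            // from the new observations (see the OCR sketch above).
        } else {
            // Cheap pass: advance each tracked box to this frame.
            try? sequenceHandler.perform(trackers, on: frame)
            for tracker in trackers {
                if let obs = tracker.results?.first as? VNDetectedObjectObservation {
                    tracker.inputObservation = obs  // feed forward to next frame
                    // obs.boundingBox is the updated, normalized region.
                }
            }
        }
    }
}
```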

Latency budget

  • Frame capture: 33 ms (30 fps)
  • OCR: 50–200 ms on-device (model-dependent)
  • Translation: 30–100 ms on-device
  • Render: 16 ms
  • Total: under 300 ms for a smooth-feeling experience

Hit this budget by running OCR and translation off the main thread, overlapped with capture of the next frame.
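
One common pattern: a background work queue that drops frames while a frame is still in flight, so latency never accumulates. A sketch; in real code the `busy` flag needs atomic access (e.g., a lock), left plain here for brevity:

```swift
import CoreVideo
import Dispatch

// Drop-don't-queue: skip new frames while one is being processed.
final class FramePipeline {
    private let workQueue = DispatchQueue(label: "ocr.translate", qos: .userInitiated)
    private var busy = false

    func enqueue(_ frame: CVPixelBuffer) {
        guard !busy else { return }            // drop the frame, keep latency flat
        busy = true
        workQueue.async { [weak self] in
            // OCR -> translate here (see earlier sketches), then render:
            DispatchQueue.main.async { /* update overlay layers */ }
            self?.busy = false
        }
    }
}
```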

Voice mode

  • ASR captures speech in source language
  • Translation produces target text
  • TTS speaks the translation
  • Conversation mode: turn-taking with the other speaker; speech detection on both sides
  • The budget is looser but the pipeline is harder; for speech, 1–2 seconds end-to-end is usually acceptable
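
The output stage of this chain is a stable, real API on iOS (AVSpeechSynthesizer). A sketch that speaks the translated string, assuming ASR and translation have already produced `translated`:

```swift
import AVFoundation

// TTS output for voice mode.
let synthesizer = AVSpeechSynthesizer()

func speak(_ translated: String, languageCode: String) {
    let utterance = AVSpeechUtterance(string: translated)
    utterance.voice = AVSpeechSynthesisVoice(language: languageCode)  // e.g. "fr-FR"
    utterance.rate = AVSpeechUtteranceDefaultSpeechRate
    synthesizer.speak(utterance)
}
```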

Privacy

  • Camera frames are sensitive — explicit user permission with rationale
  • On-device processing is the default for live preview
  • If sending frames to the cloud, document what is sent, retain it only briefly, and allow opt-out
  • Voice in conversation mode raises eavesdropping concerns; design with consent indicators

Battery and thermal

  • Continuous OCR + translation is heavy on the Neural Engine / NPU
  • Reduce frame rate when device is hot
  • Stop processing when screen is dimmed or app backgrounded
  • Show a battery warning if the user stays in continuous mode for long periods
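
iOS exposes thermal pressure directly via ProcessInfo.thermalState (a real API), which makes the frame-rate backoff straightforward. A sketch; the interval values are illustrative, not tuned numbers:

```swift
import Foundation

// Thermal-aware backoff: OCR less often as the device heats up.
func ocrFrameInterval() -> Int {
    switch ProcessInfo.processInfo.thermalState {
    case .nominal:  return 3     // OCR every 3rd frame
    case .fair:     return 5
    case .serious:  return 10    // back off hard before the OS throttles us
    case .critical: return 30    // roughly one OCR pass per second at 30 fps
    @unknown default: return 10
    }
}
```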

Edge cases interviewers love

  • Curved text (e.g., wine bottle labels) — model must handle
  • Stylized fonts in signage — quality drops
  • Very small text — gate on an OCR confidence threshold (see the sketch after this list)
  • Mixed languages on one page
  • Right-to-left scripts (Arabic, Hebrew)
  • Glare, shadows, occlusion
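
A sketch of that confidence gate using Vision's per-candidate confidence scores; the 0.5 cutoff is illustrative and should be tuned per model, script, and text size:

```swift
import Vision

// Drop low-confidence reads rather than overlay a wrong translation.
func reliableLines(from observations: [VNRecognizedTextObservation],
                   minimumConfidence: VNConfidence = 0.5) -> [String] {
    observations.compactMap { obs in
        guard let best = obs.topCandidates(1).first,
              best.confidence >= minimumConfidence else { return nil }
        return best.string
    }
}
```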

What separates senior from staff

Senior candidates draw the OCR+translate+overlay pipeline. Staff candidates discuss the on-device-vs-cloud hybrid, frame-to-frame tracking, latency budget, and the AR overlay rendering. Principal candidates raise the offline language-pack download story, the multilingual conversation-mode UX, and the privacy threat model.

Frequently Asked Questions

What ML frameworks should I use?

iOS: Vision + Core ML. Android: ML Kit + TensorFlow Lite. Custom models compiled for the Neural Engine / NPU. ML Kit's on-device translation ships downloadable packs for 50+ languages.

How do I handle a low-resource language?

Cloud-only is the realistic answer; on-device models for niche languages are not yet usable. Surface a “low quality” indicator if confidence is low.

What about handwriting?

Different model from print OCR; lower accuracy. Out of scope for most products; add as a separate mode if needed.
