Design a Mobile Translation Camera: Google Lens-Style Real-Time Translation

“Design a translation camera” is a vision-and-AR mobile system design prompt — Google Lens / Translate, Apple’s Live Text + Translate, Microsoft Translator. The interview probes whether you understand on-device ML, the AR overlay for live text replacement, and the unique latency and privacy tradeoffs of real-time camera intelligence.

Clarify scope

  • Real-time camera overlay or capture-and-translate?
  • How many language pairs?
  • Offline or always-online?
  • Voice translation in scope?
  • Conversation mode (back-and-forth speech)?

The pipeline

  1. Camera frame captured (~30 fps preview)
  2. OCR detects text regions and reads characters
  3. Text grouped into translatable segments
  4. Translation model produces target-language text
  5. AR overlay replaces source text with translated text on the live preview
  6. Audio output if voice mode
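
The numbered flow maps naturally onto a per-frame function. Below is a minimal Swift skeleton of that loop; `runOCR`, `translate`, and `renderOverlay` are hypothetical stand-ins (stubbed so the sketch compiles), with realistic versions sketched in later sections.

```swift
import CoreGraphics
import CoreVideo

struct TextSegment {
    let sourceText: String
    let boundingBox: CGRect   // normalized [0, 1] coordinates within the frame
}

// Hypothetical stage stubs so the skeleton compiles on its own.
func runOCR(on frame: CVPixelBuffer) -> [TextSegment] { [] }
func translate(_ text: String, from src: String, to dst: String) -> String { text }
func renderOverlay(_ results: [(TextSegment, String)]) {}

func processFrame(_ frame: CVPixelBuffer) {
    let segments = runOCR(on: frame)          // steps 2–3: detect, read, group
    let translated = segments.map { segment in
        (segment, translate(segment.sourceText, from: "de", to: "en"))  // step 4
    }
    renderOverlay(translated)                 // step 5: draw over the live preview
}
```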

On-device vs cloud

The latency and privacy story drives this:

  • On-device: private, works offline, no per-query cost, lower quality on rare languages
  • Cloud: higher quality, all languages, requires connectivity, privacy concerns with sending camera frames
  • Hybrid: on-device for common languages and OCR; cloud fallback for rare languages or when the user explicitly taps for a high-quality re-translation

Modern apps lean on-device for live preview and cloud for post-capture detail.
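
A hedged sketch of that routing decision in Swift; every name here (`EngineChoice`, `chooseEngine`, the flags) is illustrative for this example, not a platform API:

```swift
// Illustrative routing sketch, not a real API.
enum EngineChoice { case onDevice, cloud }

func chooseEngine(packInstalled: Bool,
                  isOnline: Bool,
                  wantsHighQuality: Bool) -> EngineChoice {
    // Live preview prefers on-device when a language pack exists:
    // private, offline-capable, and free per query.
    if packInstalled && !wantsHighQuality { return .onDevice }
    // Rare pairs or an explicit high-quality request go to the cloud.
    if isOnline { return .cloud }
    // Offline with no better option: fall back to whatever runs locally.
    return .onDevice
}
```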

OCR engineering

  • Detection: localize text regions in the frame (ML model)
  • Recognition: read characters within each region
  • Frame-to-frame stability: smooth detected boxes; do not flicker
  • Text grouping: combine line-level reads into translatable segments
  • iOS: Vision framework (VNRecognizeTextRequest); Android: ML Kit Text Recognition
  • Custom models for specialized scripts (Asian languages, Devanagari, Arabic)
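
On iOS, the detection and recognition steps collapse into a single Vision request. A minimal sketch using the VNRecognizeTextRequest API named above, with error handling trimmed for brevity:

```swift
import Vision
import CoreVideo

// One-frame OCR with Apple's Vision framework.
func recognizeText(in frame: CVPixelBuffer,
                   completion: @escaping ([(String, CGRect)]) -> Void) {
    let request = VNRecognizeTextRequest { request, _ in
        let observations = request.results as? [VNRecognizedTextObservation] ?? []
        // Each observation is a detected line; boundingBox is normalized.
        let lines = observations.compactMap { obs -> (String, CGRect)? in
            guard let best = obs.topCandidates(1).first else { return nil }
            return (best.string, obs.boundingBox)
        }
        completion(lines)
    }
    request.recognitionLevel = .fast        // use .accurate for post-capture mode
    request.usesLanguageCorrection = false  // keep latency low on live frames
    let handler = VNImageRequestHandler(cvPixelBuffer: frame, options: [:])
    try? handler.perform([request])
}
```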

Translation

  • Per-pair models (Marian / Helsinki-NLP OPUS-MT) or one multilingual model (NLLB), small enough to run on-device
  • Cloud fallback (Google, DeepL, Microsoft Translator APIs)
  • Cache common phrases to reduce repeated calls (a sketch follows this list)
  • Quality varies wildly by pair and domain (signage vs prose vs idiom)
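
The phrase cache can be as simple as an NSCache keyed by language pair plus source text. A sketch, assuming a `fetch` closure that wraps whichever engine (on-device model or cloud API) the router picked:

```swift
import Foundation

// Phrase cache in front of the translator. NSCache is a real API;
// the `fetch` closure is a hypothetical stand-in for the engine call.
final class TranslationCache {
    private let cache = NSCache<NSString, NSString>()

    func translate(_ text: String, pair: String,
                   fetch: (String) -> String) -> String {
        let key = "\(pair)|\(text)" as NSString
        if let hit = cache.object(forKey: key) { return hit as String }
        let result = fetch(text)               // on-device model or cloud call
        cache.setObject(result as NSString, forKey: key)
        return result
    }
}
```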

The AR overlay

The killer feature. Steps:

  1. Detect the text region with bounding polygon (not just rectangle)
  2. Sample background color around the text
  3. Render translated text on a colored rectangle that masks the source
  4. Match font size, color, orientation as best as possible
  5. Track the region frame-to-frame so the overlay sticks to moving text
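
A sketch of steps 3–4 using Core Animation layers; sampling the background color (step 2) and converting the normalized box into view coordinates are assumed to have happened upstream:

```swift
import UIKit

// Mask the source region with a background-colored layer, then draw
// the translation on top of it.
func makeOverlayLayer(for rectInView: CGRect,
                      translatedText: String,
                      sampledBackground: CGColor) -> CALayer {
    let mask = CALayer()
    mask.frame = rectInView
    mask.backgroundColor = sampledBackground   // hides the source text

    let label = CATextLayer()
    label.frame = mask.bounds
    label.string = translatedText
    label.fontSize = rectInView.height * 0.8   // rough size match to the region
    label.alignmentMode = .center
    label.contentsScale = UIScreen.main.scale  // crisp rendering on Retina
    mask.addSublayer(label)
    return mask
}
```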

Frame-to-frame tracking

  • Run OCR only on every Nth frame (e.g., every 3rd) to bound compute cost
  • Track detected boxes between OCR runs using optical flow
  • Avoid re-translating identical text — cache by region content
  • If text leaves frame and reappears, the cache may still hit
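
The list above says optical flow; on iOS, Vision's built-in object tracker (VNTrackObjectRequest driven by a VNSequenceRequestHandler, both real APIs) is one practical stand-in. A sketch of the every-Nth-frame interleave, with the OCR re-seeding and cache wiring elided:

```swift
import Vision
import CoreVideo

// Full OCR every Nth frame; cheap tracking on the frames in between.
final class RegionTracker {
    private let sequenceHandler = VNSequenceRequestHandler()
    private var trackers: [VNTrackObjectRequest] = []
    private var frameCount = 0
    private let ocrInterval = 3                 // full OCR every 3rd frame

    func process(_ frame: CVPixelBuffer) {
        frameCount += 1
        if frameCount % ocrInterval == 0 {
            // Full OCR pass: detect fresh regions and rebuild `trackers`
            // from the new observations (see the OCR sketch above).
        } else {
            // Cheap pass: advance each tracked box to this frame.
            try? sequenceHandler.perform(trackers, on: frame)
            for tracker in trackers {
                if let obs = tracker.results?.first as? VNDetectedObjectObservation {
                    tracker.inputObservation = obs  // feed forward to next frame
                    // obs.boundingBox is the updated, normalized region.
                }
            }
        }
    }
}
```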

Latency budget

  • Frame capture: 33 ms (30 fps)
  • OCR: 50–200 ms on-device (model-dependent)
  • Translation: 30–100 ms on-device
  • Render: 16 ms
  • Total: under 300 ms for a smooth-feeling experience

Hit this budget by running OCR and translation off the main thread, overlapped with capture of the next frame.
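
One common pattern: a background work queue that drops frames while a frame is still in flight, so latency never accumulates. A sketch; in real code the `busy` flag needs atomic access (e.g., a lock), left plain here for brevity:

```swift
import CoreVideo
import Dispatch

// Drop-don't-queue: skip new frames while one is being processed.
final class FramePipeline {
    private let workQueue = DispatchQueue(label: "ocr.translate", qos: .userInitiated)
    private var busy = false

    func enqueue(_ frame: CVPixelBuffer) {
        guard !busy else { return }            // drop the frame, keep latency flat
        busy = true
        workQueue.async { [weak self] in
            // OCR -> translate here (see earlier sketches), then render:
            DispatchQueue.main.async { /* update overlay layers */ }
            self?.busy = false
        }
    }
}
```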

Voice mode

  • ASR captures speech in source language
  • Translation produces target text
  • TTS speaks the translation
  • Conversation mode: turn-taking with the other speaker; speech detection on both sides
  • The budget is looser but the pipeline is harder; for speech, 1–2 seconds end-to-end is usually acceptable
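
The output stage of this chain is a stable, real API on iOS (AVSpeechSynthesizer). A sketch that speaks the translated string, assuming ASR and translation have already produced `translated`:

```swift
import AVFoundation

// TTS output for voice mode.
let synthesizer = AVSpeechSynthesizer()

func speak(_ translated: String, languageCode: String) {
    let utterance = AVSpeechUtterance(string: translated)
    utterance.voice = AVSpeechSynthesisVoice(language: languageCode)  // e.g. "fr-FR"
    utterance.rate = AVSpeechUtteranceDefaultSpeechRate
    synthesizer.speak(utterance)
}
```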

Privacy

  • Camera frames are sensitive — explicit user permission with rationale
  • On-device processing is the default for live preview
  • If sending frames to the cloud, document what is sent, retain it only briefly, and allow opt-out
  • Voice in conversation mode raises eavesdropping concerns; design with consent indicators

Battery and thermal

  • Continuous OCR + translation is heavy on the Neural Engine / NPU
  • Reduce frame rate when device is hot
  • Stop processing when screen is dimmed or app backgrounded
  • Show a battery warning if the user stays in continuous mode for long periods
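
iOS exposes thermal pressure directly via ProcessInfo.thermalState (a real API), which makes the frame-rate backoff straightforward. A sketch; the interval values are illustrative, not tuned numbers:

```swift
import Foundation

// Thermal-aware backoff: OCR less often as the device heats up.
func ocrFrameInterval() -> Int {
    switch ProcessInfo.processInfo.thermalState {
    case .nominal:  return 3     // OCR every 3rd frame
    case .fair:     return 5
    case .serious:  return 10    // back off hard before the OS throttles us
    case .critical: return 30    // roughly one OCR pass per second at 30 fps
    @unknown default: return 10
    }
}
```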

Edge cases interviewers love

  • Curved text (e.g., wine bottle labels) — model must handle
  • Stylized fonts in signage — quality drops
  • Very small text — gate on an OCR confidence threshold (see the sketch after this list)
  • Mixed languages on one page
  • Right-to-left scripts (Arabic, Hebrew)
  • Glare, shadows, occlusion
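
A sketch of that confidence gate using Vision's per-candidate confidence scores; the 0.5 cutoff is illustrative and should be tuned per model, script, and text size:

```swift
import Vision

// Drop low-confidence reads rather than overlay a wrong translation.
func reliableLines(from observations: [VNRecognizedTextObservation],
                   minimumConfidence: VNConfidence = 0.5) -> [String] {
    observations.compactMap { obs in
        guard let best = obs.topCandidates(1).first,
              best.confidence >= minimumConfidence else { return nil }
        return best.string
    }
}
```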

What separates senior from staff

Senior candidates draw the OCR+translate+overlay pipeline. Staff candidates discuss the on-device-vs-cloud hybrid, frame-to-frame tracking, latency budget, and the AR overlay rendering. Principal candidates raise the offline language-pack download story, the multilingual conversation-mode UX, and the privacy threat model.

Frequently Asked Questions

What ML frameworks should I use?

iOS: Vision + Core ML. Android: ML Kit + TensorFlow Lite. Custom models compiled for the Neural Engine / NPU. ML Kit's on-device translation ships downloadable packs for 50+ languages.

How do I handle a low-resource language?

Cloud-only is the realistic answer; on-device models for niche languages are not yet usable. Surface a “low quality” indicator if confidence is low.

What about handwriting?

Different model from print OCR; lower accuracy. Out of scope for most products; add as a separate mode if needed.
