“Design a translation camera” is a vision-and-AR mobile system design prompt — Google Lens / Translate, Apple’s Live Text + Translate, Microsoft Translator. The interview probes whether you understand on-device ML, the AR overlay for live text replacement, and the unique latency and privacy tradeoffs of real-time camera intelligence.
Clarify scope
- Real-time camera overlay or capture-and-translate?
- How many language pairs?
- Offline or always-online?
- Voice translation in scope?
- Conversation mode (back-and-forth speech)?
The pipeline
- Camera frame captured (~30 fps preview)
- OCR detects text regions and reads characters
- Text grouped into translatable segments
- Translation model produces target-language text
- AR overlay replaces source text with translated text on the live preview
- Audio output if voice mode
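A minimal sketch of that flow in Swift, with stub `detectText`, `translate`, and `renderOverlay` stages standing in for the OCR, MT, and AR subsystems covered below (all names here are illustrative, not a real API):

```swift
import Foundation
import CoreGraphics

struct TextSegment {
    var text: String
    var corners: [CGPoint]   // bounding polygon, not just a rectangle
}

// Stub stages so the end-to-end data flow is visible.
func detectText(in frame: Data) -> [TextSegment] { [] }            // OCR: detect + recognize + group
func translate(_ text: String, to lang: String) -> String { text } // on-device or cloud MT
func renderOverlay(_ text: String, over corners: [CGPoint]) { }    // AR overlay on the live preview

func processFrame(_ frame: Data, targetLang: String) {
    for segment in detectText(in: frame) {
        let translated = translate(segment.text, to: targetLang)
        renderOverlay(translated, over: segment.corners)
    }
}
```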
On-device vs cloud
The latency and privacy story drives this:
- On-device: private, works offline, no per-query cost, lower quality on rare languages
- Cloud: higher quality, all languages, requires connectivity, privacy concerns with sending camera frames
- Hybrid: on-device for common languages and OCR; cloud fallback for rare languages or when the user explicitly taps for higher quality
Modern apps lean on-device for live preview and cloud for post-capture detail.
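A sketch of the hybrid routing policy; `onDevicePacks` and the tap-for-quality flag are illustrative names, not a real API:

```swift
enum TranslationRoute { case onDevice, cloud }

func route(sourceLang: String,
           onDevicePacks: Set<String>,     // downloaded language packs
           userTappedHighQuality: Bool,
           isOnline: Bool) -> TranslationRoute {
    // Live preview prefers on-device: private, offline-capable, free per query.
    if onDevicePacks.contains(sourceLang) && !userTappedHighQuality {
        return .onDevice
    }
    // Rare language or explicit quality request: go to the cloud if we can.
    return isOnline ? .cloud : .onDevice
}
```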
OCR engineering
- Detection: localize text regions in the frame (ML model)
- Recognition: read characters within each region
- Frame-to-frame stability: smooth detected boxes so the overlay does not flicker
- Text grouping: combine line-level reads into translatable segments
- iOS: Vision framework (VNRecognizeTextRequest); Android: ML Kit Text Recognition
- Custom models for specialized scripts (Asian languages, Devanagari, Arabic)
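On iOS, the detection and recognition steps are one Vision request. A minimal sketch tuned for live preview (`.fast` trades accuracy for the latency budget discussed later):

```swift
import Vision
import CoreVideo

// On-device OCR on a camera frame; .fast favors the live-preview latency
// budget, while .accurate suits a post-capture "high quality" mode.
func recognizeText(in pixelBuffer: CVPixelBuffer,
                   completion: @escaping ([VNRecognizedTextObservation]) -> Void) {
    let request = VNRecognizeTextRequest { request, _ in
        completion(request.results as? [VNRecognizedTextObservation] ?? [])
    }
    request.recognitionLevel = .fast          // trade accuracy for latency
    request.usesLanguageCorrection = false    // signage is not prose
    let handler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer, options: [:])
    try? handler.perform([request])
}
```

Each observation carries a bounding box plus ranked candidates; `observation.topCandidates(1).first?.string` gives the best read to feed into grouping.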
Translation
- On-device translation via per-pair models (Marian-based Helsinki-NLP OPUS-MT family) or a multilingual model (NLLB)
- Cloud fallback (Google, DeepL, Microsoft Translator APIs)
- Cache common phrases to reduce repeated calls
- Quality varies wildly by pair and domain (signage vs prose vs idiom)
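A minimal sketch of the phrase cache, keyed on the language pair plus source text; a production version would bound its size (e.g., LRU) and normalize case and whitespace:

```swift
// Illustrative phrase cache to avoid re-translating identical text.
struct TranslationCache {
    private var cache: [String: String] = [:]

    mutating func translation(for text: String, pair: String,
                              translate: (String) -> String) -> String {
        let key = "\(pair)|\(text)"
        if let hit = cache[key] { return hit }   // cache hit: no model call
        let result = translate(text)             // miss: run the model once
        cache[key] = result
        return result
    }
}
```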
The AR overlay
The killer feature. Steps:
- Detect the text region with bounding polygon (not just rectangle)
- Sample background color around the text
- Render translated text on a colored rectangle that masks the source
- Match font size, color, and orientation as closely as possible
- Track the region frame-to-frame so the overlay sticks to moving text
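A sketch of the masking-and-redraw step with Core Graphics; the sampled `backgroundColor` and the font-size heuristic are illustrative assumptions:

```swift
import UIKit

// Paint a rectangle in the sampled background color over the source text,
// then draw the translation on top at an approximated size.
// backgroundColor is assumed to come from sampling pixels around the region.
func drawOverlay(_ translated: String, in rect: CGRect,
                 backgroundColor: UIColor, textColor: UIColor,
                 context: CGContext) {
    context.setFillColor(backgroundColor.cgColor)
    context.fill(rect)                       // mask the source text
    let fontSize = rect.height * 0.8         // crude match to source size
    let attrs: [NSAttributedString.Key: Any] = [
        .font: UIFont.systemFont(ofSize: fontSize),
        .foregroundColor: textColor,
    ]
    UIGraphicsPushContext(context)
    (translated as NSString).draw(in: rect, withAttributes: attrs)
    UIGraphicsPopContext()
}
```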
Frame-to-frame tracking
- Run OCR on every Nth frame (e.g., every 3rd) to control cost; see the sketch after this list
- Track detected boxes between OCR runs using optical flow
- Avoid re-translating identical text — cache by region content
- If text leaves the frame and reappears, the cache may still hit
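A sketch of the every-Nth-frame schedule; `runOCR` and `trackWithOpticalFlow` are hypothetical hooks into the OCR engine and a tracker (e.g., Vision's VNTrackObjectRequest or Lucas-Kanade optical flow):

```swift
import CoreVideo

// Full OCR every Nth frame; cheap box propagation in between.
final class FrameScheduler {
    private var frameCount = 0
    let ocrInterval = 3   // OCR every 3rd frame, track the other two

    func onFrame(_ frame: CVPixelBuffer) {
        frameCount += 1
        if frameCount % ocrInterval == 0 {
            runOCR(frame)                // detect + recognize + translate
        } else {
            trackWithOpticalFlow(frame)  // move existing overlay boxes
        }
    }

    private func runOCR(_ frame: CVPixelBuffer) { /* OCR engine hook */ }
    private func trackWithOpticalFlow(_ frame: CVPixelBuffer) { /* tracker hook */ }
}
```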
Latency budget
- Frame capture: 33 ms (30 fps)
- OCR: 50–200 ms on-device (model-dependent)
- Translation: 30–100 ms on-device
- Render: 16 ms
- Total: under 300 ms for a smooth-feeling experience
Hit this by running OCR and translation off the main thread, pipelined against capture of the next frame.
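One way to structure that, assuming a camera callback at 30 fps: drop frames while one is still in flight rather than queueing them, since queued frames only add latency.

```swift
import CoreVideo
import Dispatch

// Keep OCR + MT off the main thread; at most one frame in flight.
final class FramePipeline {
    private let workQueue = DispatchQueue(label: "ocr-translate", qos: .userInitiated)
    private let inFlight = DispatchSemaphore(value: 1)

    // Called from the camera callback at ~30 fps.
    func submit(_ frame: CVPixelBuffer, process: @escaping (CVPixelBuffer) -> Void) {
        // Previous frame still processing? Drop this one; the next arrives
        // in ~33 ms, and dropping beats accumulating queue latency.
        guard inFlight.wait(timeout: .now()) == .success else { return }
        workQueue.async {
            process(frame)          // OCR + translate + overlay update
            self.inFlight.signal()
        }
    }
}
```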
Voice mode
- ASR captures speech in source language
- Translation produces target text
- TTS speaks the translation
- Conversation mode: turn-taking with the other speaker; speech detection on both sides
- Latency expectations are looser than for the camera path, but still hard to hit; 1–2 seconds end-to-end is usually acceptable for speech
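The TTS leg on iOS is a few lines with AVSpeechSynthesizer; upstream ASR and translation are assumed to have already produced `translatedText` in the target language:

```swift
import AVFoundation

// Keep the synthesizer alive for the session; a local would be
// deallocated before speech finishes.
let synthesizer = AVSpeechSynthesizer()

func speak(_ translatedText: String, languageCode: String) {
    let utterance = AVSpeechUtterance(string: translatedText)
    utterance.voice = AVSpeechSynthesisVoice(language: languageCode) // e.g. "fr-FR"
    synthesizer.speak(utterance)
}
```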
Privacy
- Camera frames are sensitive — explicit user permission with rationale
- On-device processing is the default for live preview
- If sending frames to the cloud, document what is sent, retain it only briefly, and allow opt-out
- Voice in conversation mode raises eavesdropping concerns; design with consent indicators
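For reference, the explicit-permission step on iOS; the rationale string shown in the system prompt comes from `NSCameraUsageDescription` in Info.plist:

```swift
import AVFoundation

// Request camera access with the system prompt before starting capture.
func ensureCameraAccess(_ onGranted: @escaping () -> Void) {
    switch AVCaptureDevice.authorizationStatus(for: .video) {
    case .authorized:
        onGranted()
    case .notDetermined:
        AVCaptureDevice.requestAccess(for: .video) { granted in
            if granted { onGranted() }
        }
    default:
        break // denied/restricted: explain why translation needs the camera
    }
}
```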
Battery and thermal
- Continuous OCR + translation is heavy on the Neural Engine / NPU
- Reduce frame rate when device is hot
- Stop processing when screen is dimmed or app backgrounded
- Surface a battery warning if the user stays in continuous mode for long periods
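iOS exposes thermal pressure directly via ProcessInfo; a sketch of throttling on it, where the frame-rate values are illustrative and `ProcessInfo.thermalStateDidChangeNotification` signals state changes:

```swift
import Foundation

// Map thermal pressure to a processing frame rate.
func targetFrameRate() -> Int {
    switch ProcessInfo.processInfo.thermalState {
    case .nominal: return 30   // full-rate live preview
    case .fair:    return 15   // back off before the device gets hot
    case .serious: return 5    // OCR on demand only
    default:       return 0    // .critical: pause processing entirely
    }
}
```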
Edge cases interviewers love
- Curved text (e.g., wine bottle labels): the detection model must handle non-rectangular regions
- Stylized fonts in signage: recognition quality drops
- Very small text: gate reads behind an OCR confidence threshold
- Mixed languages on one page
- Right-to-left scripts (Arabic, Hebrew)
- Glare, shadows, occlusion
What separates senior from staff
Senior candidates draw the OCR+translate+overlay pipeline. Staff candidates discuss the on-device-vs-cloud hybrid, frame-to-frame tracking, latency budget, and the AR overlay rendering. Principal candidates raise the offline language-pack download story, the multilingual conversation-mode UX, and the privacy threat model.
Frequently Asked Questions
What ML frameworks should I use?
iOS: Vision + Core ML. Android: ML Kit + TensorFlow Lite. Custom models get compiled for the Neural Engine / NPU. ML Kit's on-device translation supports 50+ languages.
How do I handle a low-resource language?
Cloud-only is the realistic answer; on-device models for niche languages are not yet usable. Surface a “low quality” indicator if confidence is low.
What about handwriting?
Different model from print OCR; lower accuracy. Out of scope for most products; add as a separate mode if needed.