Translation Service Low-Level Design: String Extraction, TM Lookup, and Human Review Queue


A translation service manages the full lifecycle of localizable strings: extraction from source code and content, lookup against a translation memory (TM), machine translation (MT) fallback, and routing unconfident segments to a human post-edit review queue. It serves both engineering teams shipping new features and localization project managers coordinating translator workflows.

Requirements

Functional

  • Extract translatable strings from source files (JSON, YAML, PO, XLIFF, HTML) via a CI pipeline hook
  • Look up existing approved translations from a TM for exact and fuzzy matches
  • Fall back to an MT provider (DeepL, Google Translate) when TM coverage is below a threshold
  • Route low-confidence MT segments to a human review queue with context (screenshot, surrounding strings)
  • Serve translated strings to client applications via a CDN-backed edge API
  • Export locale bundles in JSON, PO, and XLIFF formats

Non-Functional

  • String serve latency: under 20 ms p99 from CDN edge
  • TM lookup latency: under 50 ms p99 for fuzzy match across 10 million segments
  • Support 100+ target locales per project

Data Model

  • projects: project_id (UUID), name, source_locale, target_locales (ARRAY), created_at
  • strings: string_id (UUID), project_id, key (TEXT), source_text (TEXT), context (TEXT), max_length (INT), screenshot_url (TEXT), fingerprint (SHA256 of source_text), created_at
  • translation_memory: tm_id, source_fingerprint (SHA256), target_locale, translated_text (TEXT), quality_score (FLOAT 0-1), approved (BOOL), translator_id, updated_at
  • review_tasks: task_id, string_id, target_locale, mt_output (TEXT), mt_confidence (FLOAT), assigned_to, status (ENUM: pending, in_review, approved, rejected), due_at
  • locale_bundles: project_id, target_locale, version (INT), bundle_url (S3 pre-signed), generated_at
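The row layouts above can be rendered as typed records. A minimal Python sketch of two of the tables (field types inferred from the column list; the actual storage would be SQL DDL):

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from uuid import UUID, uuid4

class TaskStatus(Enum):
    PENDING = "pending"
    IN_REVIEW = "in_review"
    APPROVED = "approved"
    REJECTED = "rejected"

@dataclass
class TranslationMemoryEntry:
    tm_id: UUID
    source_fingerprint: str      # SHA-256 hex of the source text
    target_locale: str
    translated_text: str
    quality_score: float         # 0-1
    approved: bool
    translator_id: UUID
    updated_at: datetime

@dataclass
class ReviewTask:
    task_id: UUID
    string_id: UUID
    target_locale: str
    mt_output: str
    mt_confidence: float
    assigned_to: UUID
    status: TaskStatus
    due_at: datetime
```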

Core Algorithms

String Extraction

The extractor runs as a CI step. It parses source files using format-specific parsers, computes a SHA256 fingerprint of each source string, and diffs against the current strings table. New strings are inserted; modified strings (fingerprint changed) create a new string_id and deprecate the old one to preserve TM history. Deleted keys are soft-deleted and excluded from bundle generation.
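The diff step above can be sketched as follows. This is a minimal in-memory illustration (the real extractor diffs against the strings table, not a dict):

```python
import hashlib

def fingerprint(text):
    """SHA-256 fingerprint of the source text, as described above."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def diff_strings(extracted, current):
    """Diff extracted {key: source_text} against the current strings table,
    represented here as {key: fingerprint}. Returns the three actions the
    extractor takes: insert new keys, deprecate-and-reinsert changed keys,
    soft-delete removed keys (excluded from bundle generation)."""
    new, changed, deleted = [], [], []
    for key, text in extracted.items():
        fp = fingerprint(text)
        if key not in current:
            new.append(key)
        elif current[key] != fp:
            changed.append(key)   # new string_id; old one deprecated to keep TM history
    for key in current:
        if key not in extracted:
            deleted.append(key)   # soft-deleted
    return new, changed, deleted
```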

Translation Memory Lookup

Exact match: query translation_memory by source_fingerprint and target_locale. If there is no exact match, fuzzy matching uses trigram similarity (pg_trgm in PostgreSQL) against source texts covered by a GIN index. A similarity of 0.75 or higher qualifies as a fuzzy match; the matching TM entry is returned with its quality_score reduced by (1 - similarity) as a penalty. Segments whose penalized quality_score falls below 0.6 are routed to MT.
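In production the similarity computation runs inside PostgreSQL; the sketch below reimplements a simplified pg_trgm-style trigram similarity in Python purely to illustrate the threshold-and-penalty logic (padding and normalization are approximations of pg_trgm's behavior):

```python
def trigrams(s):
    """Approximate pg_trgm trigram extraction: lowercase, pad with spaces."""
    padded = "  " + s.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def similarity(a, b):
    """Jaccard similarity over trigram sets, as pg_trgm computes it."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def fuzzy_lookup(source, tm_entries, threshold=0.75):
    """tm_entries: list of (source_text, translated_text, quality_score).
    Returns (translated_text, penalized_score) for the best match clearing
    the threshold, or None -> route to MT."""
    best = None
    for src, tgt, quality in tm_entries:
        sim = similarity(source, src)
        if sim >= threshold:
            penalized = quality - (1 - sim)   # the (1 - similarity) penalty
            if best is None or penalized > best[1]:
                best = (tgt, penalized)
    return best
```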

MT Fallback and Confidence Scoring

The MT client calls the configured provider, receives a translation, and computes a confidence score by combining the provider's quality estimate (where available) with a length-ratio heuristic: a ratio of translated length to source length outside the 0.5 to 2.0 range signals suspect output. Segments with confidence below 0.7 are inserted into review_tasks; the rest are auto-approved and written to the TM.
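A minimal sketch of the scoring and routing step. The 50/50 weighting between the provider estimate and the heuristic, and the 0.4 penalty value, are illustrative assumptions not specified by the design:

```python
def mt_confidence(source, translated, provider_qe=None):
    """Combine the provider quality estimate (when available) with the
    length-ratio heuristic: a translated/source length ratio outside
    0.5-2.0 signals suspect output."""
    ratio = len(translated) / max(len(source), 1)
    heuristic = 1.0 if 0.5 <= ratio <= 2.0 else 0.4   # penalize suspect lengths
    if provider_qe is None:
        return heuristic
    return 0.5 * provider_qe + 0.5 * heuristic        # assumed equal weighting

def route(confidence, threshold=0.7):
    """Below-threshold segments go to human review; the rest auto-approve."""
    return "auto_approve" if confidence >= threshold else "review_queue"
```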

Scalability and Architecture

The pipeline is event-driven. A CI webhook triggers string extraction, which publishes new/changed string events to a Kafka topic. A TM lookup worker processes each event: hit goes directly to bundle generation, miss routes to the MT worker pool, which calls the external provider with retry and circuit breaker logic. Human review tasks are inserted into Postgres and surfaced via a translator dashboard.

  • Bundle generation runs after all strings in a project reach approved status (or a deadline passes with best-effort output)
  • Bundles are stored in S3 and pushed to a CDN with a cache key of project_id + locale + version
  • Cache invalidation on new bundle: purge CDN edge nodes via API, bump version integer
  • MT provider rate limits are handled with a token bucket per provider; overflow queues to a delayed retry topic
  • Translation memory is replicated to a read replica for fuzzy search queries to avoid write contention
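The per-provider rate limiting mentioned above can be sketched as a token bucket. On overflow the caller does not block; it publishes the segment to the delayed-retry topic:

```python
import time

class TokenBucket:
    """Per-provider token bucket. rate = tokens refilled per second;
    capacity bounds the burst size."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def try_acquire(self, n=1.0):
        """Non-blocking acquire. Returns False on overflow, in which case
        the caller enqueues to the delayed-retry topic."""
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False
```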

API Design

String Management

  • POST /v1/projects/{project_id}/strings/extract — accepts a zip of source files, starts async extraction job, returns job_id
  • GET /v1/projects/{project_id}/strings?locale=STRING&status=STRING — paginated list of strings with translation status per locale

Translation Fetch (Client-Facing)

GET /v1/bundles/{project_id}/{locale}/latest.json — served from CDN, returns flat key-value JSON. ETags enable conditional requests for bandwidth efficiency.
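A client-side sketch of the conditional fetch, assuming a placeholder CDN host (`cdn.example.com` is not from the design). The client sends If-None-Match with its stored ETag and treats 304 as "cached bundle still current":

```python
import urllib.error
import urllib.request

CDN_BASE = "https://cdn.example.com"   # placeholder host

def build_request(project_id, locale, etag=None):
    """Build the GET; attach If-None-Match when we hold a prior ETag."""
    url = f"{CDN_BASE}/v1/bundles/{project_id}/{locale}/latest.json"
    req = urllib.request.Request(url)
    if etag:
        req.add_header("If-None-Match", etag)
    return req

def fetch_bundle(project_id, locale, etag=None):
    """Returns (body, etag); body is None when the server answers
    304 Not Modified and the cached copy can be reused."""
    req = build_request(project_id, locale, etag)
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.read(), resp.headers.get("ETag")
    except urllib.error.HTTPError as e:
        if e.code == 304:
            return None, etag
        raise
```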

Review Queue

  • GET /v1/review-tasks?assignee=ME&locale=STRING — fetch pending review tasks with context
  • PATCH /v1/review-tasks/{task_id} — body: {status, edited_text} — approve or reject an MT segment
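The PATCH endpoint above implies a task state machine over the status enum. A minimal sketch of server-side transition validation; the exact transition table (e.g., whether rejected tasks can be reopened) is an assumption, since the design only names the four states:

```python
# Assumed legal transitions for review_tasks.status.
ALLOWED = {
    "pending": {"in_review"},
    "in_review": {"approved", "rejected"},
    "approved": set(),            # terminal
    "rejected": {"in_review"},    # assumption: rejected tasks can be reworked
}

def apply_status_patch(current_status, new_status):
    """Validate a PATCH /v1/review-tasks/{task_id} status change;
    raise on an illegal transition, otherwise return the new status."""
    if new_status not in ALLOWED.get(current_status, set()):
        raise ValueError(f"illegal transition {current_status} -> {new_status}")
    return new_status
```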

Interview Tips

Key discussion points include the tradeoff between the auto-approval confidence threshold and human review workload: raising the threshold (say from 0.7 to 0.8) catches more MT errors but routes substantially more segments to review. Discuss version management when a source string changes mid-sprint: deprecate the old string_id rather than mutating it so in-flight translations remain valid. For the fuzzy TM lookup, interviewers expect you to know that pg_trgm GIN indexes support similarity queries efficiently and can scale to tens of millions of rows without reaching for Elasticsearch.


