Query Understanding Service Overview
The Query Understanding Service (QUS) sits between the raw user input and the retrieval layer. It enriches a plain-text query with structured annotations — intent labels, named entities, corrected spelling, and a semantically rewritten form — so that downstream retrieval and ranking components can act on richer signals than keyword tokens alone.
Requirements
Functional Requirements
- Classify query intent into a taxonomy (navigational, informational, transactional, local).
- Extract named entities: people, organizations, locations, products, and domain-specific concepts.
- Detect and correct spelling errors using a noisy-channel model and domain vocabulary.
- Rewrite ambiguous or underspecified queries into canonical forms using a seq2seq model.
- Return the enriched query annotation within 30 ms at p95.
Non-Functional Requirements
- Throughput: 50,000 queries per second, horizontally scalable.
- Model updates deployable without service restart via hot-swap.
- Annotation confidence scores included so callers can apply thresholds.
Data Model
The QueryAnnotation object returned by the service contains:
- raw_query: original user input.
- corrected_query: post-spelling-correction form.
- rewritten_queries[]: one or more semantically equivalent rewrites with confidence scores.
- intent: top intent label and probability distribution over the full taxonomy.
- entities[]: list of {span, entity_type, canonical_id, confidence} tuples.
- language: detected language code (BCP-47).
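The annotation object described above can be sketched as plain dataclasses. Field names follow the document; the concrete types (e.g. rewrites as (string, confidence) pairs) are illustrative assumptions, not a wire format.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Entity:
    span: str                  # surface form in the query
    entity_type: str           # e.g. "location", "product"
    canonical_id: Optional[str]  # knowledge-graph ID, None if unlinked
    confidence: float

@dataclass
class QueryAnnotation:
    raw_query: str             # original user input
    corrected_query: str       # post-spelling-correction form
    rewritten_queries: list    # (rewrite, confidence) pairs
    intent: dict               # label -> probability over the taxonomy
    entities: list             # list of Entity
    language: str              # BCP-47 code, e.g. "en-US"

ann = QueryAnnotation(
    raw_query="cheep flights nyc",
    corrected_query="cheap flights nyc",
    rewritten_queries=[("cheap flights to new york city", 0.91)],
    intent={"transactional": 0.84, "informational": 0.10,
            "navigational": 0.04, "local": 0.02},
    entities=[Entity("nyc", "location", "Q60", 0.97)],
    language="en-US",
)
```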
An offline Query Log table stores raw queries, resulting annotations, and downstream engagement labels for periodic model retraining. Keys: query_id UUID, session_id, timestamp, annotation JSON, click_signal BOOL.
Core Algorithms
Intent Classification
A fine-tuned BERT-class transformer trained on human-labeled query-intent pairs produces a softmax over four top-level intent buckets and up to 200 fine-grained sub-intents. The model is quantized (INT8) and served via ONNX Runtime to meet the 30 ms budget. A lightweight rule layer intercepts high-confidence patterns (e.g., URL-like strings map directly to navigational intent) before the neural model runs.
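The rule layer in front of the neural model can be as simple as a regex check for URL-like strings. A minimal sketch, assuming a hypothetical `model.predict(query) -> (label, probability)` interface for the transformer:

```python
import re

# URL-like strings (with or without scheme) map directly to navigational intent
NAV_URL_PATTERN = re.compile(r"^(https?://)?[\w-]+(\.[\w-]+)+(/\S*)?$")

def classify_intent(query, model=None):
    """Rule layer first; fall back to the neural model."""
    q = query.strip()
    if NAV_URL_PATTERN.match(q):
        return {"label": "navigational", "confidence": 1.0, "source": "rule"}
    if model is not None:
        label, prob = model.predict(q)  # hypothetical model interface
        return {"label": label, "confidence": prob, "source": "model"}
    return {"label": "informational", "confidence": 0.25, "source": "fallback"}
```

Skipping the transformer for these high-confidence patterns keeps them well under the 30 ms budget.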
Named Entity Recognition
A sequence-labeling NER model (BERT fine-tuned with BIO tagging) identifies entity spans. Recognized spans are linked to a knowledge-graph canonical ID via a fuzzy prefix trie over entity surface forms. Ambiguous spans are resolved using surrounding context tokens.
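Decoding the model's BIO tag sequence into entity spans is a small, self-contained step. A sketch of that decoding (the tag set and tokenization are illustrative):

```python
def bio_to_spans(tokens, tags):
    """Convert parallel token/BIO-tag lists into (surface_form, entity_type) spans."""
    spans, current, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:                      # close any open span
                spans.append((" ".join(current), current_type))
            current, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == current_type:
            current.append(token)            # continue the open span
        else:                                # "O" or inconsistent "I-" tag
            if current:
                spans.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:
        spans.append((" ".join(current), current_type))
    return spans
```

Each resulting surface form would then be looked up in the fuzzy prefix trie for canonical-ID linking.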
Spelling Correction
A noisy-channel model scores candidate corrections as P(intended | observed) ∝ P(observed | intended) × P(intended). Edit distance candidates are generated using a weighted Damerau-Levenshtein automaton. The language model prior is a trigram model trained on query logs, compressed with KenLM for sub-millisecond lookup.
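A toy version of the noisy-channel scoring, with unweighted single-edit candidate generation and a unigram prior standing in for the weighted Damerau-Levenshtein automaton and the KenLM trigram model:

```python
import string

def edits1(word):
    """All strings one edit away: deletions, transpositions, substitutions, insertions."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    substitutes = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + transposes + substitutes + inserts)

def correct(observed, prior, channel_prob=0.05):
    """Pick argmax of P(observed|intended) * P(intended).

    The identity candidate gets channel probability 1; any single-edit
    in-vocabulary candidate gets a flat channel probability (a stand-in
    for the weighted edit model described above).
    """
    candidates = {observed} | (edits1(observed) & prior.keys())
    def score(c):
        return (1.0 if c == observed else channel_prob) * prior.get(c, 1e-9)
    return max(candidates, key=score)
```

For example, `correct("cheep", {"cheap": 0.01})` prefers "cheap" because the prior term dominates the identity candidate's near-zero out-of-vocabulary probability.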
Semantic Query Rewriting
A seq2seq T5-small model trained on (query, expanded query) pairs generates alternative phrasings. Rewrites are ranked by a cross-encoder relevance model against the corrected query and the top-three are surfaced with confidence scores. Downstream retrieval may issue parallel searches across original and rewritten forms and merge results.
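The rank-and-truncate step is independent of the models involved. A sketch, using a toy token-overlap scorer as a stand-in for the cross-encoder:

```python
def rank_rewrites(corrected_query, rewrites, scorer, k=3):
    """Score each candidate rewrite against the corrected query and keep the top k."""
    scored = [(r, scorer(corrected_query, r)) for r in rewrites]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

def overlap_scorer(query, rewrite):
    """Toy relevance score: Jaccard overlap of token sets (cross-encoder stand-in)."""
    qs, rs = set(query.split()), set(rewrite.split())
    return len(qs & rs) / len(qs | rs)
```

In production the scorer would be the cross-encoder relevance model, and the (rewrite, score) pairs become the surfaced rewrites with confidence scores.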
API Design
- AnnotateQuery(QueryRequest) → QueryAnnotation — main synchronous call; returns full annotation object.
- BatchAnnotate(BatchQueryRequest) → BatchQueryAnnotation — used by offline evaluation and training pipelines.
- GetModelVersion() → ModelVersionInfo — returns checksums and training metadata for each deployed sub-model; used by monitoring dashboards.
The service is exposed as a gRPC endpoint with a JSON/HTTP transcoding layer for tooling compatibility. Callers pass an optional annotation_mask field to request only a subset of annotations and reduce latency.
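One way the server might resolve an `annotation_mask` into the set of sub-annotators to run (the annotation names here are illustrative, not the actual proto fields):

```python
FULL_ANNOTATIONS = ("intent", "entities", "correction", "rewrites", "language")

def plan_annotations(annotation_mask=None):
    """Resolve which sub-annotations to compute for a request.

    An empty or missing mask means the full annotation; otherwise only the
    requested (and recognized) subset runs, reducing per-request latency.
    """
    if not annotation_mask:
        return list(FULL_ANNOTATIONS)
    requested = set(annotation_mask)
    return [a for a in FULL_ANNOTATIONS if a in requested]
```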
Scalability and Fault Tolerance
Each QUS instance is stateless; horizontal scaling is achieved by adding replicas behind a load balancer. Models are loaded once at startup into shared memory, keeping per-request heap allocations near zero. A model registry service (backed by object storage) publishes new model artifacts; instances poll for updates every 60 seconds and perform a non-blocking hot-swap. If the swap fails, the previous model remains active and an alert fires.
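The hot-swap mechanics reduce to an atomic model reference that is only replaced after the new artifact loads successfully. A minimal sketch (the loader callable and version tokens are assumptions):

```python
import threading

class ModelHolder:
    """Atomic model reference enabling non-blocking hot-swap.

    A background poller would call try_swap() every 60 s with the newest
    version from the registry; request threads always see a fully loaded model.
    """
    def __init__(self, model, version):
        self._lock = threading.Lock()
        self._model, self._version = model, version

    def get(self):
        with self._lock:
            return self._model, self._version

    def try_swap(self, loader, new_version):
        with self._lock:
            if new_version == self._version:
                return False               # already current; nothing to do
        try:
            new_model = loader(new_version)  # load outside the lock
        except Exception:
            return False                   # swap failed: previous model stays active
        with self._lock:
            self._model, self._version = new_model, new_version
        return True
```

Because loading happens outside the lock, in-flight requests keep serving from the old model until the swap completes, matching the "non-blocking hot-swap" behavior above.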
Circuit breakers protect each sub-component. If the NER model exceeds its latency budget, the service returns a partial annotation (intent and correction only) rather than timing out the full request. SLO: full annotation p95 ≤ 30 ms, partial annotation p99 ≤ 15 ms.
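The partial-annotation fallback can be sketched as a budgeted pipeline: stages run in order, and once the budget is spent (or a stage fails) the caller gets whatever completed, flagged as partial. Stage names and the budget value here mirror the text but the interface is an assumption:

```python
import time

def annotate_with_budget(query, stages, budget_ms=30.0):
    """Run annotation stages in order; stop early if the latency budget is spent.

    `stages` maps stage name -> callable(query). A slow or failing stage does
    not fail the whole request; the result carries a 'partial' flag instead.
    """
    result, start = {}, time.monotonic()
    for name, fn in stages.items():
        elapsed_ms = (time.monotonic() - start) * 1000
        if elapsed_ms > budget_ms:
            result["partial"] = True      # budget spent: skip remaining stages
            break
        try:
            result[name] = fn(query)
        except Exception:
            result["partial"] = True      # stage tripped its breaker
    result.setdefault("partial", False)
    return result
```

Ordering intent and correction before NER and rewriting ensures the partial response still contains the annotations named in the degraded SLO.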
Monitoring
- Track intent distribution drift over 1-hour rolling windows; alert if any bucket shifts by more than 10% relative.
- Monitor correction acceptance rate (fraction of corrected queries that receive user engagement) as a proxy for correction quality.
- Log annotation confidence histograms to detect model degradation between retraining cycles.
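The first monitoring rule above, comparing the current window's intent distribution against a baseline and flagging buckets with more than a 10% relative shift, can be sketched as:

```python
def intent_drift_alert(baseline, current, threshold=0.10):
    """Return intent buckets whose share shifted more than `threshold`
    relative to the baseline distribution (e.g. the prior 1-hour window)."""
    alerts = []
    for bucket, base_share in baseline.items():
        cur_share = current.get(bucket, 0.0)
        if base_share > 0 and abs(cur_share - base_share) / base_share > threshold:
            alerts.append(bucket)
    return alerts
```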