Query Understanding Service Overview
The Query Understanding Service (QUS) sits between the raw user input and the retrieval layer. It enriches a plain-text query with structured annotations — intent labels, named entities, corrected spelling, and a semantically rewritten form — so that downstream retrieval and ranking components can act on richer signals than keyword tokens alone.
Requirements
Functional Requirements
- Classify query intent into a taxonomy (navigational, informational, transactional, local).
- Extract named entities: people, organizations, locations, products, and domain-specific concepts.
- Detect and correct spelling errors using a noisy-channel model and domain vocabulary.
- Rewrite ambiguous or underspecified queries into canonical forms using a seq2seq model.
- Return the enriched query annotation within 30 ms at p95.
Non-Functional Requirements
- Throughput: 50,000 queries per second, horizontally scalable.
- Model updates deployable without service restart via hot-swap.
- Annotation confidence scores included so callers can apply thresholds.
Data Model
The QueryAnnotation object returned by the service contains:
- raw_query: original user input.
- corrected_query: post-spelling-correction form.
- rewritten_queries[]: one or more semantically equivalent rewrites with confidence scores.
- intent: top intent label and probability distribution over the full taxonomy.
- entities[]: list of {span, entity_type, canonical_id, confidence} tuples.
- language: detected language code (BCP-47).
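The annotation object described above can be sketched as plain dataclasses. Field names follow the document; the concrete types (e.g. rewrites as (string, confidence) pairs) are illustrative assumptions, not a wire format.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Entity:
    span: str                  # surface form in the query
    entity_type: str           # e.g. "location", "product"
    canonical_id: Optional[str]  # knowledge-graph ID, None if unlinked
    confidence: float

@dataclass
class QueryAnnotation:
    raw_query: str             # original user input
    corrected_query: str       # post-spelling-correction form
    rewritten_queries: list    # (rewrite, confidence) pairs
    intent: dict               # label -> probability over the taxonomy
    entities: list             # list of Entity
    language: str              # BCP-47 code, e.g. "en-US"

ann = QueryAnnotation(
    raw_query="cheep flights nyc",
    corrected_query="cheap flights nyc",
    rewritten_queries=[("cheap flights to new york city", 0.91)],
    intent={"transactional": 0.84, "informational": 0.10,
            "navigational": 0.04, "local": 0.02},
    entities=[Entity("nyc", "location", "Q60", 0.97)],
    language="en-US",
)
```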
An offline Query Log table stores raw queries, resulting annotations, and downstream engagement labels for periodic model retraining. Keys: query_id UUID, session_id, timestamp, annotation JSON, click_signal BOOL.
Core Algorithms
Intent Classification
A fine-tuned BERT-class transformer trained on human-labeled query-intent pairs produces a softmax over four top-level intent buckets and up to 200 fine-grained sub-intents. The model is quantized (INT8) and served via ONNX Runtime to meet the 30 ms budget. A lightweight rule layer intercepts high-confidence patterns (e.g., URL-like strings map directly to navigational intent) before the neural model runs.
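The rule layer in front of the neural model can be as simple as a regex check for URL-like strings. A minimal sketch, assuming a hypothetical `model.predict(query) -> (label, probability)` interface for the transformer:

```python
import re

# URL-like strings (with or without scheme) map directly to navigational intent
NAV_URL_PATTERN = re.compile(r"^(https?://)?[\w-]+(\.[\w-]+)+(/\S*)?$")

def classify_intent(query, model=None):
    """Rule layer first; fall back to the neural model."""
    q = query.strip()
    if NAV_URL_PATTERN.match(q):
        return {"label": "navigational", "confidence": 1.0, "source": "rule"}
    if model is not None:
        label, prob = model.predict(q)  # hypothetical model interface
        return {"label": label, "confidence": prob, "source": "model"}
    return {"label": "informational", "confidence": 0.25, "source": "fallback"}
```

Skipping the transformer for these high-confidence patterns keeps them well under the 30 ms budget.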
Named Entity Recognition
A sequence-labeling NER model (BERT fine-tuned with BIO tagging) identifies entity spans. Recognized spans are linked to a knowledge-graph canonical ID via a fuzzy prefix trie over entity surface forms. Ambiguous spans are resolved using surrounding context tokens.
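Decoding the model's BIO tag sequence into entity spans is a small, self-contained step. A sketch of that decoding (the tag set and tokenization are illustrative):

```python
def bio_to_spans(tokens, tags):
    """Convert parallel token/BIO-tag lists into (surface_form, entity_type) spans."""
    spans, current, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:                      # close any open span
                spans.append((" ".join(current), current_type))
            current, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == current_type:
            current.append(token)            # continue the open span
        else:                                # "O" or inconsistent "I-" tag
            if current:
                spans.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:
        spans.append((" ".join(current), current_type))
    return spans
```

Each resulting surface form would then be looked up in the fuzzy prefix trie for canonical-ID linking.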
Spelling Correction
A noisy-channel model scores candidate corrections as P(intended | observed) ∝ P(observed | intended) × P(intended). Edit distance candidates are generated using a weighted Damerau-Levenshtein automaton. The language model prior is a trigram model trained on query logs, compressed with KenLM for sub-millisecond lookup.
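A toy version of the noisy-channel scoring, with unweighted single-edit candidate generation and a unigram prior standing in for the weighted Damerau-Levenshtein automaton and the KenLM trigram model:

```python
import string

def edits1(word):
    """All strings one edit away: deletions, transpositions, substitutions, insertions."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    substitutes = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + transposes + substitutes + inserts)

def correct(observed, prior, channel_prob=0.05):
    """Pick argmax of P(observed|intended) * P(intended).

    The identity candidate gets channel probability 1; any single-edit
    in-vocabulary candidate gets a flat channel probability (a stand-in
    for the weighted edit model described above).
    """
    candidates = {observed} | (edits1(observed) & prior.keys())
    def score(c):
        return (1.0 if c == observed else channel_prob) * prior.get(c, 1e-9)
    return max(candidates, key=score)
```

For example, `correct("cheep", {"cheap": 0.01})` prefers "cheap" because the prior term dominates the identity candidate's near-zero out-of-vocabulary probability.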
Semantic Query Rewriting
A seq2seq T5-small model trained on (query, expanded query) pairs generates alternative phrasings. Rewrites are ranked by a cross-encoder relevance model against the corrected query and the top-three are surfaced with confidence scores. Downstream retrieval may issue parallel searches across original and rewritten forms and merge results.
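The rank-and-truncate step is independent of the models involved. A sketch, using a toy token-overlap scorer as a stand-in for the cross-encoder:

```python
def rank_rewrites(corrected_query, rewrites, scorer, k=3):
    """Score each candidate rewrite against the corrected query and keep the top k."""
    scored = [(r, scorer(corrected_query, r)) for r in rewrites]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

def overlap_scorer(query, rewrite):
    """Toy relevance score: Jaccard overlap of token sets (cross-encoder stand-in)."""
    qs, rs = set(query.split()), set(rewrite.split())
    return len(qs & rs) / len(qs | rs)
```

In production the scorer would be the cross-encoder relevance model, and the (rewrite, score) pairs become the surfaced rewrites with confidence scores.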
API Design
- AnnotateQuery(QueryRequest) → QueryAnnotation — main synchronous call; returns full annotation object.
- BatchAnnotate(BatchQueryRequest) → BatchQueryAnnotation — used by offline evaluation and training pipelines.
- GetModelVersion() → ModelVersionInfo — returns checksums and training metadata for each deployed sub-model; used by monitoring dashboards.
The service is exposed as a gRPC endpoint with a JSON/HTTP transcoding layer for tooling compatibility. Callers pass an optional annotation_mask field to request only a subset of annotations and reduce latency.
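One way the server might resolve an `annotation_mask` into the set of sub-annotators to run (the annotation names here are illustrative, not the actual proto fields):

```python
FULL_ANNOTATIONS = ("intent", "entities", "correction", "rewrites", "language")

def plan_annotations(annotation_mask=None):
    """Resolve which sub-annotations to compute for a request.

    An empty or missing mask means the full annotation; otherwise only the
    requested (and recognized) subset runs, reducing per-request latency.
    """
    if not annotation_mask:
        return list(FULL_ANNOTATIONS)
    requested = set(annotation_mask)
    return [a for a in FULL_ANNOTATIONS if a in requested]
```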
Scalability and Fault Tolerance
Each QUS instance is stateless; horizontal scaling is achieved by adding replicas behind a load balancer. Models are loaded once at startup into shared memory, keeping per-request heap allocations near zero. A model registry service (backed by object storage) publishes new model artifacts; instances poll for updates every 60 seconds and perform a non-blocking hot-swap. If the swap fails, the previous model remains active and an alert fires.
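The hot-swap mechanics reduce to an atomic model reference that is only replaced after the new artifact loads successfully. A minimal sketch (the loader callable and version tokens are assumptions):

```python
import threading

class ModelHolder:
    """Atomic model reference enabling non-blocking hot-swap.

    A background poller would call try_swap() every 60 s with the newest
    version from the registry; request threads always see a fully loaded model.
    """
    def __init__(self, model, version):
        self._lock = threading.Lock()
        self._model, self._version = model, version

    def get(self):
        with self._lock:
            return self._model, self._version

    def try_swap(self, loader, new_version):
        with self._lock:
            if new_version == self._version:
                return False               # already current; nothing to do
        try:
            new_model = loader(new_version)  # load outside the lock
        except Exception:
            return False                   # swap failed: previous model stays active
        with self._lock:
            self._model, self._version = new_model, new_version
        return True
```

Because loading happens outside the lock, in-flight requests keep serving from the old model until the swap completes, matching the "non-blocking hot-swap" behavior above.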
Circuit breakers protect each sub-component. If the NER model exceeds its latency budget, the service returns a partial annotation (intent and correction only) rather than timing out the full request. SLO: full annotation p95 ≤ 30 ms, partial annotation p99 ≤ 15 ms.
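The partial-annotation fallback can be sketched as a budgeted pipeline: stages run in order, and once the budget is spent (or a stage fails) the caller gets whatever completed, flagged as partial. Stage names and the budget value here mirror the text but the interface is an assumption:

```python
import time

def annotate_with_budget(query, stages, budget_ms=30.0):
    """Run annotation stages in order; stop early if the latency budget is spent.

    `stages` maps stage name -> callable(query). A slow or failing stage does
    not fail the whole request; the result carries a 'partial' flag instead.
    """
    result, start = {}, time.monotonic()
    for name, fn in stages.items():
        elapsed_ms = (time.monotonic() - start) * 1000
        if elapsed_ms > budget_ms:
            result["partial"] = True      # budget spent: skip remaining stages
            break
        try:
            result[name] = fn(query)
        except Exception:
            result["partial"] = True      # stage tripped its breaker
    result.setdefault("partial", False)
    return result
```

Ordering intent and correction before NER and rewriting ensures the partial response still contains the annotations named in the degraded SLO.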
Monitoring
- Track intent distribution drift over 1-hour rolling windows; alert if any bucket shifts by more than 10% relative.
- Monitor correction acceptance rate (fraction of corrected queries that receive user engagement) as a proxy for correction quality.
- Log annotation confidence histograms to detect model degradation between retraining cycles.
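The first monitoring rule above, comparing the current window's intent distribution against a baseline and flagging buckets with more than a 10% relative shift, can be sketched as:

```python
def intent_drift_alert(baseline, current, threshold=0.10):
    """Return intent buckets whose share shifted more than `threshold`
    relative to the baseline distribution (e.g. the prior 1-hour window)."""
    alerts = []
    for bucket, base_share in baseline.items():
        cur_share = current.get(bucket, 0.0)
        if base_share > 0 and abs(cur_share - base_share) / base_share > threshold:
            alerts.append(bucket)
    return alerts
```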