Search Quality Service Low-Level Design: Relevance Evaluation, Click Models, and Automated Testing

Search Quality Service Overview

The Search Quality Service (SQS) provides the measurement and feedback loop that keeps a search system relevant over time. It combines human relevance judgments, click-based behavioral signals, NDCG-family metrics, and automated regression pipelines to detect quality changes before they reach production users at scale.

Requirements

Functional Requirements

  • Collect and store human relevance judgments (perfect, excellent, good, fair, bad) for (query, document) pairs.
  • Compute NDCG, MRR, and ERR metrics over human-judged query sets on demand and on a scheduled basis.
  • Build and update click models (e.g., DBN, PBM) from production click logs to produce unbiased relevance estimates.
  • Run automated regression tests against ranking model candidates before promotion to production.
  • Expose a dashboard API for experiment comparison and metric trend visualization.

Non-Functional Requirements

  • Metric computation over 10,000 queries completes within 5 minutes.
  • Human judgment platform supports 500 concurrent annotators.
  • Regression test suite runs within 30 minutes to fit a CI/CD pipeline gate.

Data Model

The JudgmentRecord table stores: judgment_id UUID, query TEXT, doc_id, url, relevance_grade TINYINT (0–4 scale), annotator_id, created_at TIMESTAMP, and annotation_context JSON (SERP snapshot at judgment time). Multiple annotators judge the same pair; grades are aggregated via majority vote or a MACE model to handle annotator disagreement.
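The majority-vote aggregation step can be sketched as follows; this is a minimal illustration (the MACE path is omitted), and the function name and the tie-break rule of taking the lower grade are assumptions, not part of the design above:

```python
from collections import Counter

def aggregate_grades(grades):
    """Aggregate per-annotator relevance grades (0-4) for one (query, doc) pair.

    Majority vote; ties are broken toward the lower grade so the
    aggregate stays conservative about relevance.
    """
    if not grades:
        raise ValueError("no grades to aggregate")
    counts = Counter(grades)
    best = max(counts.values())
    tied = [g for g, c in counts.items() if c == best]
    return min(tied)
```

A MACE-style model would instead weight each annotator by an estimated competence, which matters when some raters are systematically unreliable.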

The ClickLog table stores: session_id, query_id, doc_id, position INT, clicked BOOL, dwell_ms INT, timestamp. It is partitioned by date and retained for 90 days to feed click model training.

The MetricSnapshot table records computed metric values: snapshot_id, experiment_id, metric_name, value FLOAT, query_set_id, computed_at TIMESTAMP. This powers trend charts and regression alerts.

Core Algorithms

NDCG Computation

For each query in the evaluation set, the ranker produces a result list. The service fetches the human relevance grades for each (query, doc) pair and computes DCG using the standard logarithmic discount. Ideal DCG is computed over the sorted grade list. NDCG is DCG / IDCG. Macro-averaged NDCG@10 across the query set is the primary headline metric.
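The computation above can be sketched in a few lines. Linear gain (the raw grade) is assumed here; the exponential gain 2^g − 1, which rewards highly relevant documents more strongly, is an equally common choice:

```python
import math

def dcg(grades):
    # Standard logarithmic discount: grade_i / log2(i + 2) for 0-based rank i.
    return sum(g / math.log2(i + 2) for i, g in enumerate(grades))

def ndcg_at_k(ranked_grades, k=10):
    """NDCG@k for one query, given the human grades of the ranked results."""
    ideal = sorted(ranked_grades, reverse=True)
    idcg = dcg(ideal[:k])
    if idcg == 0:
        return 0.0  # query with no relevant documents judged
    return dcg(ranked_grades[:k]) / idcg

def macro_ndcg(per_query_grades, k=10):
    """Macro-average NDCG@k across the evaluation query set."""
    scores = [ndcg_at_k(g, k) for g in per_query_grades]
    return sum(scores) / len(scores)
```

A perfectly ordered list scores exactly 1.0; any inversion of unequal grades pulls the score below 1.0.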

Click Model Training

The Dynamic Bayesian Network (DBN) model treats clicks as noisy observations of two latent quantities: attractiveness (the probability a user clicks a document given they examine it) and satisfaction (the probability the user's need is met after a click, so they stop examining further results). Parameters are estimated via EM over the session click log. The resulting examination-corrected per-(query, doc) relevance scores supplement sparse human judgments for long-tail queries.
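The section names both DBN and PBM; the DBN's satisfaction chain makes its EM update lengthy, so below is a sketch of the simpler position-based model (PBM), where a click factorises into per-position examination and per-(query, doc) attractiveness. The function name, the 0.5 initialisation, and the fixed iteration count are illustrative assumptions:

```python
from collections import defaultdict

def train_pbm(impressions, iters=20):
    """EM for a position-based click model (PBM).

    impressions: iterable of (query, doc, position, clicked) tuples.
    Returns (theta, alpha): per-position examination probabilities and
    per-(query, doc) attractiveness, i.e., debiased relevance estimates.
    """
    theta = defaultdict(lambda: 0.5)   # P(examined | position)
    alpha = defaultdict(lambda: 0.5)   # P(attractive | query, doc)
    for _ in range(iters):
        t_num, t_den = defaultdict(float), defaultdict(float)
        a_num, a_den = defaultdict(float), defaultdict(float)
        for q, d, k, clicked in impressions:
            t, a = theta[k], alpha[(q, d)]
            if clicked:
                e_post = a_post = 1.0   # a click implies examination and attraction
            else:
                denom = 1.0 - t * a     # P(no click) under the model
                e_post = t * (1 - a) / denom   # examined but not attracted
                a_post = a * (1 - t) / denom   # attractive but not examined
            t_num[k] += e_post; t_den[k] += 1.0
            a_num[(q, d)] += a_post; a_den[(q, d)] += 1.0
        theta = defaultdict(lambda: 0.5, {k: t_num[k] / t_den[k] for k in t_den})
        alpha = defaultdict(lambda: 0.5, {p: a_num[p] / a_den[p] for p in a_den})
    return theta, alpha
```

On a toy log where one document is always clicked at position 0 and another never clicked at position 1, the model drives the first document's attractiveness toward 1 and the second's toward 0, rather than blaming position alone.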

Automated Regression Testing

Each candidate ranker is evaluated against a frozen held-out query set. A two-sided Wilcoxon signed-rank test checks whether the NDCG difference between candidate and baseline is statistically significant (p < 0.05). The regression gate also enforces: no more than 2% relative NDCG degradation on any query-intent bucket, no increase in zero-result rate, and no increase in spam-result rate. Failure on any gate blocks promotion.
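The significance half of the gate can be sketched as below. In production `scipy.stats.wilcoxon` would normally be used; the pure-Python normal approximation here (reasonable for more than roughly 20 non-zero differences) is illustrative, and the per-bucket, zero-result, and spam gates are omitted:

```python
import math

def wilcoxon_signed_rank_p(deltas):
    """Approximate two-sided Wilcoxon signed-rank p-value.

    deltas: per-query metric differences, candidate minus baseline.
    Zero differences are dropped, as in the standard test.
    """
    d = [x for x in deltas if x != 0.0]
    n = len(d)
    if n == 0:
        return 1.0
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:                        # rank |deltas|, averaging ties
        j = i
        while j + 1 < n and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1      # 1-based; tied values share the mean rank
        for t in range(i, j + 1):
            ranks[order[t]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, x in zip(ranks, d) if x > 0)
    mu = n * (n + 1) / 4
    sigma = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mu) / sigma
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))

def passes_gate(baseline_ndcg, candidate_ndcg, alpha=0.05):
    """Block promotion only on a statistically significant NDCG regression."""
    deltas = [c - b for b, c in zip(baseline_ndcg, candidate_ndcg)]
    mean_delta = sum(deltas) / len(deltas)
    p = wilcoxon_signed_rank_p(deltas)
    return not (p < alpha and mean_delta < 0)
```

The test is paired per query, which is why the frozen held-out set matters: the same queries must be scored under both rankers.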

API Design

  • SubmitJudgment(JudgmentRequest) → JudgmentId — annotator tool calls this to persist a relevance grade.
  • ComputeMetrics(MetricRequest) → MetricReport — triggers NDCG/MRR computation for a given experiment and query set; returns synchronously for sets under 1,000 queries, or a job ID for larger sets.
  • RunRegressionSuite(CandidateSpec) → RegressionResult — invoked by CI/CD; returns pass/fail per gate with metric deltas.
  • GetMetricTrend(ExperimentId, MetricName, TimeRange) → TimeSeriesData — powers dashboard charts.
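The sync-versus-job split in ComputeMetrics reduces to a size check at dispatch time. A minimal sketch, assuming hypothetical `submit_job` and `evaluate` callables standing in for the batch executor and the in-process evaluator:

```python
import uuid

SYNC_LIMIT = 1000  # queries; larger sets go to the batch executor

def compute_metrics(experiment_id, query_set, submit_job, evaluate):
    """Dispatch for ComputeMetrics: synchronous report for small query
    sets, job handle for large ones."""
    if len(query_set) <= SYNC_LIMIT:
        return {"status": "done", "report": evaluate(experiment_id, query_set)}
    job_id = str(uuid.uuid4())
    submit_job(job_id, experiment_id, query_set)
    return {"status": "pending", "job_id": job_id}
```

Callers polling a pending job would resolve the job ID against the MetricSnapshot table once the batch run completes.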

Scalability and Fault Tolerance

Metric computation jobs run on a distributed batch executor (e.g., Spark or a work-queue with parallel workers). Each query is evaluated independently, enabling trivial parallelism; 10,000 queries complete in under 5 minutes on a 50-worker pool. Results are checkpointed so a worker failure resumes rather than restarts.
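The per-query independence and checkpoint-resume behaviour can be sketched with a thread pool; the function name and the checkpoint-as-dict shape are assumptions (a real job would persist the checkpoint to durable storage between runs):

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate_query_set(queries, evaluate_one, workers=50, checkpoint=None):
    """Score each query independently and macro-average the results.

    checkpoint maps query -> score for already-finished work, so a
    restarted job resumes where it left off instead of recomputing.
    """
    done = dict(checkpoint or {})
    todo = [q for q in queries if q not in done]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so results zip back to queries.
        for q, score in zip(todo, pool.map(evaluate_one, todo)):
            done[q] = score
    return sum(done.values()) / len(done)
```

Because no query depends on another, worker count scales the wall-clock time almost linearly until the ranker backend saturates.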

The human judgment platform uses an async task queue. Annotator sessions are isolated; losing one session does not affect others. Judgment writes are idempotent (upsert on judgment_id), so network retries are safe.

Click model training runs nightly as an offline batch job. The trained model artifact is versioned in object storage; the service loads it at startup and hot-swaps on new artifact publication without downtime.

Monitoring

  • Alert on NDCG drops greater than 2% relative in scheduled nightly evaluations against the production ranker.
  • Track annotator agreement rate (Fleiss kappa); alert if it falls below 0.6, indicating ambiguous judgment guidelines.
  • Monitor click model perplexity on held-out sessions as a proxy for log distribution drift.
  • Publish regression suite pass rate per ranker release to an engineering dashboard for release-gate visibility.
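The first alert above reduces to a one-line relative-drop check; the function name is illustrative, and the 2% default mirrors the threshold stated in this section:

```python
def should_alert(baseline_ndcg, nightly_ndcg, rel_threshold=0.02):
    """Fire when the nightly NDCG drops more than rel_threshold
    relative to the production baseline."""
    if baseline_ndcg <= 0:
        return False  # no meaningful baseline to compare against
    return (baseline_ndcg - nightly_ndcg) / baseline_ndcg > rel_threshold
```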

Frequently Asked Questions

How is human relevance judgment collection structured for search quality?

Raters are shown a query and a set of results and asked to assign a relevance grade (e.g., 0–4). Tasks are pooled from production traffic and sampled to cover head, torso, and tail queries. Inter-rater agreement is measured with Cohen's kappa, and rater calibration sessions are run regularly to maintain consistency.

How is NDCG computed as a search quality metric?

NDCG (Normalized Discounted Cumulative Gain) sums the relevance grades of returned documents, discounted by log position, then divides by the ideal DCG achievable for that query. A score of 1.0 means the ranking is perfect. NDCG@10 is the most common variant, measuring quality over the first page of results.

What is a DBN click model and how does it estimate document relevance?

The Dynamic Bayesian Network (DBN) click model treats clicks as noisy signals of relevance. It models the probability that a user examines a position and, if examined, clicks based on true relevance. By fitting the model to click logs across many queries, it produces position-bias-corrected relevance estimates usable for training ranking models.

How does an automated regression gate work in search quality CI/CD?

An automated regression gate runs the candidate ranking model against a held-out labeled query set and computes NDCG and other metrics. If the score drops below a threshold relative to the baseline, the gate blocks the deployment. This prevents regressions from reaching production without a manual override and documented justification.
