Canary Analysis Service Low-Level Design: Metric Comparison, Statistical Tests, and Pass/Fail Verdict

What Is a Canary Analysis Service?

A canary analysis service automates the statistical comparison of a new software version (the canary) against the stable baseline. Rather than relying on engineers to manually read dashboards during a canary deployment, the service continuously collects metrics from both populations, applies statistical tests, and emits a pass or fail verdict that can gate further rollout or trigger automatic rollback.

Requirements

Functional Requirements

  • Accept a canary analysis configuration specifying metric sources, baseline and canary identifiers, and success criteria.
  • Continuously fetch metric samples from a time-series backend (Prometheus, Datadog, or similar).
  • Apply configurable statistical tests to compare canary and baseline distributions.
  • Emit a verdict (PASS, FAIL, INCONCLUSIVE) after a configured observation window.
  • Expose the verdict and per-metric results via API and webhook callback.

Non-Functional Requirements

  • Support concurrent analyses for dozens of simultaneous canary deployments.
  • Complete verdict computation within 5 seconds of the observation window closing.
  • Provide audit logs of every metric sample and intermediate verdict decision.

Data Model

An AnalysisRun record is created when a deployment system initiates a canary check.

  • run_id (UUID)
  • canary_version, baseline_version (strings)
  • metric_configs (JSONB array of metric name, query template, threshold, direction)
  • start_time, end_time, window_minutes
  • status (ENUM: running, pass, fail, inconclusive, cancelled)

Each metric comparison is stored as a MetricResult row:

  • run_id, metric_name
  • canary_samples, baseline_samples (float arrays)
  • p_value, effect_size
  • verdict (pass/fail)
  • evaluated_at
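The two records above can be sketched as Python dataclasses. Field names follow the lists above; the enum values mirror the status ENUM, and the default factories are illustrative:

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from typing import Optional
from uuid import UUID, uuid4

class RunStatus(Enum):
    RUNNING = "running"
    PASS = "pass"
    FAIL = "fail"
    INCONCLUSIVE = "inconclusive"
    CANCELLED = "cancelled"

@dataclass
class MetricConfig:
    name: str
    query_template: str
    threshold: float
    direction: str  # e.g. "increase_is_bad" for latency and error rate

@dataclass
class AnalysisRun:
    canary_version: str
    baseline_version: str
    metric_configs: list[MetricConfig]
    window_minutes: int
    run_id: UUID = field(default_factory=uuid4)
    status: RunStatus = RunStatus.RUNNING
    start_time: Optional[datetime] = None
    end_time: Optional[datetime] = None

@dataclass
class MetricResult:
    run_id: UUID
    metric_name: str
    canary_samples: list[float]
    baseline_samples: list[float]
    p_value: float
    effect_size: float
    verdict: str  # "pass" / "fail"
    evaluated_at: datetime
```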

Core Algorithms

Mann-Whitney U Test

The service uses the Mann-Whitney U (Wilcoxon rank-sum) test as its primary statistical test because it makes no assumption about the underlying distribution — latency and error-rate distributions are rarely normal. The test determines whether samples from the canary population are stochastically greater or less than the baseline population.

  • Combine and rank all samples from both populations.
  • Compute U statistics for both groups: U1 = R1 - n1*(n1+1)/2, where R1 is the sum of ranks for group 1, and U2 = n1*n2 - U1.
  • Convert U to a Z score for large samples (n > 20), then derive a two-tailed p-value.
  • If p-value is below the configured significance level (default 0.05) and the effect is in the failing direction, mark the metric as failed.
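The four steps above can be sketched in Python. This is a minimal version that averages ranks for ties; production code would also apply a tie correction to the variance, or simply call scipy.stats.mannwhitneyu:

```python
import math

def mann_whitney_u(canary, baseline):
    """Rank-sum comparison of two sample sets; returns (U1, two-tailed p).

    Uses the normal approximation to U, reasonable for n > 20 per group.
    The tie correction to the variance is omitted for brevity.
    """
    combined = sorted((value, idx) for idx, value in enumerate(canary + baseline))
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        j = i
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1          # ranks are 1-based; ties share the average
        for k in range(i, j + 1):
            ranks[combined[k][1]] = avg_rank
        i = j + 1
    n1, n2 = len(canary), len(baseline)
    r1 = sum(ranks[:n1])                    # rank sum of the canary group
    u1 = r1 - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2                        # mean of U under the null hypothesis
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - mu) / sigma
    p = math.erfc(abs(z) / math.sqrt(2))    # two-tailed p from the normal CDF
    return u1, p
```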

Effect Size Guard

Statistical significance alone is insufficient: with large sample sizes, even operationally negligible differences become statistically significant. The service therefore also computes a relative difference: (canary_median - baseline_median) / baseline_median. A metric only fails if both the p-value threshold is crossed AND the relative difference exceeds the configured tolerance (e.g. a 5% regression in p99 latency).
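A sketch of the combined gate, assuming a higher-is-worse metric such as latency (the direction flag from metric_configs would generalize this). The minimum-sample guard from the Failure Modes section is folded in:

```python
from statistics import median

MIN_SAMPLES = 10  # below this, the result is INCONCLUSIVE rather than PASS/FAIL

def metric_verdict(canary, baseline, p_value, alpha=0.05, tolerance=0.05):
    """Combine the significance test with the effect-size guard."""
    if len(canary) < MIN_SAMPLES or len(baseline) < MIN_SAMPLES:
        return "INCONCLUSIVE"
    effect = (median(canary) - median(baseline)) / median(baseline)
    # Fail only when the regression is both statistically significant
    # and practically large enough to matter.
    if p_value < alpha and effect > tolerance:
        return "FAIL"
    return "PASS"
```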

Scalability

Each AnalysisRun is handled by a dedicated worker goroutine that polls the metrics backend at a configurable interval (e.g. every 60 seconds). Workers are distributed across a pool of analysis nodes using a consistent hash on run_id, so a single run always lands on the same node for local state accumulation. A coordinator service tracks active runs and reassigns them if a worker node fails, using a heartbeat lease in Redis.
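The run-to-node assignment can be sketched as a hash ring. Virtual nodes smooth the key distribution; the node names and vnode count here are illustrative:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Maps run_ids to analysis nodes; adding or removing a node only
    moves the runs adjacent to it on the ring."""

    def __init__(self, nodes, vnodes=100):
        # Each physical node owns `vnodes` points on the ring.
        self._ring = sorted(
            (self._hash(f"{node}:{i}"), node)
            for node in nodes for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

    def node_for(self, run_id: str) -> str:
        # First ring point clockwise from the run's hash, wrapping around.
        idx = bisect.bisect(self._keys, self._hash(run_id)) % len(self._keys)
        return self._ring[idx][1]
```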

Metric queries are batched where the backend supports it. For Prometheus, a single range query per metric covers the full window, avoiding N+1 query patterns.

API Design

  • POST /analyses — start a new analysis run; returns run_id.
  • GET /analyses/{run_id} — return current status, elapsed time, and per-metric intermediate results.
  • GET /analyses/{run_id}/results — return the final verdict, per-metric verdicts, p-values, and sample summaries.
  • DELETE /analyses/{run_id} — cancel an in-progress run.
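A hypothetical request body for POST /analyses, with fields drawn from the data model above (the query template syntax and callback_url field are illustrative):

```json
{
  "canary_version": "v2.3.1-canary",
  "baseline_version": "v2.3.0",
  "window_minutes": 30,
  "metric_configs": [
    {
      "name": "p99_latency_ms",
      "query_template": "histogram_quantile(0.99, rate(request_duration_bucket{version=\"{{version}}\"}[1m]))",
      "threshold": 0.05,
      "direction": "increase_is_bad"
    }
  ],
  "callback_url": "https://deploy.example.com/hooks/canary"
}
```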

Webhook delivery: when a run reaches a terminal status, the service POSTs a signed JSON payload to the registered callback URL. The payload includes run_id, verdict, failed_metrics array, and a summary of effect sizes. Delivery is retried up to 5 times with exponential backoff.
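The signing and retry loop can be sketched as below. The HMAC scheme, header name, and injected `post` callable (returning True on a 2xx response) are assumptions for illustration:

```python
import hashlib
import hmac
import json
import time

MAX_ATTEMPTS = 5

def sign_payload(secret: bytes, payload: dict) -> tuple[bytes, str]:
    """Canonical JSON body plus an HMAC-SHA256 signature over it."""
    body = json.dumps(payload, sort_keys=True).encode()
    return body, hmac.new(secret, body, hashlib.sha256).hexdigest()

def deliver(post, url, secret, payload, base_delay=1.0, sleep=time.sleep):
    """POST the signed payload, retrying with exponential backoff."""
    body, sig = sign_payload(secret, payload)
    headers = {"Content-Type": "application/json", "X-Canary-Signature": sig}
    for attempt in range(MAX_ATTEMPTS):
        if post(url, body, headers):
            return True
        if attempt < MAX_ATTEMPTS - 1:
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, 8s between attempts
    return False
```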

Failure Modes

  • Metrics backend unavailable: Samples collected so far are retained. The observation window is extended by the outage duration, up to a configured maximum extension.
  • Insufficient samples: If either population has fewer than 10 data points at verdict time, the metric is marked INCONCLUSIVE rather than PASS or FAIL.
  • Worker crash: The lease expires, the coordinator reassigns the run, and the new worker resumes from the last persisted sample set.
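The heartbeat lease can be sketched against the redis-py set(..., nx=True, ex=ttl) API. Key names and TTL are illustrative, and a production renewal would use a Lua script to make the check-and-extend atomic:

```python
LEASE_TTL_SECONDS = 15  # must comfortably exceed the heartbeat interval

def lease_key(run_id: str) -> str:
    return f"canary:lease:{run_id}"

def acquire_lease(redis, run_id: str, node_id: str) -> bool:
    # SET key value NX EX ttl: succeeds only when no live lease exists,
    # so exactly one node owns a run at a time.
    return bool(redis.set(lease_key(run_id), node_id,
                          nx=True, ex=LEASE_TTL_SECONDS))

def renew_lease(redis, run_id: str, node_id: str) -> bool:
    # Heartbeat: only the current holder may extend its own lease.
    # Assumes a client with decode_responses=True so get() returns str.
    # This get/set pair races in theory; real code would use Lua.
    if redis.get(lease_key(run_id)) == node_id:
        return bool(redis.set(lease_key(run_id), node_id,
                              ex=LEASE_TTL_SECONDS))
    return False
```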

Observability

Track active run count, per-run sample ingestion rate, metric query latency, and verdict distribution over time. Alarm when the inconclusive rate exceeds 20% across all runs — this typically indicates metrics backend instability or misconfigured queries.

