Data Quality Service Low-Level Design: Rule Validation, Anomaly Detection, and Lineage Tracking

Data Quality Service Overview

A Data Quality Service (DQS) monitors datasets and data pipelines for accuracy, completeness, consistency, and timeliness. It runs rule-based validation checks, applies statistical anomaly detection to catch subtle drift, computes quality scores for each dataset, and tracks data lineage so that quality failures can be traced back to their root cause at the point of origin.

Requirements

Functional Requirements

  • Register validation rules (null checks, range constraints, uniqueness, referential integrity, regex patterns) against datasets and columns.
  • Execute validation rules on scheduled runs or triggered by pipeline completion events.
  • Detect statistical anomalies: column-level distribution shifts, unexpected row-count changes, and metric value spikes using historical baselines.
  • Compute a composite quality score per dataset run and persist it as a time series.
  • Track data lineage: record which upstream datasets and transformations produced each output dataset.
  • Alert owners when quality scores drop below configurable thresholds or anomalies are detected.

Non-Functional Requirements

  • Validation of a 100 GB table partition completes within 10 minutes using distributed execution.
  • Lineage graph supports at least 10,000 nodes and 50,000 edges with sub-second query latency.
  • Rule definitions are versioned; historical runs reference the rule version that was active at execution time.

Data Model

The ValidationRule table stores: rule_id UUID, dataset_ref, column_ref (nullable for table-level rules), rule_type ENUM (not-null, range, unique, regex, custom-sql), parameters JSON, severity ENUM (warning, error, critical), version INT, and owner_team.

The QualityRunResult table records: run_id UUID, dataset_ref, triggered_by (schedule, pipeline-event, manual), start_time, end_time, rule_results JSON (array of per-rule pass/fail with row counts), quality_score FLOAT, and anomaly_flags JSON.

The LineageEdge table captures: edge_id UUID, source_dataset_ref, target_dataset_ref, transformation_id, transformation_type (SQL, Spark job, dbt model, API), and recorded_at TIMESTAMP. Edges are stored in a graph database (or an adjacency list in a relational store with recursive CTE support) for efficient upstream and downstream traversal.
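The tables above can be sketched as Python dataclasses. Field names follow the text; the concrete types and defaults are assumptions for illustration:

```python
import datetime
import uuid
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional

class RuleType(Enum):
    NOT_NULL = "not-null"
    RANGE = "range"
    UNIQUE = "unique"
    REGEX = "regex"
    CUSTOM_SQL = "custom-sql"

class Severity(Enum):
    WARNING = "warning"
    ERROR = "error"
    CRITICAL = "critical"

@dataclass
class ValidationRule:
    rule_id: uuid.UUID
    dataset_ref: str
    column_ref: Optional[str]  # None for table-level rules
    rule_type: RuleType
    parameters: dict           # e.g. {"min": 0, "max": 100} for a range rule
    severity: Severity
    version: int = 1
    owner_team: str = ""

@dataclass
class LineageEdge:
    edge_id: uuid.UUID
    source_dataset_ref: str
    target_dataset_ref: str
    transformation_id: str
    transformation_type: str   # "SQL" | "Spark job" | "dbt model" | "API"
    recorded_at: datetime.datetime = field(
        default_factory=lambda: datetime.datetime.now(datetime.timezone.utc)
    )
```

The QualityRunResult record is omitted here since its rule_results and anomaly_flags fields are free-form JSON whose shape is defined per deployment.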

Core Algorithms

Rule-Based Validation

Each validation rule is translated to an executable query plan. For SQL-accessible data (warehouses, lake tables), rules are compiled to SQL expressions run as COUNT queries against the target partition. The DQS worker submits these queries in parallel batches, collects counts, and compares them against thresholds to produce pass/fail verdicts. For streaming data, rules are embedded as Flink or Spark Structured Streaming assertions evaluated per micro-batch.
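A minimal sketch of the rule-to-SQL compilation step: each rule compiles to a COUNT query that returns the number of violating rows, so a result of zero means the rule passes. The function name and parameter shape are assumptions, not a fixed API:

```python
def compile_rule_to_sql(rule_type: str, table: str, column: str, params: dict) -> str:
    """Compile a validation rule to a SQL query counting violating rows."""
    if rule_type == "not-null":
        return f"SELECT COUNT(*) FROM {table} WHERE {column} IS NULL"
    if rule_type == "range":
        lo, hi = params["min"], params["max"]
        return (f"SELECT COUNT(*) FROM {table} "
                f"WHERE {column} < {lo} OR {column} > {hi}")
    if rule_type == "unique":
        # Count surplus duplicate rows (each duplicate group contributes cnt - 1).
        return (f"SELECT COALESCE(SUM(cnt - 1), 0) FROM "
                f"(SELECT COUNT(*) AS cnt FROM {table} "
                f"GROUP BY {column} HAVING COUNT(*) > 1) dup")
    raise ValueError(f"unsupported rule type: {rule_type}")
```

In practice the table and column identifiers must be validated or quoted before interpolation to prevent SQL injection from user-supplied rule definitions.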

Statistical Anomaly Detection

For each monitored metric (row count, null rate, mean, standard deviation, percentiles), the service maintains a rolling baseline using a 30-day exponential moving average and standard deviation. A new observation is flagged as anomalous if it deviates more than three standard deviations from the baseline, or if a Z-score test over the last 7 observations exceeds a configurable threshold. Seasonal patterns (daily, weekly) are handled by computing separate baselines per hour-of-day and day-of-week bucket.
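The baseline-plus-threshold check above can be sketched as follows. This uses a standard exponentially weighted mean and variance (alpha approximating a 30-day window via 2/(N+1)) and the three-sigma rule from the text; the class name and seasonal bucketing (one instance per hour-of-day/day-of-week bucket) are illustrative assumptions:

```python
import math

class RollingBaseline:
    """EW mean/variance baseline; flags observations beyond k standard deviations.

    One instance would be kept per (metric, seasonal bucket) pair.
    """
    def __init__(self, alpha: float = 2 / 31, k: float = 3.0):
        self.alpha, self.k = alpha, k
        self.mean = None   # lazily initialized from the first observation
        self.var = 0.0

    def observe(self, x: float) -> bool:
        """Update the baseline with x; return True if x is anomalous."""
        if self.mean is None:
            self.mean = x
            return False
        std = math.sqrt(self.var)
        anomalous = std > 0 and abs(x - self.mean) > self.k * std
        # Standard incremental update of EW mean and variance.
        delta = x - self.mean
        self.mean += self.alpha * delta
        self.var = (1 - self.alpha) * (self.var + self.alpha * delta * delta)
        return anomalous
```

A production version would also floor the standard deviation at a small minimum so that near-constant metrics do not flag trivial fluctuations.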

Quality Score Computation

The quality score for a run is a weighted average of rule pass rates, where weights are derived from severity: critical rules carry weight 1.0, error rules 0.5, and warning rules 0.2. The score is in [0, 100]. A dataset with no critical failures and minor warning-level issues scores above 80, which is the default alerting threshold. Scores are stored as a time series to support trend analysis and SLA reporting.
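The weighted average above, with the severity weights stated in the text, reduces to a few lines. The per-rule result shape (passed_rows/total_rows) is an assumption about how pass rates are reported:

```python
SEVERITY_WEIGHTS = {"critical": 1.0, "error": 0.5, "warning": 0.2}

def quality_score(rule_results: list) -> float:
    """Severity-weighted average of rule pass rates, scaled to [0, 100].

    Each result: {"severity": str, "passed_rows": int, "total_rows": int}.
    """
    num = den = 0.0
    for r in rule_results:
        w = SEVERITY_WEIGHTS[r["severity"]]
        pass_rate = r["passed_rows"] / r["total_rows"] if r["total_rows"] else 1.0
        num += w * pass_rate
        den += w
    return 100.0 * num / den if den else 100.0
```

Note that with this weighting a single fully failing critical rule alongside one fully passing warning rule scores 100 * 0.2 / 1.2 ≈ 16.7, well below the default alerting threshold of 80.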

API Design

  • RegisterRule(ValidationRule) → RuleId — creates or updates a validation rule; increments version on update.
  • TriggerRun(DatasetRef, RunOptions) → RunId — starts a validation run; returns a run ID for polling.
  • GetRunResult(RunId) → QualityRunResult — returns the full result including per-rule verdicts, quality score, and anomaly flags.
  • GetQualityTimeSeries(DatasetRef, TimeRange) → QualityScoreSeries — returns the quality score history for trend visualization.
  • GetLineage(DatasetRef, Direction, Depth) → LineageGraph — returns upstream or downstream lineage up to a specified hop depth.
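The GetLineage traversal can be sketched as a depth-bounded BFS over the edge set, matching the adjacency-list storage option from the data model. The edge representation as (source, target) pairs is a simplification for illustration:

```python
from collections import defaultdict, deque

def get_lineage(edges, dataset_ref: str, direction: str = "upstream", depth: int = 3):
    """Return the set of datasets reachable within `depth` hops.

    edges: iterable of (source_dataset_ref, target_dataset_ref) pairs.
    direction: "upstream" walks back to inputs; "downstream" walks to consumers.
    """
    adj = defaultdict(set)
    for src, dst in edges:
        if direction == "upstream":
            adj[dst].add(src)
        else:
            adj[src].add(dst)
    visited = {dataset_ref}
    frontier = deque([(dataset_ref, 0)])
    result = set()
    while frontier:
        node, d = frontier.popleft()
        if d == depth:
            continue
        for nxt in adj[node]:
            if nxt not in visited:
                visited.add(nxt)
                result.add(nxt)
                frontier.append((nxt, d + 1))
    return result
```

In a relational store the same traversal is expressed as a recursive CTE; a graph database runs it as a native path query.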

Scalability and Fault Tolerance

Validation runs are dispatched as tasks to a distributed worker pool (Celery, Temporal, or similar). Each rule validation task is independent and idempotent; failed tasks are retried up to three times with exponential backoff. The run coordinator aggregates task results as they complete and writes the final QualityRunResult only after all tasks finish or timeout.

For very large tables, the DQS partitions the validation work: each worker validates a partition slice (e.g., one day of data), and results are merged by the coordinator. This keeps individual task duration under 60 seconds regardless of total table size.
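The coordinator's merge step can be sketched as summing per-partition violation and row counts before deriving the final verdict. The dict-based partial-result shape is an assumption:

```python
def merge_partition_results(partials: list) -> dict:
    """Merge per-partition rule results into table-level verdicts.

    partials: list of dicts mapping rule_id -> (violations, rows_checked).
    """
    merged = {}
    for p in partials:
        for rule_id, (viol, rows) in p.items():
            v, r = merged.get(rule_id, (0, 0))
            merged[rule_id] = (v + viol, r + rows)
    return {
        rule_id: {"violations": v, "rows": r, "passed": v == 0}
        for rule_id, (v, r) in merged.items()
    }
```

This works because COUNT-style rule results are additive across disjoint partition slices; rules that are not additive (e.g. global uniqueness) need a dedicated cross-partition pass.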

The lineage graph is updated by pipeline instrumentation libraries that emit lineage events to a Kafka topic. A dedicated consumer writes edges to the graph store. Lineage events are idempotent (upsert on source + target + transformation); duplicate emissions cause no inconsistency.
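The idempotency property described above hinges on keying edges by (source, target, transformation) so that re-emitted events overwrite rather than duplicate. A minimal in-memory sketch (a real deployment would upsert into the graph store):

```python
class LineageStore:
    """Idempotent lineage-edge upserts keyed by (source, target, transformation).

    Re-emitting the same edge is a no-op apart from refreshing recorded_at.
    """
    def __init__(self):
        self._edges = {}

    def upsert(self, source: str, target: str, transformation_id: str,
               recorded_at: str) -> tuple:
        key = (source, target, transformation_id)
        self._edges[key] = recorded_at
        return key

    def edge_count(self) -> int:
        return len(self._edges)
```

Because the Kafka consumer may redeliver events at-least-once, this upsert semantics is what guarantees duplicates cause no inconsistency.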

Monitoring

  • Track quality score trends per dataset; alert when a 7-day rolling average drops more than 5 points.
  • Monitor validation run duration; alert if a run exceeds its expected window by 50%, indicating query performance regression.
  • Track the fraction of critical-severity rule failures per day as a platform-wide data health KPI.
  • Alert on lineage gaps: datasets with no recorded lineage edges may indicate uninstrumented pipelines.

Frequently Asked Questions

How does a rule-based validation engine enforce data quality?

Rules are defined as assertions on datasets (e.g., "column X must not be null", "value must be in set {A, B, C}", "row count must be within 10% of yesterday's"). The engine evaluates each rule on each pipeline run and emits pass/fail results with counts of violating rows. Failures above a severity threshold block downstream consumers or trigger alerts.

How is statistical anomaly detection with Z-scores used in data quality monitoring?

For each metric (row count, null rate, mean value), the system maintains a rolling mean and standard deviation. A Z-score measures how many standard deviations the current observation lies from the historical mean; a Z-score beyond ±3 flags an anomaly. This catches issues that static thresholds miss when the baseline drifts seasonally.

What goes into a data quality score for a dataset?

A quality score is a weighted composite of dimensions: completeness (non-null rate), validity (rule pass rate), freshness (time since last successful load), uniqueness (duplicate rate), and consistency (cross-table referential integrity). Dimension weights are configurable per dataset. The score is displayed in the data catalog and used to gate high-stakes consumers.

What are lineage-linked alerts in a data quality system?

When a quality check fails on an upstream dataset, lineage-linked alerts automatically notify the owners of all downstream datasets and dashboards that depend on it. The alert includes the lineage path so recipients understand the impact without manual investigation. This prevents silent propagation of bad data through a pipeline.
