Data Lineage Service Low-Level Design: Column-Level Tracking, DAG Storage, and Impact Analysis

Data Lineage Service: Overview and Requirements

A data lineage service tracks the origin, transformation, and movement of data across a pipeline — from raw ingestion through every ETL step to final consumption. Interviewers expect you to address column-level granularity, not just table-level, because that is where compliance and debugging questions become real.

Functional Requirements

  • Record column-to-column transformation edges across jobs and datasets.
  • Store upstream and downstream dependency graphs queryable in both directions.
  • Surface impact analysis: given a source column, show every downstream artifact affected by a change.
  • Support multiple compute engines (Spark, dbt, Flink) through a pluggable metadata emitter API.
  • Expose lineage to a UI and to programmatic consumers via REST and GraphQL.

Non-Functional Requirements

  • Ingestion latency under 500 ms for event-driven emitters.
  • Graph queries returning full lineage chains within 2 seconds for graphs up to 10 million edges.
  • At-least-once delivery for lineage events with idempotent writes.
  • Retention of historical runs for audit compliance (typically 7 years).

Data Model

The core abstraction is a directed acyclic graph (DAG) where nodes are dataset columns and edges represent transformation relationships produced by a specific job run.

Node Schema

  • column_id — surrogate UUID.
  • dataset_fqn — fully qualified name: platform.database.schema.table.column.
  • data_type, nullable, description — intrinsic metadata.
  • created_at, updated_at — version timestamps.

Edge Schema

  • edge_id — UUID.
  • source_column_id, target_column_id — foreign keys to nodes.
  • job_run_id — links to the job execution that produced this edge.
  • transformation_sql — optional snippet describing the logic.
  • valid_from, valid_to — temporal validity for lineage versioning.
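The two schemas above can be sketched as Python dataclasses. Field names follow the tables directly; the concrete types and the example FQN are illustrative assumptions.

```python
# Sketch of the node and edge records as dataclasses; types are illustrative.
from dataclasses import dataclass
from datetime import datetime
from typing import Optional
import uuid

@dataclass
class ColumnNode:
    column_id: str                       # surrogate UUID
    dataset_fqn: str                     # platform.database.schema.table.column
    data_type: str
    nullable: bool
    description: Optional[str] = None
    created_at: Optional[datetime] = None
    updated_at: Optional[datetime] = None

@dataclass
class LineageEdge:
    edge_id: str
    source_column_id: str                # FK to ColumnNode
    target_column_id: str                # FK to ColumnNode
    job_run_id: str                      # job execution that produced this edge
    transformation_sql: Optional[str] = None
    valid_from: Optional[datetime] = None
    valid_to: Optional[datetime] = None  # left open while the edge is current

node = ColumnNode(
    column_id=str(uuid.uuid4()),
    dataset_fqn="snowflake.analytics.core.orders.order_total",
    data_type="DECIMAL(18,2)",
    nullable=False,
)
```

Keeping `valid_to` open (NULL) for current edges lets a point-in-time lineage query filter on `valid_from <= t AND (valid_to IS NULL OR valid_to > t)`.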

Store the graph in a native graph database such as Apache Atlas (JanusGraph under the hood) or Neo4j. For teams already on a relational stack, a closure table or adjacency list in PostgreSQL works up to a few million edges before traversal cost dominates.

Core Algorithms

Graph Traversal for Impact Analysis

Both upstream (root cause) and downstream (impact) queries map to BFS or DFS over the edge table. Use iterative BFS to avoid stack overflow on deep graphs. Memoize visited nodes to handle diamond-shaped lineage without revisiting.

  • Upstream query: traverse in-edges recursively to find all sources that feed a target column.
  • Downstream query: traverse out-edges recursively to find all consumers of a source column.
  • Shortest path: used in root-cause analysis to surface the most direct transformation chain.
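A minimal sketch of the downstream traversal, assuming edges are held as an in-memory adjacency map (the real service would page edges out of the graph store). The visited set is what keeps diamond-shaped lineage from being expanded twice; upstream traversal is the same loop over a reversed adjacency map.

```python
from collections import deque

def downstream_impact(adjacency, source, max_depth=None):
    """Iterative BFS over out-edges; the visited set handles diamonds."""
    visited = {source}
    frontier = deque([(source, 0)])
    impacted = []
    while frontier:
        node, depth = frontier.popleft()
        if max_depth is not None and depth >= max_depth:
            continue                      # honor the depth limit from the API
        for neighbor in adjacency.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                impacted.append(neighbor)
                frontier.append((neighbor, depth + 1))
    return impacted

# Diamond lineage: a -> b, a -> c, b -> d, c -> d; d is reported once.
edges = {"a": ["b", "c"], "b": ["d"], "c": ["d"]}
print(downstream_impact(edges, "a"))  # ['b', 'c', 'd']
```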

Cycle Detection

Pipelines should be acyclic, but misconfiguration happens. Run a topological sort (Kahn's algorithm) on ingest; reject any edge batch that would introduce a cycle and return a descriptive error to the emitter.
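The ingest-time check can be sketched as Kahn's algorithm over the merged edge set. In this sketch the existing graph is a plain edge list; the production version would run the same logic against the graph store's edges.

```python
from collections import defaultdict, deque

def would_introduce_cycle(existing_edges, new_edges):
    """Kahn's algorithm: if topological processing cannot consume every
    node, the leftover nodes sit on a cycle and the batch is rejected."""
    adjacency = defaultdict(list)
    in_degree = defaultdict(int)
    nodes = set()
    for src, dst in list(existing_edges) + list(new_edges):
        adjacency[src].append(dst)
        in_degree[dst] += 1
        nodes.update((src, dst))
    queue = deque(n for n in nodes if in_degree[n] == 0)
    processed = 0
    while queue:
        node = queue.popleft()
        processed += 1
        for neighbor in adjacency[node]:
            in_degree[neighbor] -= 1
            if in_degree[neighbor] == 0:
                queue.append(neighbor)
    return processed < len(nodes)

existing = [("a", "b"), ("b", "c")]
print(would_introduce_cycle(existing, [("c", "a")]))  # True: reject the batch
print(would_introduce_cycle(existing, [("a", "c")]))  # False: safe to accept
```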

Scalability Design

Lineage events arrive as Kafka messages published by instrumented job frameworks. A lineage ingestor service consumes the topic, deduplicates by (source_column_id, target_column_id, job_run_id), and upserts edges in batch.

  • Partition Kafka by dataset namespace to preserve ordering within a pipeline.
  • Use a read-through cache (Redis) for the most frequently queried lineage paths — invalidated on any edge write touching those nodes.
  • Materialize impact lists for high-fan-out source columns asynchronously; serve from a pre-computed store for sub-100 ms reads.
  • Archive edges older than the hot retention window (90 days) to columnar storage (Parquet on S3) and answer historical queries via Athena or Trino.
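The ingestor's dedup-and-upsert step can be sketched as follows, with an in-memory dict standing in for the graph store. Because delivery is at-least-once, the same Kafka message may arrive twice; keying on (source_column_id, target_column_id, job_run_id) makes the redelivery a no-op.

```python
def upsert_edge_batch(store, batch):
    """Idempotent batch upsert: redelivered messages hit the same natural
    key and are counted as duplicates rather than written again."""
    accepted, duplicates = 0, 0
    for edge in batch:
        key = (edge["source_column_id"], edge["target_column_id"], edge["job_run_id"])
        if key in store:
            duplicates += 1           # at-least-once redelivery; already persisted
        else:
            store[key] = edge
            accepted += 1
    return {"accepted": accepted, "duplicates": duplicates}

store = {}
batch = [
    {"source_column_id": "c1", "target_column_id": "c2", "job_run_id": "r1"},
    {"source_column_id": "c1", "target_column_id": "c2", "job_run_id": "r1"},  # redelivered
]
print(upsert_edge_batch(store, batch))  # {'accepted': 1, 'duplicates': 1}
```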

API Design

REST Endpoints

  • POST /v1/lineage/events — batch-ingest lineage edges from a job run. Accepts an array of edge objects; returns ingestion receipt with accepted/rejected counts.
  • GET /v1/lineage/upstream?column_fqn={fqn}&depth={n} — returns the upstream dependency subgraph up to n hops.
  • GET /v1/lineage/downstream?column_fqn={fqn}&depth={n} — returns the downstream impact subgraph.
  • GET /v1/lineage/impact-report?column_fqn={fqn} — returns a structured report listing affected dashboards, models, and SLA-tagged datasets.
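A hypothetical request body for POST /v1/lineage/events, showing the shape implied by the edge schema above. The field names and the receipt structure are assumptions for illustration, not a fixed contract.

```python
import json

# Illustrative batch payload; FQNs and run ID are made-up examples.
event_batch = {
    "job_run_id": "run-2024-06-01-0042",
    "edges": [
        {
            "source_column_fqn": "snowflake.raw.sales.orders.amount",
            "target_column_fqn": "snowflake.analytics.core.orders.order_total",
            "transformation_sql": "SUM(amount)",
        }
    ],
}
payload = json.dumps(event_batch)

# Shape of the ingestion receipt the endpoint returns.
receipt = {"accepted": len(event_batch["edges"]), "rejected": 0}
```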

GraphQL

Expose a GraphQL schema so UI consumers can compose custom traversal queries in a single round trip. Use DataLoader to batch node lookups and avoid N+1 fetches during graph hydration.
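The batching idea behind DataLoader can be sketched in a few lines of Python: resolvers register the node IDs they need, and one dispatch fetches them all in a single backend round trip. This is a sketch of the pattern only, not the JavaScript DataLoader API.

```python
class NodeLoader:
    """Collects requested column IDs, then resolves them in one batch call."""
    def __init__(self, batch_fetch):
        self._batch_fetch = batch_fetch   # callable: list of ids -> {id: node}
        self._pending = set()             # set also dedupes repeated requests

    def load(self, column_id):
        self._pending.add(column_id)

    def dispatch(self):
        fetched = self._batch_fetch(sorted(self._pending))
        self._pending.clear()
        return fetched

calls = []
def fetch_nodes(ids):
    calls.append(ids)                     # one backend round trip per dispatch
    return {i: {"column_id": i} for i in ids}

loader = NodeLoader(fetch_nodes)
for cid in ["c1", "c2", "c1", "c3"]:      # resolver requests, with a duplicate
    loader.load(cid)
nodes = loader.dispatch()
print(len(calls), sorted(nodes))  # 1 ['c1', 'c2', 'c3']
```

Without batching, hydrating a 3-node subgraph would issue 3 (or 4, with the duplicate) point lookups; with it, one.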

Observability and Operations

  • Emit a lineage coverage metric: percentage of job runs that successfully published lineage events, broken down by engine type.
  • Alert when ingestor consumer lag on the Kafka topic exceeds 5 minutes, indicating a processing bottleneck.
  • Provide an admin endpoint to reprocess a job run range — needed when an emitter bug produced malformed edges.
  • Track graph growth rate (edges per day) as a capacity planning signal for the graph store.

Frequently Asked Questions

How does column-level lineage tracking differ from table-level lineage?

Column-level lineage tracks transformations at field granularity, recording which source columns contributed to each target column through SQL parsing, logical plan inspection, or runtime instrumentation. This enables precise impact analysis when a column definition changes, rather than flagging an entire table as affected.

What DAG storage model is used for data lineage graphs?

Lineage graphs are stored as directed acyclic graphs (DAGs), typically in a graph database (Neo4j, Amazon Neptune) or an adjacency-list table in a relational store. Nodes represent datasets or columns; edges represent transformation steps with metadata such as job ID, timestamp, and transformation expression.

How do you traverse upstream and downstream impact in a lineage graph?

Upstream traversal follows edges in reverse from a target node to find all ancestor sources; downstream traversal follows edges forward to find all dependent consumers. BFS or DFS with cycle detection is used, and results can be depth-limited to control query cost. Graph databases expose native shortest-path and neighborhood queries for this.

What are the trade-offs between capturing lineage at query time versus ingest time?

Query-time lineage capture (parsing SQL or intercepting the query planner) is comprehensive but adds latency to every query and requires deep integration with the execution engine. Ingest-time capture (logging lineage when a pipeline writes data) is lower overhead and decoupled, but misses ad-hoc queries and requires producers to instrument themselves.

