Service Dependency Map Low-Level Design: Topology Discovery, Graph Storage, and Change Detection

A service dependency map provides a real-time, queryable graph of which services call which other services, derived automatically from distributed trace data. This design covers topology discovery from spans, graph storage, change detection algorithms, and downstream impact analysis.

Requirements

Functional

  • Derive service-to-service edges automatically from span parent-child relationships.
  • Update the topology graph continuously as new trace data arrives.
  • Detect when edges appear or disappear and emit change events.
  • Answer impact queries: given a service, return all services that transitively depend on it (its upstream callers).
  • Provide a versioned history of topology snapshots for diffing.

Non-Functional

  • Graph update latency under 60 seconds from span ingestion.
  • Support graphs with up to 10,000 service nodes and 100,000 edges.
  • Impact query response under 200 ms.

Data Model

  • ServiceNode: serviceId (hash of service name), serviceName, firstSeenAt, lastSeenAt, attributes (team, tier, language from resource attributes).
  • DependencyEdge: edgeId (hash of source+destination+operation), sourceService, destinationService, operationName, callCount, errorCount, p50LatencyMs, p99LatencyMs, windowStart, windowEnd.
  • TopologySnapshot: snapshotId, capturedAt, nodes (set of service names), edges (set of source-destination pairs), checksum.
  • TopologyChangeEvent: eventId, changeType (EDGE_ADDED, EDGE_REMOVED, NODE_ADDED, NODE_REMOVED), subject, detectedAt, snapshotBefore, snapshotAfter.

Topology Discovery from Traces

For each span with a parentSpanId, look up the parent span. If the parent span has a different serviceName, emit a directed edge from the parent service to the child service, labeled with the child span operation name. This is a caller-to-callee relationship: the parent service called the child service. Aggregate edges over a sliding time window (e.g., one minute) to produce DependencyEdge records with call count and latency statistics.

Handle the case where the parent span is in a different collector shard by joining on the Kafka span stream. A topology builder consumes all spans, maintains a short-lived span cache (TTL matching trace assembly grace period), and performs the parent lookup from cache before emitting edges.
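A minimal sketch of that span cache, assuming a single-process consumer with an injectable clock; the TTL mirrors the trace assembly grace period so late-arriving child spans can still resolve their parent. The dict-shaped span records here are illustrative.

```python
import time

class SpanCache:
    """Short-lived span_id -> service_name cache with lazy TTL eviction."""

    def __init__(self, ttl_seconds=120, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries = {}  # span_id -> (service_name, inserted_at)

    def put(self, span_id, service_name):
        self._entries[span_id] = (service_name, self.clock())

    def get_service(self, span_id):
        entry = self._entries.get(span_id)
        if entry is None:
            return None
        service, inserted_at = entry
        if self.clock() - inserted_at > self.ttl:
            del self._entries[span_id]  # lazily evict the expired entry
            return None
        return service

def process_span(cache, span, emit_edge):
    """Cache the span, then try to resolve its parent from the cache."""
    cache.put(span["spanId"], span["serviceName"])
    parent_service = cache.get_service(span.get("parentSpanId"))
    if parent_service and parent_service != span["serviceName"]:
        emit_edge(parent_service, span["serviceName"], span["operationName"])
```

If the parent has not been seen (or has expired), no edge is emitted for that span; the next window's observation of the same call path will produce the edge.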

Core Algorithms

Graph Storage

Store the live topology in a graph database (Neo4j, Amazon Neptune) or as an adjacency list in Redis. For the adjacency list approach: SADD deps:SOURCE DESTINATION for each edge and SADD rdeps:DESTINATION SOURCE for reverse lookup. Expire keys with a rolling TTL: refresh the TTL on every edge observation and let edges disappear naturally when a service stops being called. Store the full edge metadata (latency, counts) in a separate hash keyed by edgeId.
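An in-memory sketch of that adjacency-list layout, with the equivalent Redis commands noted in comments. One simplification to flag: Redis TTLs apply to whole keys, while this sketch tracks expiry per edge; a Redis deployment would need per-member expiry handled separately (e.g., via the edge-metadata hash timestamps).

```python
import time

class TopologyStore:
    def __init__(self, edge_ttl_seconds=900, clock=time.monotonic):
        self.ttl = edge_ttl_seconds
        self.clock = clock
        self.deps = {}    # SADD deps:SOURCE DESTINATION
        self.rdeps = {}   # SADD rdeps:DESTINATION SOURCE
        self.expiry = {}  # rolling expiry per edge (simplified vs. key TTL)

    def observe_edge(self, source, destination):
        self.deps.setdefault(source, set()).add(destination)
        self.rdeps.setdefault(destination, set()).add(source)
        # Rolling TTL: every observation pushes the expiry forward.
        self.expiry[(source, destination)] = self.clock() + self.ttl

    def dependencies(self, source):
        now = self.clock()
        return {d for d in self.deps.get(source, set())
                if self.expiry.get((source, d), 0) > now}

    def dependents(self, destination):
        now = self.clock()
        return {s for s in self.rdeps.get(destination, set())
                if self.expiry.get((s, destination), 0) > now}
```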

Change Detection

Every five minutes, take a TopologySnapshot by reading the current node and edge sets. Compute a checksum (SHA-256 of sorted edge list). Compare with the previous snapshot checksum. If different, diff the two sets: edges in the new snapshot but not the old are EDGE_ADDED; edges in the old but not the new are EDGE_REMOVED. Apply the same logic to nodes. Emit TopologyChangeEvent records for each change and publish them to a change-events topic for downstream consumers (incident response tools, documentation generators).
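The snapshot diff can be sketched as below. Edges are modeled as (source, destination) pairs; the checksum short-circuits the common no-change case before any set difference is computed.

```python
import hashlib

def checksum(edges):
    """SHA-256 over the sorted edge list, as described above."""
    payload = "\n".join(sorted(f"{s}->{d}" for s, d in edges))
    return hashlib.sha256(payload.encode()).hexdigest()

def diff_snapshots(old_edges, new_edges):
    if checksum(old_edges) == checksum(new_edges):
        return []  # checksums match: skip the set diff entirely
    events = []
    # In the new snapshot but not the old: a freshly observed edge.
    for edge in sorted(new_edges - old_edges):
        events.append({"changeType": "EDGE_ADDED", "subject": edge})
    # In the old snapshot but not the new: the edge has disappeared.
    for edge in sorted(old_edges - new_edges):
        events.append({"changeType": "EDGE_REMOVED", "subject": edge})
    return events
```

The same diff applies unchanged to node sets by swapping in service names for edge pairs.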

Impact Analysis

To find all services that depend on a given service S (its upstream callers), traverse the reverse dependency graph from S using BFS or DFS over rdeps:S, returning the full set of transitive dependents with hop distance. To find all services that S depends on (downstream callees), traverse deps:S. Track visited nodes and bound traversal depth: circular dependencies can legitimately exist with async patterns, and the visited set prevents revisiting nodes while the depth cap limits query cost. Return a ranked list sorted by call volume so operators prioritize the most critical paths first.
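A bounded BFS over the reverse adjacency map (the rdeps sets) might look like this; the visited dict doubles as the hop-distance record, and the depth cap keeps cyclic graphs from inflating the traversal.

```python
from collections import deque

def transitive_dependents(rdeps, service, max_depth=10):
    """Return {dependent: hop_distance} for all upstream callers of service."""
    seen = {service: 0}
    queue = deque([(service, 0)])
    while queue:
        current, depth = queue.popleft()
        if depth >= max_depth:
            continue  # depth cap: stop expanding beyond max_depth hops
        for caller in rdeps.get(current, ()):
            if caller not in seen:  # visited check handles cycles
                seen[caller] = depth + 1
                queue.append((caller, depth + 1))
    seen.pop(service)  # the service is not its own dependent
    return seen
```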

Stale Edge Pruning

An edge is stale if its TTL has expired, meaning no span observation refreshed it within the window. Redis TTL handles this automatically for the adjacency list. For the relational edge table, a cleanup job runs hourly and deletes edges where lastSeenAt < now - staleTtlHours. Before deletion, write an EDGE_REMOVED change event to preserve the history.
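A sketch of that hourly cleanup job, using sqlite3 purely for illustration; the table and column names follow the data model above, and `emit_event` stands in for the change-events publisher.

```python
import sqlite3
import time

def prune_stale_edges(conn, stale_ttl_hours, emit_event):
    """Delete edges unseen for stale_ttl_hours, emitting events first."""
    cutoff = time.time() - stale_ttl_hours * 3600
    rows = conn.execute(
        "SELECT edgeId, sourceService, destinationService "
        "FROM dependency_edge WHERE lastSeenAt < ?", (cutoff,)).fetchall()
    # Emit the change event before deleting so the history is preserved
    # even if the delete is interrupted.
    for edge_id, source, destination in rows:
        emit_event({"changeType": "EDGE_REMOVED",
                    "subject": f"{source}->{destination}"})
    conn.execute("DELETE FROM dependency_edge WHERE lastSeenAt < ?", (cutoff,))
    conn.commit()
    return len(rows)
```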

API Design

  • GET /topology/graph — full current graph as a node-link JSON document.
  • GET /topology/services/{name}/dependencies — direct downstream dependencies with edge stats.
  • GET /topology/services/{name}/dependents — transitive upstream callers up to configurable depth.
  • GET /topology/impact?services=A,B — union of all dependents for a set of services (used during incident scoping).
  • GET /topology/snapshots — paginated history of topology snapshots.
  • GET /topology/changes?since=2h — stream of topology change events within a time range.

Scalability and Observability

  • Partition the span stream by source service name across topology builder workers to reduce contention on the span cache.
  • Cache impact query results (transitive dependents) with a short TTL (e.g., 2 minutes) since they are expensive to compute on large graphs but change slowly.
  • Emit topology_edges_active and topology_change_events_total{type} metrics. A spike in EDGE_REMOVED events often correlates with a deployment or network partition and is worth alerting on.
  • Integrate change events with your deployment system: when a new version deploys, compare the post-deploy topology snapshot with the pre-deploy one to catch unintended dependency introductions.

