Low Level Design: Internal Service Catalog

What Is an Internal Service Catalog?

An internal service catalog is a centralized registry where every service in your infrastructure self-describes: what it does, who owns it, what it depends on, what SLAs it offers, and whether it is healthy right now. It is the foundation for incident response, capacity planning, onboarding, and compliance audits. In a large org with hundreds of microservices this system becomes as important as DNS.

Requirements

Functional

  • Services can register themselves or be registered manually with a structured metadata payload.
  • Each service entry stores: name, team owner, tier (critical/standard/best-effort), language/runtime, repo URL, documentation URL, API spec URL, SLA targets, dependencies, and tags.
  • Dependency graph: directed edges from consumer to provider, stored and queryable in both directions.
  • Health status aggregation pulled from monitoring systems and surfaced per service.
  • Full-text and faceted search over all service metadata.
  • Changelog / audit trail for every mutation.

Non-Functional

  • Read-heavy workload: hundreds of reads per second from dashboards, CI pipelines, and on-call tooling; writes are infrequent.
  • Eventual consistency is acceptable for search index; metadata store must be strongly consistent.
  • P99 read latency < 100 ms.
  • High availability: catalog must stay readable even if the write path is degraded.

Core Data Model

services
  id            UUID PK
  slug          VARCHAR(128) UNIQUE NOT NULL   -- machine-friendly name, e.g. payment-service
  display_name  VARCHAR(256)
  description   TEXT
  team_id       UUID FK -> teams.id
  tier          ENUM('critical','standard','best_effort')
  status        ENUM('active','deprecated','decommissioned')
  repo_url      TEXT
  docs_url      TEXT
  api_spec_url  TEXT
  sla_uptime    DECIMAL(5,2)   -- e.g. 99.95
  sla_p99_ms    INT
  created_at    TIMESTAMPTZ
  updated_at    TIMESTAMPTZ

service_tags
  service_id    UUID FK
  tag           VARCHAR(64)
  PRIMARY KEY (service_id, tag)

dependencies
  consumer_id   UUID FK -> services.id
  provider_id   UUID FK -> services.id
  dep_type      ENUM('sync','async','data')
  criticality   ENUM('hard','soft')
  PRIMARY KEY (consumer_id, provider_id)

teams
  id            UUID PK
  name          VARCHAR(128)
  slack_channel VARCHAR(128)
  oncall_id     VARCHAR(64)   -- PagerDuty / OpsGenie schedule ID

health_snapshots
  id            UUID PK
  service_id    UUID FK
  source        VARCHAR(64)   -- e.g. datadog, prometheus
  status        ENUM('healthy','degraded','down','unknown')
  checked_at    TIMESTAMPTZ
  details       JSONB

audit_log
  id            UUID PK
  service_id    UUID FK
  actor         VARCHAR(256)
  action        VARCHAR(64)
  diff          JSONB
  created_at    TIMESTAMPTZ

System Architecture

Components

  • Catalog API: REST + GraphQL gateway. REST for CRUD from CI/CD and CLIs. GraphQL for frontend dashboards that need nested dependency graph queries in a single round-trip.
  • Metadata Store: PostgreSQL. Handles the authoritative record. Write-ahead log used as event stream for downstream consumers.
  • Search Index: Elasticsearch (or OpenSearch). Documents contain denormalized service + team data. Updated via Debezium CDC from Postgres WAL. Supports full-text search on name/description/tags and faceted filters on tier, team, status.
  • Health Aggregator: Background worker polling monitoring APIs (Datadog, Prometheus Alertmanager) on a 30-second interval. Writes into health_snapshots. Latest snapshot per service is cached in Redis with a 60-second TTL.
  • Dependency Graph Engine: PostgreSQL recursive CTEs for small graphs. For large organizations (> 5 000 services), offload to a graph database (Neo4j or Amazon Neptune). Provides: upstream dependencies, downstream dependents, critical path, cycle detection.
  • Notification Service: Publishes events (service registered, ownership changed, dependency added, health degraded) to a Kafka topic. Other teams subscribe for their own workflows.
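The events the Notification Service publishes need a stable envelope that consumers can rely on. A minimal sketch of such an envelope follows; the field names here are illustrative, not a contract fixed by this design:

```python
import time
import uuid

def build_event(event_type, service_slug, payload):
    """Wrap a catalog change in the envelope published to the Kafka topic."""
    return {
        "event_id": str(uuid.uuid4()),   # unique ID consumers can use for dedupe
        "type": event_type,              # e.g. "service.registered", "health.degraded"
        "service": service_slug,
        "occurred_at": int(time.time()),
        "payload": payload,              # event-specific details
    }
```

Keying the Kafka message on the service slug keeps all events for one service in a single partition, so consumers see them in order.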

Request Flow: Service Lookup

  1. Client calls GET /v1/services/{slug}.
  2. API layer checks Redis cache (key: svc:slug:{slug}, TTL 5 min).
  3. Cache miss: read from Postgres replica. Write result to Redis. Return to client.
  4. Health status is merged from Redis health cache before returning response.
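The lookup flow above can be sketched in Python. A plain dict stands in for Redis and `db_lookup` for the replica query; the key names follow the design, everything else is illustrative:

```python
import time

CACHE_TTL_S = 300  # 5-minute metadata TTL from the design

def get_service(slug, cache, db_lookup, health_cache):
    """Cache-aside lookup (steps 2-4): check the metadata cache, fall
    back to a read replica on miss, then merge the cached health status."""
    key = f"svc:slug:{slug}"
    entry = cache.get(key)
    if entry is not None and entry["expires_at"] > time.time():
        record = entry["value"]                  # cache hit
    else:
        record = db_lookup(slug)                 # replica read on miss
        if record is None:
            return None
        cache[key] = {"value": record, "expires_at": time.time() + CACHE_TTL_S}
    # Health comes from its own short-lived cache, never from the DB row
    health = health_cache.get(f"svc:health:{record['id']}", "unknown")
    return {**record, "health": health}
```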

Request Flow: Search

  1. Client calls GET /v1/services?q=payment&tier=critical&team=platform.
  2. API translates params to an Elasticsearch bool query with must (full-text on name/description/tags) and filter (tier, team slug, status).
  3. Results include service slug, display name, team name, tier, current health — all stored in the ES document to avoid secondary lookups.
  4. ES returns ranked results; API paginates with search_after cursor (not offset) for stable deep pagination.
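A sketch of the parameter-to-query translation in step 2, assuming a multi_match over name/description/tags and a slug.keyword tiebreaker for search_after (the field boost and sort keys are illustrative choices, not fixed by the design):

```python
def build_search_query(q=None, tier=None, team=None, status=None,
                       cursor=None, size=20):
    """Translate catalog search params into an Elasticsearch bool query.
    Full-text terms go in `must`; exact facets go in `filter`, which is
    cacheable and does not affect relevance scoring."""
    must, filters = [], []
    if q:
        must.append({"multi_match": {
            "query": q,
            "fields": ["name^3", "description", "tags"],  # boost name matches
        }})
    for field, value in (("tier", tier), ("team_slug", team), ("status", status)):
        if value:
            filters.append({"term": {field: value}})
    body = {
        "query": {"bool": {"must": must, "filter": filters}},
        "size": size,
        # Deterministic sort enables stable search_after pagination
        "sort": [{"_score": "desc"}, {"slug.keyword": "asc"}],
    }
    if cursor:
        body["search_after"] = cursor  # sort values of the last hit seen
    return body
```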

Metadata Schema Design

Strict schema validation on write is enforced via JSON Schema (draft-07). The canonical schema lives in the catalog repo and is versioned. Services submit a catalog-info.yaml in their own repo root; a CI hook validates and upserts on every merge to main.
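In practice the CI hook would validate against the canonical draft-07 schema with a JSON Schema library; the sketch below hand-rolls the same required-field and enum checks with only the stdlib, so the shape of the validation is visible:

```python
# Illustrative subset of the canonical schema's constraints
REQUIRED_TOP = ("apiVersion", "kind", "metadata", "spec")
REQUIRED_METADATA = ("name", "team", "tier")
VALID_TIERS = {"critical", "standard", "best_effort"}

def validate_manifest(doc):
    """Pre-flight checks for a parsed catalog-info.yaml.
    Returns a list of error strings; an empty list means valid."""
    errors = []
    for field in REQUIRED_TOP:
        if field not in doc:
            errors.append(f"missing top-level field: {field}")
    meta = doc.get("metadata", {})
    for field in REQUIRED_METADATA:
        if field not in meta:
            errors.append(f"missing metadata field: {field}")
    if "tier" in meta and meta["tier"] not in VALID_TIERS:
        errors.append(f"invalid tier: {meta['tier']!r}")
    return errors
```

The same checks back the POST /v1/validate endpoint, so a developer can lint a manifest locally before merging.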

catalog-info.yaml structure:
  apiVersion: catalog/v1
  kind: Service
  metadata:
    name: payment-service
    team: payments-platform
    tier: critical
    tags: [payments, pci, checkout]
  spec:
    description: Handles payment authorization and capture.
    repo: https://github.com/acme/payment-service
    docs: https://docs.internal/payment-service
    apiSpec: https://apis.internal/payment-service/openapi.json
    sla:
      uptime: 99.95
      p99Latency: 80
    dependencies:
      - name: fraud-service
        type: sync
        criticality: hard
      - name: ledger-service
        type: async
        criticality: soft

Dependency Graph: Queries

PostgreSQL recursive CTE for all upstream dependencies of a service:

WITH RECURSIVE upstream AS (
  SELECT provider_id, dep_type, criticality, 1 AS depth
  FROM dependencies
  WHERE consumer_id = <target_service_id>
  UNION ALL
  SELECT d.provider_id, d.dep_type, d.criticality, u.depth + 1
  FROM dependencies d
  JOIN upstream u ON d.consumer_id = u.provider_id
  WHERE u.depth < 10  -- guard against cycles
)
SELECT s.slug, s.display_name, u.dep_type, u.criticality, u.depth
FROM upstream u
JOIN services s ON s.id = u.provider_id;

Cycle detection runs at write time: a dependency insert that would close a cycle is rejected. A nightly DFS job also sweeps the full graph to catch cycles introduced out of band and fires an alert on any it finds.
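The write-blocking check reduces to a reachability test: inserting the edge consumer -> provider closes a cycle exactly when provider can already reach consumer. A minimal sketch over an in-memory adjacency map (production would run the same traversal against the dependencies table or graph store):

```python
from collections import deque

def would_create_cycle(edges, consumer, provider):
    """Return True if adding edge consumer -> provider would close a
    cycle. `edges` maps consumer_id -> iterable of provider_ids."""
    if consumer == provider:
        return True  # self-dependency is a trivial cycle
    seen, queue = {provider}, deque([provider])
    while queue:  # BFS from the proposed provider
        node = queue.popleft()
        if node == consumer:
            return True  # provider already reaches consumer
        for nxt in edges.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False
```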

Health Status Aggregation

The Health Aggregator runs a pool of workers, one per monitoring source. Each worker:

  1. Fetches current alert/status data from the external system (Datadog monitor states, Prometheus Alertmanager groups).
  2. Maps each alert to a service_id via a mapping table (alert_name or label selector -> service slug).
  3. Derives status: if any CRITICAL alert is firing -> down; if any WARNING -> degraded; else healthy.
  4. Upserts into health_snapshots and updates Redis key svc:health:{service_id}.

The catalog API never calls monitoring systems directly. It reads only from the Redis health cache. This keeps catalog API latency independent of monitoring system latency.
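Step 3 and a worst-signal-wins merge across sources might look like this (the alert dict shape and function names are illustrative):

```python
RANK = {"unknown": 0, "healthy": 0, "degraded": 1, "down": 2}

def derive_status(alerts):
    """Step 3: any firing CRITICAL alert -> down, any firing
    WARNING -> degraded, otherwise healthy."""
    levels = {a["level"] for a in alerts if a.get("firing")}
    if "CRITICAL" in levels:
        return "down"
    if "WARNING" in levels:
        return "degraded"
    return "healthy"

def aggregate(per_source_statuses):
    """Worst-signal-wins roll-up when multiple sources report
    on the same service."""
    if not per_source_statuses:
        return "unknown"
    return max(per_source_statuses, key=RANK.__getitem__)
```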

Team Ownership and Access Control

  • Each service has exactly one owning team. Ownership changes are logged in audit_log.
  • Write access is restricted: only the owning team (matched via SSO group) or platform-admin can mutate a service record.
  • Read access is open to all internal users.
  • SCIM sync keeps the teams table up to date with the IdP. When a team is dissolved, its services are flagged status=deprecated and an alert fires.

Search and Discovery

The Elasticsearch index mapping includes:

  • name and description as text with English analyzer for stemming.
  • tags as keyword array for exact-match filtering.
  • tier, status, team_slug as keyword for facets.
  • health_status as keyword, denormalized from the Redis health cache by the CDC update worker.

Suggest endpoint uses the ES completion suggester on name.suggest field for instant autocomplete in the portal UI. Returns up to 10 results with < 20 ms P99.
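The mapping bullets above translate roughly into the index definition below, expressed as the dict you would pass to the ES client when creating the index. The name.suggest completion sub-field is what backs the autocomplete endpoint:

```python
# Sketch of the service index mapping; field list mirrors the bullets above
SERVICE_INDEX_MAPPING = {
    "mappings": {
        "properties": {
            "name": {
                "type": "text",
                "analyzer": "english",                  # stemming for full-text
                "fields": {
                    "suggest": {"type": "completion"},  # autocomplete sub-field
                },
            },
            "description": {"type": "text", "analyzer": "english"},
            "tags": {"type": "keyword"},         # exact-match filtering
            "tier": {"type": "keyword"},         # facet
            "status": {"type": "keyword"},       # facet
            "team_slug": {"type": "keyword"},    # facet
            "health_status": {"type": "keyword"},
        }
    }
}
```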

Scalability and Caching Strategy

  • Postgres: single primary + two read replicas behind a connection pooler (PgBouncer). Reads go to replicas. Writes go to primary.
  • Redis: service metadata cached at 5-minute TTL. Cache is invalidated on write (delete-on-write pattern). Health snapshots cached at 60-second TTL with background refresh.
  • Elasticsearch: index sharded by hash of service slug; 2 primary shards, 1 replica. Even at 100 000 services the index fits comfortably on a single node, so horizontal scaling can be deferred until well past that point.
  • For very large dependency graphs: Neptune (property graph, Gremlin queries) replaces Postgres recursive CTEs. The dependency table in Postgres becomes the write-through source; a Lambda replicates inserts/deletes to Neptune.
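The delete-on-write invalidation from the Redis bullet can be sketched as follows (`db_write` and the dict standing in for Redis are illustrative stand-ins):

```python
def update_service(slug, patch, db_write, cache):
    """Delete-on-write invalidation: commit the change to the primary
    first, then drop the metadata cache key so the next read repopulates
    it from a replica. Deleting instead of re-setting the key avoids
    writing a value computed from a stale replica read."""
    db_write(slug, patch)                 # mutate Postgres primary
    cache.pop(f"svc:slug:{slug}", None)   # invalidate after commit
```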

API Surface

GET    /v1/services                  -- list + search
POST   /v1/services                  -- register new service
GET    /v1/services/{slug}           -- get full service record
PUT    /v1/services/{slug}           -- full replace (idempotent)
PATCH  /v1/services/{slug}           -- partial update
DELETE /v1/services/{slug}           -- mark deprecated/decommissioned

GET    /v1/services/{slug}/deps/upstream
GET    /v1/services/{slug}/deps/downstream
GET    /v1/services/{slug}/health
GET    /v1/services/{slug}/audit

POST   /v1/validate                  -- validate catalog-info.yaml without writing

Interview Discussion Points

  • Sync vs async registration: Pull model (CI reads catalog-info.yaml and upserts) is preferred over push (service calls catalog on startup) because it decouples availability of the catalog from service startup.
  • Graph DB vs recursive SQL: For most companies Postgres CTEs are sufficient and simpler to operate. Mention Neptune/Neo4j as a scaling path when graphs exceed tens of thousands of nodes with complex traversals.
  • Health aggregation latency: 30-second polling is usually fine; for sub-10s freshness use a webhook/push model from Alertmanager to the catalog.
  • Stale cache on ownership change: Write path deletes Redis key immediately (read-your-writes not guaranteed across replicas, but acceptable for this use case).

