What Is an Internal Service Catalog?
An internal service catalog is a centralized registry where every service in your infrastructure self-describes: what it does, who owns it, what it depends on, what SLAs it offers, and whether it is healthy right now. It is the foundation for incident response, capacity planning, onboarding, and compliance audits. In a large org with hundreds of microservices this system becomes as important as DNS.
Requirements
Functional
- Services can register themselves or be registered manually with a structured metadata payload.
- Each service entry stores: name, team owner, tier (critical/standard/best-effort), language/runtime, repo URL, documentation URL, API spec URL, SLA targets, dependencies, and tags.
- Dependency graph: directed edges from consumer to provider, stored and queryable in both directions.
- Health status aggregation pulled from monitoring systems and surfaced per service.
- Full-text and faceted search over all service metadata.
- Changelog / audit trail for every mutation.
Non-Functional
- Read-heavy workload: hundreds of reads per second from dashboards, CI pipelines, and on-call tooling; writes are infrequent.
- Eventual consistency is acceptable for search index; metadata store must be strongly consistent.
- P99 read latency < 100 ms.
- High availability: catalog must stay readable even if the write path is degraded.
Core Data Model
services
id UUID PK
slug VARCHAR(128) UNIQUE NOT NULL -- machine-friendly name, e.g. payment-service
display_name VARCHAR(256)
description TEXT
team_id UUID FK -> teams.id
tier ENUM('critical','standard','best_effort')
status ENUM('active','deprecated','decommissioned')
repo_url TEXT
docs_url TEXT
api_spec_url TEXT
sla_uptime DECIMAL(5,2) -- e.g. 99.95
sla_p99_ms INT
created_at TIMESTAMPTZ
updated_at TIMESTAMPTZ
service_tags
service_id UUID FK
tag VARCHAR(64)
PRIMARY KEY (service_id, tag)
dependencies
consumer_id UUID FK -> services.id
provider_id UUID FK -> services.id
dep_type ENUM('sync','async','data')
criticality ENUM('hard','soft')
PRIMARY KEY (consumer_id, provider_id)
teams
id UUID PK
name VARCHAR(128)
slack_channel VARCHAR(128)
oncall_id VARCHAR(64) -- PagerDuty / OpsGenie schedule ID
health_snapshots
id UUID PK
service_id UUID FK
source VARCHAR(64) -- e.g. datadog, prometheus
status ENUM('healthy','degraded','down','unknown')
checked_at TIMESTAMPTZ
details JSONB
audit_log
id UUID PK
service_id UUID FK
actor VARCHAR(256)
action VARCHAR(64)
diff JSONB
created_at TIMESTAMPTZ
System Architecture
Components
- Catalog API: REST + GraphQL gateway. REST for CRUD from CI/CD and CLIs. GraphQL for frontend dashboards that need nested dependency graph queries in a single round-trip.
- Metadata Store: PostgreSQL. Handles the authoritative record. Write-ahead log used as event stream for downstream consumers.
- Search Index: Elasticsearch (or OpenSearch). Documents contain denormalized service + team data. Updated via Debezium CDC from Postgres WAL. Supports full-text search on name/description/tags and faceted filters on tier, team, status.
- Health Aggregator: Background worker polling monitoring APIs (Datadog, Prometheus alertmanager) on a 30-second interval. Writes into health_snapshots. Latest snapshot per service is cached in Redis with a 60-second TTL.
- Dependency Graph Engine: PostgreSQL recursive CTEs for small graphs. For large organizations (> 5 000 services), offload to a graph database (Neo4j or Amazon Neptune). Provides: upstream dependencies, downstream dependents, critical path, cycle detection.
- Notification Service: Publishes events (service registered, ownership changed, dependency added, health degraded) to a Kafka topic. Other teams subscribe for their own workflows.
Request Flow: Service Lookup
- Client calls GET /v1/services/{slug}.
- API layer checks the Redis cache (key svc:slug:{slug}, TTL 5 min).
- On a cache miss: read from a Postgres replica, write the result to Redis, and return it to the client.
- Health status is merged in from the Redis health cache before the response is returned.
Request Flow: Search
- Client calls GET /v1/services?q=payment&tier=critical&team=platform.
- API translates the params into an Elasticsearch bool query: must clauses for full-text matching on name/description/tags, filter clauses for tier, team slug, and status.
- Results include service slug, display name, team name, tier, and current health, all stored in the ES document to avoid secondary lookups.
- ES returns ranked results; the API paginates with a search_after cursor (not offset) for stable deep pagination.
Metadata Schema Design
Strict schema validation on write is enforced via JSON Schema (draft-07). The canonical schema lives in the catalog repo and is versioned. Services submit a catalog-info.yaml in their own repo root; a CI hook validates and upserts on every merge to main.
catalog-info.yaml structure:
apiVersion: catalog/v1
kind: Service
metadata:
name: payment-service
team: payments-platform
tier: critical
tags: [payments, pci, checkout]
spec:
description: Handles payment authorization and capture.
repo: https://github.com/acme/payment-service
docs: https://docs.internal/payment-service
apiSpec: https://apis.internal/payment-service/openapi.json
sla:
uptime: 99.95
p99Latency: 80
dependencies:
- name: fraud-service
type: sync
criticality: hard
- name: ledger-service
type: async
criticality: soft
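The CI validation hook enforces checks of this shape. A simplified pure-Python sketch (the real hook would run a JSON Schema draft-07 validator against the canonical schema; field names and enum values follow the manifest format above):

```python
VALID_TIERS = {"critical", "standard", "best-effort"}
VALID_DEP_TYPES = {"sync", "async", "data"}
VALID_CRITICALITIES = {"hard", "soft"}

def validate_manifest(doc: dict) -> list[str]:
    """Return a list of validation errors for a parsed catalog-info.yaml."""
    errors = []
    if doc.get("apiVersion") != "catalog/v1":
        errors.append("apiVersion must be catalog/v1")
    if doc.get("kind") != "Service":
        errors.append("kind must be Service")
    meta = doc.get("metadata", {})
    for field in ("name", "team", "tier"):
        if field not in meta:
            errors.append(f"metadata.{field} is required")
    if meta.get("tier") not in VALID_TIERS:
        errors.append(f"metadata.tier must be one of {sorted(VALID_TIERS)}")
    for i, dep in enumerate(doc.get("spec", {}).get("dependencies", [])):
        if dep.get("type") not in VALID_DEP_TYPES:
            errors.append(f"dependencies[{i}].type invalid")
        if dep.get("criticality") not in VALID_CRITICALITIES:
            errors.append(f"dependencies[{i}].criticality invalid")
    return errors
```

Returning all errors at once, rather than failing on the first one, gives service authors a complete fix list per CI run.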
Dependency Graph: Queries
PostgreSQL recursive CTE for all upstream dependencies of a service:
WITH RECURSIVE upstream AS (
    SELECT provider_id, dep_type, criticality, 1 AS depth
    FROM dependencies
    WHERE consumer_id = <target_service_id>
  UNION ALL
    SELECT d.provider_id, d.dep_type, d.criticality, u.depth + 1
    FROM dependencies d
    JOIN upstream u ON d.consumer_id = u.provider_id
    WHERE u.depth < 10 -- depth bound guards against unbounded recursion on cycles
)
SELECT s.slug, s.display_name, u.dep_type, u.criticality, u.depth
FROM upstream u
JOIN services s ON s.id = u.provider_id;
Cycle detection runs as a nightly DFS job; any cycle found triggers an alert. To keep the graph acyclic in the first place, a dependency write that would introduce a cycle is rejected at write time.
Health Status Aggregation
The Health Aggregator runs a pool of workers, one per monitoring source. Each worker:
- Fetches current alert/status data from the external system (Datadog monitor states, Prometheus alertmanager groups).
- Maps each alert to a service_id via a mapping table (alert_name or label selector -> service slug).
- Derives a status: if any CRITICAL alert is firing -> down; if any WARNING alert is firing -> degraded; otherwise healthy.
- Upserts into health_snapshots and updates the Redis key svc:health:{service_id}.
The catalog API never calls monitoring systems directly. It reads only from the Redis health cache. This keeps catalog API latency independent of monitoring system latency.
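The per-source status derivation described above is a few lines of logic. The sketch below adds a cross-source roll-up, which is an assumption on top of the text: when multiple monitoring sources report for one service, the worst signal wins. Status names match the health_snapshots enum:

```python
# Severity ordering for the cross-source roll-up (worst signal wins).
SEVERITY = {"healthy": 0, "unknown": 1, "degraded": 2, "down": 3}

def derive_status(alerts: list[dict]) -> str:
    """Map one monitoring source's firing alerts to a single service status."""
    if any(a["severity"] == "CRITICAL" and a["firing"] for a in alerts):
        return "down"
    if any(a["severity"] == "WARNING" and a["firing"] for a in alerts):
        return "degraded"
    return "healthy"

def aggregate(per_source: dict[str, str]) -> str:
    """Combine per-source statuses (e.g. datadog, prometheus): worst wins."""
    if not per_source:
        return "unknown"
    return max(per_source.values(), key=lambda s: SEVERITY[s])
```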
Team Ownership and Access Control
- Each service has exactly one owning team. Ownership changes are logged in audit_log.
- Write access is restricted: only the owning team (matched via SSO group) or platform-admin can mutate a service record.
- Read access is open to all internal users.
- SCIM sync keeps the teams table up to date with the IdP. When a team is dissolved, its services are flagged status=deprecated and an alert fires.
Search and Discovery
The Elasticsearch index mapping includes:
- name and description as text with the English analyzer for stemming.
- tags as a keyword array for exact-match filtering.
- tier, status, and team_slug as keyword fields for facets.
- health_status as keyword, denormalized from Redis by the CDC update worker.
The suggest endpoint uses the ES completion suggester on the name.suggest field for instant autocomplete in the portal UI, returning up to 10 results at < 20 ms P99.
Scalability and Caching Strategy
- Postgres: single primary + two read replicas behind a connection pooler (PgBouncer). Reads go to replicas. Writes go to primary.
- Redis: service metadata cached at 5-minute TTL. Cache is invalidated on write (delete-on-write pattern). Health snapshots cached at 60-second TTL with background refresh.
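The delete-on-write invalidation mentioned above can be sketched as follows, with dicts standing in for the Postgres primary and Redis (names are illustrative):

```python
# Stand-ins for the Postgres primary and Redis; in production these would be
# a DB transaction and a redis-py client.
cache: dict[str, str] = {}
primary_db: dict[str, dict] = {}

def update_service(slug: str, fields: dict) -> None:
    """Delete-on-write: mutate the authoritative store, then drop the cache key."""
    record = primary_db.setdefault(slug, {"slug": slug})
    record.update(fields)                  # 1. write to the primary
    cache.pop(f"svc:slug:{slug}", None)    # 2. invalidate; next read repopulates
```

Deleting the key rather than re-setting it avoids caching a value computed from a stale replica read; the next lookup repopulates the cache through the normal read path.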
- Elasticsearch: index sharded by hash of the service slug, with 2 primary shards and 1 replica. Even at 100 000 services the index fits comfortably on a single node, so horizontal scaling can be deferred until well past that point.
- For very large dependency graphs: Neptune (property graph, Gremlin queries) replaces Postgres recursive CTEs. The dependency table in Postgres becomes the write-through source; a Lambda replicates inserts/deletes to Neptune.
API Surface
GET /v1/services -- list + search
POST /v1/services -- register new service
GET /v1/services/{slug} -- get full service record
PUT /v1/services/{slug} -- full replace (idempotent)
PATCH /v1/services/{slug} -- partial update
DELETE /v1/services/{slug} -- mark deprecated/decommissioned
GET /v1/services/{slug}/deps/upstream
GET /v1/services/{slug}/deps/downstream
GET /v1/services/{slug}/health
GET /v1/services/{slug}/audit
POST /v1/validate -- validate catalog-info.yaml without writing
Interview Discussion Points
- Sync vs async registration: Pull model (CI reads catalog-info.yaml and upserts) is preferred over push (service calls catalog on startup) because it decouples availability of the catalog from service startup.
- Graph DB vs recursive SQL: For most companies Postgres CTEs are sufficient and simpler to operate. Mention Neptune/Neo4j as a scaling path when graphs exceed tens of thousands of nodes with complex traversals.
- Health aggregation latency: 30-second polling is usually fine; for sub-10s freshness use a webhook/push model from alertmanager to the catalog.
- Stale cache on ownership change: Write path deletes Redis key immediately (read-your-writes not guaranteed across replicas, but acceptable for this use case).