Data Catalog Low-Level Design: Asset Discovery, Schema Registry, and Business Glossary

Data Catalog Overview

A Data Catalog is the central registry of data assets within an organization. It provides automated discovery of tables, topics, and files across storage systems; a schema registry for tracking column-level metadata and evolution; a business glossary that maps technical names to human-readable definitions; and stewardship workflows that keep metadata accurate and up to date.

Requirements

Functional Requirements

  • Automatically discover and register data assets (tables, Kafka topics, S3 prefixes, API endpoints) from connected source systems.
  • Store and version schemas: column names, data types, nullable flags, and descriptions.
  • Detect schema changes (column added, dropped, or type-changed) and notify owners.
  • Maintain a business glossary: term definitions, owners, synonyms, and links to assets that implement each term.
  • Support search across assets by name, tag, glossary term, owner, and column-level attributes.
  • Provide stewardship workflows: data owners can review, approve, and publish metadata updates.

Non-Functional Requirements

  • Asset discovery crawlers complete a full scan of 1 million assets within 4 hours.
  • Search queries return results within 300 ms at p95 over a catalog of 1 million assets.
  • Schema version history is retained indefinitely for compliance purposes.

Data Model

The Asset document (stored in a document database or search index) contains: asset_id UUID, asset_type ENUM (table, topic, file-prefix, api), qualified_name (fully qualified path, e.g., warehouse.db.table), display_name, description TEXT, owner_team, tags LIST, created_at, last_crawled_at, and source_system.

The SchemaVersion table records schema snapshots: schema_id UUID, asset_id, version INT, columns JSON (array of {name, type, nullable, description, pii_flag}), effective_from TIMESTAMP, and change_summary TEXT. The current schema is the highest-version record per asset.

The GlossaryTerm table stores: term_id UUID, name, definition TEXT, synonyms LIST, owner_team, status ENUM (draft, published, deprecated), related_asset_ids LIST, and parent_term_id (for hierarchical glossary structure).
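The three record types above can be sketched as Python dataclasses. This is an illustrative shape only, not a storage schema; field names follow the document, and the UUID/timestamp defaults are assumptions for the sketch.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional
import uuid

@dataclass
class Column:
    name: str
    type: str
    nullable: bool = True
    description: str = ""
    pii_flag: bool = False

@dataclass
class Asset:
    asset_type: str                  # "table" | "topic" | "file-prefix" | "api"
    qualified_name: str              # fully qualified path, e.g. "warehouse.db.table"
    display_name: str
    source_system: str
    description: str = ""
    owner_team: str = ""
    tags: list = field(default_factory=list)
    asset_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    last_crawled_at: Optional[datetime] = None

@dataclass
class SchemaVersion:
    asset_id: str
    version: int                     # current schema = highest version per asset
    columns: list                    # list of Column
    effective_from: datetime
    change_summary: str = ""
    schema_id: str = field(default_factory=lambda: str(uuid.uuid4()))

@dataclass
class GlossaryTerm:
    name: str
    definition: str
    owner_team: str
    status: str = "draft"            # draft | published | deprecated
    synonyms: list = field(default_factory=list)
    related_asset_ids: list = field(default_factory=list)
    parent_term_id: Optional[str] = None
    term_id: str = field(default_factory=lambda: str(uuid.uuid4()))
```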

Core Algorithms

Asset Discovery

Source-system crawlers connect to each registered integration (Hive Metastore, AWS Glue, Confluent Schema Registry, PostgreSQL information_schema) and emit asset manifests to a Kafka ingestion topic. A discovery consumer upserts asset records and compares incoming schemas against the latest stored schema version. If a difference is detected (field-level diff using a JSON patch algorithm), a new SchemaVersion record is written and change notifications are queued for owners.
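The field-level diff the discovery consumer performs can be sketched as follows. The function and dict keys here are illustrative, not a reference to any specific JSON-patch library; it detects the three change classes named above (added, dropped, type-changed).

```python
def diff_schemas(old_cols, new_cols):
    """Field-level diff between two column lists, keyed by column name.

    Each column is a dict with at least "name" and "type". Returns a list
    of change events; an empty list means no new SchemaVersion is needed.
    """
    old = {c["name"]: c for c in old_cols}
    new = {c["name"]: c for c in new_cols}
    changes = []
    for name in new.keys() - old.keys():
        changes.append({"change": "added", "column": name})
    for name in old.keys() - new.keys():
        changes.append({"change": "dropped", "column": name})
    for name in old.keys() & new.keys():
        if old[name]["type"] != new[name]["type"]:
            changes.append({"change": "type_changed", "column": name,
                            "from": old[name]["type"], "to": new[name]["type"]})
    return changes
```

If `diff_schemas` returns a non-empty list, the consumer writes a new SchemaVersion with the change list summarized in `change_summary` and queues notifications for the asset's owner team.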

Schema Registry Integration

For Kafka topics backed by Avro or Protobuf schemas, the catalog integrates directly with the Schema Registry API. On each crawl cycle, it fetches the latest schema ID for each subject, retrieves the schema if the ID has changed, and writes a new SchemaVersion. The catalog enforces the registry compatibility rules (backward, forward, full) as metadata on the asset record, providing a single place to check both schema content and evolution policy.
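One crawl step against the registry can be sketched like this. The endpoint paths follow the Confluent Schema Registry REST API (`/subjects/{subject}/versions/latest`, `/config/{subject}`); the `fetch_json` callable and the fallback to `BACKWARD` when no subject-level compatibility override exists are assumptions of this sketch, and a real client would also handle the 404 the config endpoint returns for subjects without an override.

```python
def crawl_subject(subject, fetch_json, last_known_id):
    """Fetch the latest schema for one subject; return a new-version payload
    if the registry schema ID changed since the last crawl, else None.

    fetch_json: callable mapping a registry path to its decoded JSON response.
    """
    latest = fetch_json(f"/subjects/{subject}/versions/latest")
    if latest["id"] == last_known_id:
        return None  # unchanged: no new SchemaVersion needed
    # Record the evolution policy alongside the schema content.
    compat = fetch_json(f"/config/{subject}").get("compatibilityLevel", "BACKWARD")
    return {
        "subject": subject,
        "registry_schema_id": latest["id"],
        "registry_version": latest["version"],
        "schema": latest["schema"],
        "compatibility": compat,
    }
```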

Search Indexing

Asset records, schema column names, descriptions, and glossary terms are indexed in Elasticsearch (or OpenSearch). A custom analyzer tokenizes camelCase and snake_case identifiers into their common word parts, so customerOrderId and customer_order_id reduce to the same query tokens. Tag and owner fields use keyword indexing for exact-match filtering. Search results are ranked by a combination of BM25 text relevance, asset popularity (access frequency), and metadata completeness score.
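In Elasticsearch this tokenization would live in a custom analyzer (for example via a word-delimiter token filter); the Python below is only an approximation of the tokenization behavior, to make the matching rule concrete.

```python
import re

def identifier_tokens(name):
    """Split an identifier into lowercase word tokens so that camelCase,
    snake_case, and kebab-case forms of the same name produce identical
    tokens (e.g. customerOrderId and customer_order_id -> customer/order/id)."""
    tokens = []
    # First break on explicit separators, then on camelCase boundaries.
    for part in re.split(r"[_\-.]+", name):
        tokens.extend(
            re.findall(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+", part)
        )
    return [t.lower() for t in tokens if t]
```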

API Design

  • SearchAssets(SearchRequest) → SearchResults — full-text and faceted search over assets and glossary terms.
  • GetAsset(AssetId) → AssetDetail — returns full asset record including current schema, tags, and glossary term links.
  • GetSchemaHistory(AssetId) → SchemaVersionList — returns all schema versions for an asset in chronological order.
  • RegisterGlossaryTerm(GlossaryTerm) → TermId — creates or updates a glossary term; new terms start in draft status.
  • PublishGlossaryTerm(TermId) → Ack — stewardship approval step that transitions a term to published status.
  • TagAsset(AssetId, Tags) → Ack — adds or removes tags on an asset; triggers re-indexing.
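The glossary lifecycle behind RegisterGlossaryTerm and PublishGlossaryTerm can be sketched with a minimal in-memory service. This is a simplification (it only creates terms, while the real API also updates them, and IDs are sequential rather than UUIDs); it shows the draft-to-published transition that stewardship approval gates.

```python
class GlossaryService:
    """In-memory sketch of the glossary term lifecycle."""

    def __init__(self):
        self._terms = {}
        self._next_id = 1

    def register_term(self, name, definition, owner_team):
        """RegisterGlossaryTerm: new terms always start in draft status."""
        term_id = str(self._next_id)
        self._next_id += 1
        self._terms[term_id] = {"name": name, "definition": definition,
                                "owner_team": owner_team, "status": "draft"}
        return term_id

    def publish_term(self, term_id):
        """PublishGlossaryTerm: stewardship approval, draft -> published only."""
        term = self._terms[term_id]
        if term["status"] != "draft":
            raise ValueError(f"only draft terms can be published, got {term['status']}")
        term["status"] = "published"

    def get_term(self, term_id):
        return self._terms[term_id]
```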

Scalability and Fault Tolerance

Discovery crawlers are isolated per source system and run as independent scheduled jobs. Each crawler is stateless; it reads the current asset list from the source and emits manifests to Kafka. If a crawler fails, the next scheduled run picks up from the current source state — no checkpointing needed, at the cost of re-emitting unchanged assets (handled by idempotent upsert).
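The idempotent-upsert property that makes checkpoint-free crawler restarts safe can be sketched against a simple key-value store (a stand-in for the real metadata database), keyed on the asset's qualified name:

```python
def upsert_asset(store, manifest):
    """Idempotent upsert keyed on qualified_name.

    Re-emitting an unchanged manifest is a no-op, so a failed crawler can
    simply re-run and re-emit everything without corrupting the catalog.
    Returns True if a write happened, False if the manifest was unchanged.
    """
    key = manifest["qualified_name"]
    if store.get(key) == manifest:
        return False  # unchanged: nothing written
    store[key] = manifest
    return True
```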

The search index is updated asynchronously from a Kafka consumer. Search reads are served from Elasticsearch replicas; writes go to the primary shard. The catalog service is stateless and horizontally scalable. Asset and schema metadata are stored in a relational database with connection pooling; read replicas serve catalog browse and API traffic to keep write load off the primary.

Monitoring

  • Track crawler success rate and last-crawled-at age per source system; alert if any source has not been crawled within 2x its scheduled interval.
  • Monitor schema change event rate; a spike may indicate a schema migration wave or a misconfigured crawler emitting false diffs.
  • Track search query latency and zero-result rate; high zero-result rate on non-trivial queries suggests indexing or tokenization gaps.
  • Report metadata completeness: fraction of assets with owner, description, and at least one tag assigned.
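The metadata completeness metric from the last bullet can be computed directly from asset records. The per-asset check list (owner, description, at least one tag) follows the document; treating the catalog-level figure as the fraction of assets passing all checks is an assumption of this sketch.

```python
def completeness(asset):
    """Fraction of completeness checks an asset passes:
    owner assigned, description present, at least one tag."""
    checks = [
        bool(asset.get("owner_team")),
        bool(asset.get("description")),
        len(asset.get("tags", [])) > 0,
    ]
    return sum(checks) / len(checks)

def catalog_completeness(assets):
    """Fraction of assets that pass every completeness check."""
    if not assets:
        return 0.0
    return sum(1 for a in assets if completeness(a) == 1.0) / len(assets)
```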
