Metadata Catalog Low-Level Design: Asset Discovery, Tag Taxonomy, and Search Index

Metadata Catalog: Overview and Requirements

A metadata catalog is the central directory of data assets in an organization. It enables teams to discover datasets, understand their structure and ownership, apply governance tags, and search across all registered assets. Think of it as the search engine for your data platform.

Functional Requirements

  • Automatically discover and register assets from connected data sources (databases, data lakes, BI tools, streaming topics).
  • Maintain a hierarchical tag taxonomy: domain, subdomain, sensitivity classification, PII flag.
  • Provide full-text and faceted search over asset names, descriptions, tags, and column names.
  • Track data ownership: technical owner (team), business owner (stakeholder), and steward.
  • Surface usage statistics: query frequency, last accessed timestamp, consumer count.

Non-Functional Requirements

  • Discovery crawl completion under 30 minutes for sources with up to 100,000 tables.
  • Search query latency under 300 ms at the 95th percentile.
  • Eventual consistency between source system schema changes and catalog state within 1 hour.
  • Role-based access: assets marked sensitive are visible only to authorized roles.

Data Model

Asset Entity

  • asset_id — UUID.
  • asset_type — TABLE, VIEW, TOPIC, DASHBOARD, MODEL, PIPELINE.
  • fqn — fully qualified name used as the stable external identifier.
  • source_id — foreign key to the registered data source connector.
  • description — free-text description, editable by owners.
  • schema_snapshot — JSONB column list with types, captured at crawl time.
  • created_at, last_crawled_at, last_accessed_at.

Tag Taxonomy

  • tag_id, parent_tag_id — self-referential hierarchy.
  • name, color, description.
  • classification_level — PUBLIC, INTERNAL, CONFIDENTIAL, RESTRICTED.

Asset-Tag Junction

Many-to-many table linking asset_id to tag_id with applied_by, applied_at, and scope (ASSET or COLUMN plus column_name for column-level tagging).

Core Algorithms

Asset Discovery Crawl

Each registered source has a connector that speaks the source protocol (JDBC for relational databases, REST for BI APIs, Kafka AdminClient for topics). A scheduler triggers crawl jobs on a configurable cadence (default: hourly). The crawl algorithm:

  • Fetch the current asset list from the source.
  • Diff against the last crawl snapshot stored in the catalog.
  • Insert new assets, update changed schemas, soft-delete dropped assets.
  • Emit a change event for each modified asset so downstream systems (lineage service, search index) can react.

Search Indexing

Use Elasticsearch or OpenSearch as the search backend. On asset create or update, index a document containing fqn, asset_type, description, column names, tag names, and owner identifiers. Configure a multi-field analyzer: keyword for exact faceting, standard for full-text, and edge-ngram for prefix typeahead.

  • Faceted filters: asset_type, classification_level, domain tag, owner team.
  • Boosting: assets with descriptions and recent access score higher than undocumented stale assets.
  • Highlighting: return matched excerpt snippets to explain why a result was returned.

Scalability Design

  • Run crawl workers as horizontally scalable pods. Distribute source connectors across workers via a work-queue (Redis or Kafka) to avoid a single crawler becoming a bottleneck.
  • Use change-data-capture (CDC) on the catalog PostgreSQL database to feed the Elasticsearch index asynchronously, avoiding dual-write inconsistencies.
  • Cache popular search queries in Redis with a short TTL (60 seconds) to absorb burst traffic from dashboard load events where dozens of users hit search simultaneously.
  • Shard the Elasticsearch index by asset_type so a spike in TABLE document updates does not degrade DASHBOARD search latency.

API Design

  • GET /v1/assets/search?q={query}&type={type}&tag={tag}&page={n} — full-text search with facet filters; returns asset summaries with highlight snippets.
  • GET /v1/assets/{asset_id} — retrieve full asset detail including schema snapshot, tags, owners, and usage stats.
  • PATCH /v1/assets/{asset_id} — update description and owner fields; triggers re-indexing.
  • POST /v1/assets/{asset_id}/tags — apply tags to an asset or specific column.
  • DELETE /v1/assets/{asset_id}/tags/{tag_id} — remove a tag association.
  • POST /v1/sources — register a new data source connector with credentials and crawl schedule.
  • POST /v1/sources/{source_id}/crawl — trigger an on-demand crawl.

Data Ownership and Governance

  • Ownership records include technical_owner_team, business_owner_email, and steward_email stored as structured fields, not free text, so they are queryable and used in access control decisions.
  • Assets with RESTRICTED classification require an approved data access request before the schema snapshot is visible to non-owner roles.
  • Orphan detection: a scheduled job flags assets where last_accessed_at is older than 90 days and no owner has confirmed relevance — surfacing them for deprecation review.

Observability

  • Track crawl success rate and duration per source connector to detect degraded or stalled integrations.
  • Monitor search query latency and zero-result rate — a rising zero-result rate signals that newly created assets are not being indexed promptly.
  • Expose a catalog coverage metric: percentage of known assets that have a description, at least one tag, and a defined owner.

See also: Netflix Interview Guide 2026: Streaming Architecture, Recommendation Systems, and Engineering Excellence

See also: Scale AI Interview Guide 2026: Data Infrastructure, RLHF Pipelines, and ML Engineering

See also: Databricks Interview Guide 2026: Spark Internals, Delta Lake, and Lakehouse Architecture

Scroll to Top