What mechanisms are used for asset discovery in a metadata catalog?

Asset discovery relies on push-based registration (producers emit metadata events on publish), pull-based crawling (the catalog periodically scans data sources via connectors), and passive interception (observing query logs or S3 notifications). Combining all three maximizes coverage while keeping registration burden low.

How is a hierarchical tag taxonomy designed for a metadata catalog?

Tags are organized in a parent-child tree (e.g., Sensitivity > PII > Email) stored in a closure table or nested-set model. Assets inherit ancestor tags automatically, allowing policy enforcement at any level. Tag governance includes approval workflows and a dictionary of canonical terms to prevent sprawl.

How is full-text search implemented over catalog assets?

An inverted index (Elasticsearch or OpenSearch) ingests asset names, descriptions, column names, and tag values. Write path publishes change events to a queue consumed by an indexer; read path executes ranked queries with boosting rules that prioritize name matches over description matches. Faceted filters on owner, domain, and tag narrow results.

How are data ownership and stewardship modeled in a metadata catalog?

Each asset carries a technical owner (team or service account responsible for SLA) and one or more data stewards (business stakeholders accountable for quality and semantics). Ownership is stored as foreign keys to an identity store. Change notifications, quality alerts, and access-request approvals are routed to the relevant owner/steward role.

Metadata Catalog Low-Level Design: Asset Discovery, Tag Taxonomy, and Search Index

⏱ 5 min read

Metadata Catalog: Overview and Requirements

A metadata catalog is the central directory of data assets in an organization. It enables teams to discover datasets, understand their structure and ownership, apply governance tags, and search across all registered assets. Think of it as the search engine for your data platform.

Functional Requirements

Automatically discover and register assets from connected data sources (databases, data lakes, BI tools, streaming topics).
Maintain a hierarchical tag taxonomy: domain, subdomain, sensitivity classification, PII flag.
Provide full-text and faceted search over asset names, descriptions, tags, and column names.
Track data ownership: technical owner (team), business owner (stakeholder), and steward.
Surface usage statistics: query frequency, last accessed timestamp, consumer count.

Non-Functional Requirements

Discovery crawl completion under 30 minutes for sources with up to 100,000 tables.
Search query latency under 300 ms at the 95th percentile.
Eventual consistency between source system schema changes and catalog state within 1 hour.
Role-based access: assets marked sensitive are visible only to authorized roles.

Data Model

Asset Entity

asset_id — UUID.
asset_type — TABLE, VIEW, TOPIC, DASHBOARD, MODEL, PIPELINE.
fqn — fully qualified name used as the stable external identifier.
source_id — foreign key to the registered data source connector.
description — free-text description, editable by owners.
schema_snapshot — JSONB column list with types, captured at crawl time.
created_at, last_crawled_at, last_accessed_at.

Tag Taxonomy

tag_id, parent_tag_id — self-referential hierarchy.
name, color, description.
classification_level — PUBLIC, INTERNAL, CONFIDENTIAL, RESTRICTED.

Asset-Tag Junction

Many-to-many table linking asset_id to tag_id with applied_by, applied_at, and scope (ASSET or COLUMN plus column_name for column-level tagging).

Core Algorithms

Asset Discovery Crawl

Each registered source has a connector that speaks the source protocol (JDBC for relational databases, REST for BI APIs, Kafka AdminClient for topics). A scheduler triggers crawl jobs on a configurable cadence (default: hourly). The crawl algorithm:

Fetch the current asset list from the source.
Diff against the last crawl snapshot stored in the catalog.
Insert new assets, update changed schemas, soft-delete dropped assets.
Emit a change event for each modified asset so downstream systems (lineage service, search index) can react.

Search Indexing

Use Elasticsearch or OpenSearch as the search backend. On asset create or update, index a document containing fqn, asset_type, description, column names, tag names, and owner identifiers. Configure a multi-field analyzer: keyword for exact faceting, standard for full-text, and edge-ngram for prefix typeahead.

Faceted filters: asset_type, classification_level, domain tag, owner team.
Boosting: assets with descriptions and recent access score higher than undocumented stale assets.
Highlighting: return matched excerpt snippets to explain why a result was returned.

Scalability Design

Run crawl workers as horizontally scalable pods. Distribute source connectors across workers via a work-queue (Redis or Kafka) to avoid a single crawler becoming a bottleneck.
Use change-data-capture (CDC) on the catalog PostgreSQL database to feed the Elasticsearch index asynchronously, avoiding dual-write inconsistencies.
Cache popular search queries in Redis with a short TTL (60 seconds) to absorb burst traffic from dashboard load events where dozens of users hit search simultaneously.
Shard the Elasticsearch index by asset_type so a spike in TABLE document updates does not degrade DASHBOARD search latency.

API Design

GET /v1/assets/search?q={query}&type={type}&tag={tag}&page={n} — full-text search with facet filters; returns asset summaries with highlight snippets.
GET /v1/assets/{asset_id} — retrieve full asset detail including schema snapshot, tags, owners, and usage stats.
PATCH /v1/assets/{asset_id} — update description and owner fields; triggers re-indexing.
POST /v1/assets/{asset_id}/tags — apply tags to an asset or specific column.
DELETE /v1/assets/{asset_id}/tags/{tag_id} — remove a tag association.
POST /v1/sources — register a new data source connector with credentials and crawl schedule.
POST /v1/sources/{source_id}/crawl — trigger an on-demand crawl.

Data Ownership and Governance

Ownership records include technical_owner_team, business_owner_email, and steward_email stored as structured fields, not free text, so they are queryable and used in access control decisions.
Assets with RESTRICTED classification require an approved data access request before the schema snapshot is visible to non-owner roles.
Orphan detection: a scheduled job flags assets where last_accessed_at is older than 90 days and no owner has confirmed relevance — surfacing them for deprecation review.

Observability

Track crawl success rate and duration per source connector to detect degraded or stalled integrations.
Monitor search query latency and zero-result rate — a rising zero-result rate signals that newly created assets are not being indexed promptly.
Expose a catalog coverage metric: percentage of known assets that have a description, at least one tag, and a defined owner.