What Is a Tagging Service?
A tagging service allows users and systems to attach descriptive labels (tags) to arbitrary entities such as articles, tasks, products, or code commits. It must support a hierarchical tag taxonomy, provide fast autocomplete suggestions, enable bulk tagging operations, and answer cross-entity queries like “give me all entities tagged with both ‘backend’ and ‘urgent’.”
Requirements
Functional Requirements
- Create, update, and delete tags with optional parent-child hierarchy.
- Attach one or more tags to any entity; detach tags; list all tags on an entity.
- Autocomplete tag suggestions as a user types (prefix and fuzzy match).
- Query all entities of a given type that carry a specified set of tags (intersection and union modes).
- Support bulk tag operations: apply a tag to thousands of entities in one API call.
Non-Functional Requirements
- Autocomplete must return results under 50 ms at p99.
- Cross-entity tag queries must support result sets of up to 100,000 entities with cursor pagination.
- Bulk apply of up to 10,000 entities must complete within 10 seconds.
Data Model
- tag: tag_id, name (unique within namespace), namespace, parent_tag_id (nullable), slug, created_by, created_at.
- entity_tag: entity_tag_id, tag_id, entity_id, entity_type, tagged_by, tagged_at. Composite index on (tag_id, entity_type) for cross-entity queries and on (entity_id, entity_type) for per-entity tag lists.
- tag_alias: alias_id, canonical_tag_id, alias_name. Used to redirect deprecated tag names to their canonical replacement.
The parent-child hierarchy is a self-referencing foreign key on tag. For deep hierarchies an adjacency list works up to three or four levels; for unlimited depth a closure table (storing all ancestor-descendant pairs) enables efficient subtree queries.
Core Algorithms
Autocomplete with Trie and Inverted Index
Tag names are indexed in an in-memory trie on the service for prefix matching. On startup the service loads all tag names into the trie. Updates (new tags, renames) are applied incrementally. For fuzzy matching (handling typos) the service additionally uses a trigram inverted index: each tag name is split into overlapping three-character n-grams stored in a hash map pointing to tag_ids. A query computes trigrams for the input and finds the tags with the highest trigram overlap, scored by Jaccard similarity. Results below a minimum score threshold are filtered out.
Cross-Entity Tag Query
For intersection queries (entities having ALL of a set of tags), the service queries the entity_tag table for each tag individually, producing sorted lists of entity_ids, then performs a merge-join to find the common set. For union queries it uses a heap-merge over the sorted lists. Both operations are O(N log K) where N is the total result set size and K is the number of tags. An execution planner orders the per-tag scans from smallest to largest (using stored tag entity counts) to prune early during intersection.
Bulk Tag Application
Bulk apply receives a list of entity_ids and a list of tag_ids. It splits the entities into batches of 500 and inserts rows into entity_tag using multi-row INSERT … ON CONFLICT DO NOTHING (for idempotency). Each batch is a single database round-trip. A bulk operation is tracked in a bulk_job table with progress counters, allowing the caller to poll for completion. On failure the job stores the last processed offset so a retry can resume without reprocessing already-tagged entities.
API Design
GET /tags/suggest?q=back&namespace=project— returns top 10 autocomplete matches with tag_id and display name.POST /entities/{entity_type}/{entity_id}/tags— body: {tag_ids: […]}. Attaches tags; idempotent.DELETE /entities/{entity_type}/{entity_id}/tags/{tag_id}— detaches a single tag.GET /tags/{tag_id}/entities?entity_type=task&cursor=X— paginated list of entities carrying this tag.POST /tags/query— body: {tag_ids: […], mode: “intersection”|”union”, entity_type: “task”}. Cross-entity query.POST /bulk/tag— body: {entity_ids: […], tag_ids: […]}. Returns bulk_job_id.
Scalability and Reliability
Read Cache for Popular Tags
Tags with high entity counts are cached in Redis as sorted sets keyed by tag_id and entity_type, ordered by tagged_at. Pagination uses the Redis ZRANGEBYSCORE command with cursor-based offsets. Cache entries are invalidated on tag/entity mutations. Tags below a popularity threshold are served directly from the database to avoid cache pollution.
Namespace Isolation
Tags are scoped to a namespace (e.g. workspace_id or project_id). All queries include the namespace in the index prefix. This prevents tag name collisions across tenants and allows efficient pruning — a query for namespace A never touches namespace B data. Namespace-level quotas (max tags, max entity_tag rows) are enforced to prevent abuse.
Trie Refresh Strategy
The in-memory trie is rebuilt from the database on service startup. New tags are added to the live trie without a restart using an async update channel. A background job runs every five minutes to reconcile any missed updates. Memory usage is bounded by limiting the trie to the active namespace on each service instance using consistent hashing of namespaces to instances.
Trade-offs and Interview Discussion Points
- Flat tags versus hierarchy: flat tags are simple to implement and query; hierarchies enable “all tasks tagged with any subtype of Backend” queries but add complexity to the data model and query planner.
- Inverted index in-process versus Elasticsearch: an in-process trie is faster for prefix autocomplete with no network hop, but Elasticsearch provides richer fuzzy search and scales horizontally with no custom code. The right choice depends on search richness requirements.
- Synchronous versus async bulk apply: synchronous completion is simpler for the caller but can time out for large batches. The async job model with polling adds a round-trip but handles arbitrarily large inputs without HTTP timeouts.
See also: Atlassian Interview Guide
See also: Scale AI Interview Guide 2026: Data Infrastructure, RLHF Pipelines, and ML Engineering