Question 1

How does slug normalization prevent tag fragmentation?

Accepted Answer

Without normalization, users create "Node.js", "nodejs", "node-js", and "NodeJS" as four separate tags. In real-world data a single concept can have hundreds of spellings, each holding only a fraction of the true engagement count, so none reaches the critical mass needed to appear in autocomplete suggestions. Normalization maps each variant to a canonical slug: lowercase, replace non-alphanumeric characters with hyphens, collapse multiple hyphens, and strip leading and trailing hyphens. For example, "Node.JS" → "node-js". These rules alone do not unify every spelling: "NodeJS" normalizes to "nodejs", which is still distinct from "node-js". The synonym table closes that gap and also handles intentional aliases: "k8s" → "kubernetes", "ml" → "machine-learning". Run a monthly dedup job: find tag pairs whose slug edit distance is below 3 and whose content overlap exceeds 70%, then propose merges for human review. After merging, update all ContentTag rows to point to the canonical tag_id.
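The normalization rules and synonym lookup can be sketched as below. This is a minimal illustration, not a definitive implementation: the function names, the ASCII-only character class, and the "nodejs" alias entry are assumptions added for the example; only the "k8s" and "ml" aliases come from the answer above.

```python
import re

# Hypothetical synonym table: normalized slug -> canonical slug.
# "k8s" and "ml" are from the text; the "nodejs" entry is an assumed alias,
# since hyphen-free spellings cannot be unified by normalization alone.
SYNONYMS = {
    "k8s": "kubernetes",
    "ml": "machine-learning",
    "nodejs": "node-js",
}


def normalize_slug(raw: str) -> str:
    """Lowercase, hyphenate non-alphanumerics, collapse and trim hyphens."""
    slug = raw.strip().lower()
    # One substitution replaces each run of non-alphanumeric characters with a
    # single hyphen, which covers both the "replace" and "collapse" steps.
    # Assumption: ASCII-only slugs; non-ASCII letters also become hyphens.
    slug = re.sub(r"[^a-z0-9]+", "-", slug)
    return slug.strip("-")


def canonical_slug(raw: str) -> str:
    """Normalize, then resolve aliases through the synonym table."""
    slug = normalize_slug(raw)
    return SYNONYMS.get(slug, slug)
```

For instance, canonical_slug("Node.JS") and canonical_slug("NodeJS") both resolve to "node-js", the second only because of the assumed alias entry.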

Question 2

How do you build and maintain the tag autocomplete index efficiently?

Accepted Answer

The autocomplete index is one Redis sorted set per prefix, each holding up to 20 tags (e.g., tags:prefix:py contains the 20 most popular tags starting with "py"). On each tag creation or usage_count update, iterate over all prefixes of the slug (1 to len(slug) characters), ZADD the slug to tags:prefix:{prefix} with score=usage_count, then run ZREMRANGEBYRANK 0 -21 on the same key to keep only the top 20. This costs O(L) Redis writes per tag update (L = slug length, max 50). Problem: updating all prefixes on every usage_count increment is expensive at high write volume. Solution: batch sync instead of real-time. Increment usage_count in PostgreSQL (cheap, one UPDATE). Run a Redis sync job every 15 minutes that selects the top 50K tags by usage_count change and re-indexes them. This decouples the hot write path from Redis writes, at the cost of suggestions lagging by up to 15 minutes.
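The per-prefix writes can be sketched as a pure function that produces the Redis commands for one tag. This is an illustration under assumptions: the function name and the tuple representation are hypothetical, and the key template simply follows the tags:prefix:py example above. No Redis client is used, so the block runs on its own.

```python
from typing import List, Tuple

MAX_PER_PREFIX = 20              # from the text: keep the 20 most popular tags
KEY_TEMPLATE = "tags:prefix:{}"  # assumed key naming, following "tags:prefix:py"

Command = Tuple[str, ...]


def index_commands(slug: str, usage_count: int) -> List[Command]:
    """Build the ZADD / ZREMRANGEBYRANK pair for every prefix of a slug."""
    commands: List[Command] = []
    for end in range(1, len(slug) + 1):
        key = KEY_TEMPLATE.format(slug[:end])
        # ZADD key score member: the score is the tag's usage_count.
        commands.append(("ZADD", key, str(usage_count), slug))
        # Ranks ascend by score, so removing ranks 0..-(N+1) drops everything
        # except the N highest-scoring members.
        commands.append(("ZREMRANGEBYRANK", key, "0", str(-(MAX_PER_PREFIX + 1))))
    return commands
```

A slug of length L yields 2L commands, which is the O(L) write cost described above. A batch sync job could generate these tuples for each changed tag and send them in one pipelined round trip.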

Question 3

How does AND-based tag search scale with large tag sets?

Accepted Answer

AND search returns content that has ALL K specified tags. The naive approach (GROUP BY content_id HAVING COUNT(DISTINCT tag_id) = K) reads every ContentTag row that matches any of the K tags. For K=3 and 10M tag assignments, this might scan 3M rows. Optimization: start with the rarest tag (the one with the lowest usage_count) and get its content IDs. Then keep only those content IDs that also have the next rarest tag, and so on. Query: WITH rarest AS (SELECT content_id FROM ContentTag WHERE tag_id=rarest_tag), second AS (SELECT ct.content_id FROM ContentTag ct JOIN rarest r USING (content_id) WHERE ct.tag_id=second_tag) SELECT ct.content_id FROM ContentTag ct JOIN second s USING (content_id) WHERE ct.tag_id=third_tag. Intersecting the smallest sets first keeps intermediate results small, because each step can only shrink the candidate set.
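The rarest-first idea can be shown without a database. The sketch below assumes a hypothetical in-memory index mapping each tag to the set of content IDs that carry it, and it uses the size of each set as a stand-in for usage_count; it illustrates the ordering, not the SQL execution plan.

```python
from typing import Dict, Iterable, Set


def and_search(tag_index: Dict[str, Set[int]], required_tags: Iterable[str]) -> Set[int]:
    """Return content IDs that carry every required tag, rarest tag first."""
    postings = [tag_index.get(tag, set()) for tag in required_tags]
    if not postings:
        return set()
    # Process the smallest sets first so the candidate set starts small.
    postings.sort(key=len)
    result = set(postings[0])
    for ids in postings[1:]:
        result &= ids
        if not result:
            break  # no content can match once the intersection is empty
    return result
```

With an index such as {"python": {1, 2, 3, 4}, "redis": {2, 3}, "cache": {3, 9}}, a search for all three tags starts from one of the two-element sets rather than from the four-element "python" set.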

Question 4

How do you validate and sandbox user-generated tags on a marketplace?

Accepted Answer

Marketplaces (Etsy, Shopify) have strict tag policies: tags must describe the product, not be promotional, and not contain prohibited terms. Validation pipeline: (1) blocklist check — reject tags matching prohibited words (brand names, profanity, competitor names); (2) allowlist for sensitive categories — jewelry and art tags must be from an approved vocabulary; (3) length and character limits — max 50 characters, no URLs; (4) AI classifier for spam/irrelevant tags (a "free shipping" tag on a product is not descriptive); (5) human review queue for new sellers (first 30 days, all tags reviewed). Apply lighter restrictions for established sellers with clean track records. Shadow-ban policy violators: their tags show to themselves but not to other users — they don't know they're sandboxed.
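The rule-based stages of this pipeline can be sketched as a single function. Everything named here is hypothetical: the blocklist entries, the approved jewelry vocabulary, the URL pattern, and the three-way "accept" / "review" / "reject" result are assumptions for illustration. The AI classifier in step (4) and the lighter treatment of established sellers are left out.

```python
import re

MAX_TAG_LENGTH = 50    # from the text
NEW_SELLER_DAYS = 30   # from the text: all tags reviewed in the first 30 days

# Hypothetical policy data; a real marketplace would load these from config.
BLOCKLIST = {"acme", "cheapest"}
SENSITIVE_ALLOWLISTS = {"jewelry": {"sterling silver", "gold plated"}}
URL_PATTERN = re.compile(r"https?://|www\.", re.IGNORECASE)


def validate_tag(tag: str, category: str, seller_age_days: int) -> str:
    """Return "reject", "review", or "accept" for one submitted tag."""
    text = tag.strip().lower()
    # Step 3: length and character limits, no URLs.
    if not text or len(text) > MAX_TAG_LENGTH or URL_PATTERN.search(text):
        return "reject"
    # Step 1: blocklist check (exact match only in this sketch).
    if text in BLOCKLIST:
        return "reject"
    # Step 2: sensitive categories must use the approved vocabulary.
    allowed = SENSITIVE_ALLOWLISTS.get(category)
    if allowed is not None and text not in allowed:
        return "reject"
    # Step 5: new sellers go to the human review queue.
    if seller_age_days < NEW_SELLER_DAYS:
        return "review"
    return "accept"
```

Cheap deterministic checks run first so that the review queue only receives tags that already pass the hard rules.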

Question 5

How do ML-generated tags integrate with user-generated ones in search ranking?

Accepted Answer

ML tags (source='ml_model', e.g. confidence=0.85) differ in reliability from user tags (source='user'). In search, weight ML tags by confidence: a 0.9-confidence ML tag is nearly as strong as a user tag, while a 0.7-confidence tag contributes 70% as much to relevance. In display: show ML tags below user tags, in a lighter style ("Suggested tags"). For indexing: include ML tags with confidence ≥ 0.7 in search and exclude those below the threshold to prevent low-quality matches. For AND search: should a 0.7-confidence ML tag satisfy a required tag? No. Require confidence ≥ 0.85 before an ML tag satisfies an AND criterion the way a user tag does. Track the user acceptance rate of ML-suggested tags (did users confirm or dismiss them?) and use it as a feedback signal to retrain the tagging model quarterly.
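The thresholds can be expressed as two small functions. This is a sketch under assumptions: the function names are hypothetical, user tags are given a fixed weight of 1.0, and the indexing threshold is applied before confidence weighting, so an ML tag below 0.7 contributes nothing.

```python
INDEX_MIN_CONFIDENCE = 0.7   # from the text: ML tags below this are not indexed
AND_MIN_CONFIDENCE = 0.85    # from the text: minimum to satisfy an AND criterion


def relevance_weight(source: str, confidence: float = 1.0) -> float:
    """Weight a tag's contribution to search relevance."""
    if source == "user":
        return 1.0  # assumed baseline weight for user tags
    if confidence < INDEX_MIN_CONFIDENCE:
        return 0.0  # not indexed, so it cannot contribute
    return confidence


def satisfies_and(source: str, confidence: float = 1.0) -> bool:
    """Decide whether a tag may satisfy a required tag in AND search."""
    return source == "user" or confidence >= AND_MIN_CONFIDENCE
```

An ML tag with confidence between 0.7 and 0.85 therefore still improves ranking for ordinary queries but cannot by itself make content match an AND query.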

Content Tagging System Low-Level Design: Normalization, Autocomplete, and Tag-Based Search

Core Data Model

Normalizing and Applying Tags

Tag Autocomplete (Redis Sorted Sets)

Tag-Based Content Search

Key Interview Points