Question 1

How do you normalize tags to prevent duplicates like "Machine Learning" and "machine-learning"?

Accepted Answer

Normalize on write: trim whitespace → lowercase → replace spaces and underscores with hyphens → strip every character that is not alphanumeric or a hyphen → collapse repeated hyphens → trim leading and trailing hyphens. Apply this function to every tag before any lookup or insert, and store only the normalized form in the Tag table with a UNIQUE constraint on name. When displaying tags, show the normalized form. "Machine Learning", "machine_learning", and "machine-learning" all normalize to "machine-learning" and resolve to the same Tag row. Run normalization in application code (not just the DB) so lookups also normalize before searching.
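The steps above can be sketched as a single function. This is a minimal illustration, not a definitive implementation: the name normalize_tag is hypothetical, and it assumes ASCII-only tags, so accented or non-Latin letters are dropped rather than transliterated.

```python
import re


def normalize_tag(raw: str) -> str:
    """Return the canonical form of a tag, or "" if nothing usable remains."""
    tag = raw.strip().lower()
    tag = re.sub(r"[\s_]+", "-", tag)     # spaces and underscores become hyphens
    tag = re.sub(r"[^a-z0-9-]", "", tag)  # assumption: keep only a-z, 0-9, hyphen
    tag = re.sub(r"-{2,}", "-", tag)      # collapse runs of hyphens
    return tag.strip("-")                 # no leading or trailing hyphen


for raw in ("Machine Learning", "machine_learning", " machine--learning "):
    print(normalize_tag(raw))  # prints "machine-learning" each time
```

A caller would typically reject the tag when the function returns an empty string, since input such as "!!!" normalizes to nothing.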

Question 2

How do you implement tag autocomplete with sub-50ms latency?

Accepted Answer

Enable the pg_trgm extension and create a GIN trigram index: CREATE EXTENSION IF NOT EXISTS pg_trgm; CREATE INDEX ON Tag USING GIN (name gin_trgm_ops). This lets PostgreSQL use the index for ILIKE prefix queries (name ILIKE 'mach%') and for fuzzy matching (name % 'machin'). Sort results with exact prefix matches first, then by usage_count DESC so popular tags come first. Cache autocomplete results in Redis with TTL=60s for the most common prefixes; one- and two-character prefixes receive the most traffic and are also the least selective, so they benefit most. Longer inputs, such as a 10-character prefix, return a small and stable result set, so those entries are cheap to cache as well.
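The ranking and caching logic can be sketched independently of the database. In the sketch below an in-process dictionary stands in for Redis, and rank_suggestions, autocomplete, and the fetch callback are hypothetical names; fetch is assumed to return (name, usage_count) pairs from the trigram-indexed query.

```python
import time

CACHE_TTL_SECONDS = 60  # TTL taken from the answer above
_cache = {}             # prefix -> (stored_at, suggestions); stand-in for Redis


def rank_suggestions(prefix, candidates, limit=10):
    """Order (name, usage_count) pairs: prefix matches first, then by popularity."""
    ordered = sorted(
        candidates,
        key=lambda c: (not c[0].startswith(prefix), -c[1], c[0]),
    )
    return [name for name, _ in ordered[:limit]]


def autocomplete(prefix, fetch, now=time.monotonic):
    """Return cached suggestions for a prefix, querying via fetch() on a miss."""
    prefix = prefix.strip().lower()
    hit = _cache.get(prefix)
    if hit is not None and now() - hit[0] < CACHE_TTL_SECONDS:
        return hit[1]
    result = rank_suggestions(prefix, fetch(prefix))
    _cache[prefix] = (now(), result)
    return result
```

Passing the clock in as a parameter is only a convenience for testing expiry; a Redis-backed version would rely on the key's own TTL instead.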

Question 3

How do you find related tags (tags that appear together)?

Accepted Answer

Self-join ContentTag: SELECT t2.tag_id, COUNT(*) AS co_count FROM ContentTag t1 JOIN ContentTag t2 ON t1.content_id = t2.content_id AND t2.tag_id != t1.tag_id WHERE t1.tag_id = %(id)s GROUP BY t2.tag_id ORDER BY co_count DESC LIMIT 10. This finds every tag that appears on the same content as the target tag, ranked by co-occurrence frequency. The query is expensive for popular tags, so run it asynchronously and cache the results per tag (TTL=1h), refreshing the cached entry when new content is tagged. On a high-traffic site where related tags must be served with low latency, precompute the counts nightly and store them in a TagCoOccurrence table.
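To see the self-join in action, the following sketch runs the same query against an in-memory SQLite database. SQLite and the sample rows are assumptions made purely so the example runs standalone. The parameter style is :id rather than %(id)s, and a tag_id tie-breaker is added to the ORDER BY so the output is deterministic.

```python
import sqlite3

RELATED_SQL = """
SELECT t2.tag_id, COUNT(*) AS co_count
FROM ContentTag t1
JOIN ContentTag t2
  ON t1.content_id = t2.content_id AND t2.tag_id != t1.tag_id
WHERE t1.tag_id = :id
GROUP BY t2.tag_id
ORDER BY co_count DESC, t2.tag_id
LIMIT 10
"""


def related_tags(conn, tag_id):
    """Return (tag_id, co_count) rows for tags sharing content with tag_id."""
    return conn.execute(RELATED_SQL, {"id": tag_id}).fetchall()


def demo_connection():
    """Build a tiny ContentTag table; the rows are made-up sample data."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE ContentTag ("
        "content_id INTEGER, tag_id INTEGER, PRIMARY KEY (content_id, tag_id))"
    )
    rows = [(1, 10), (1, 20), (1, 30), (2, 10), (2, 20), (3, 10), (3, 40)]
    conn.executemany("INSERT INTO ContentTag VALUES (?, ?)", rows)
    return conn


print(related_tags(demo_connection(), 10))  # [(20, 2), (30, 1), (40, 1)]
```

In the sample data tag 20 shares two content items with tag 10, while tags 30 and 40 share one each, which is the ranking the query returns.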

Question 4

How do you enforce a maximum number of tags per item?

Accepted Answer

Validate at the application layer before the database write, and reject the request if the resulting tag count would exceed the limit. Prefer a set_tags operation (replace the entire tag list in one transaction) over individual add/remove calls: the check is then simply whether the submitted list is within the limit, so two concurrent add operations cannot both read a count under the limit and together push the item over it. At the database layer, a plain CHECK constraint cannot count rows and a partial index cannot cap them, so a hard backstop needs a trigger that counts the item's ContentTag rows on insert. Application-level validation with clear error messages remains the primary enforcement.
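A replace-the-whole-list operation might look like the sketch below. It uses SQLite so it runs standalone; MAX_TAGS, TooManyTags, and the ContentTag schema are illustrative assumptions, and only the set_tags name comes from the answer.

```python
import sqlite3

MAX_TAGS = 5  # hypothetical limit
SCHEMA = (
    "CREATE TABLE ContentTag ("
    "content_id INTEGER, tag_id INTEGER, PRIMARY KEY (content_id, tag_id))"
)


class TooManyTags(ValueError):
    """Raised before any write when the submitted list exceeds the limit."""


def set_tags(conn, content_id, tag_ids, max_tags=MAX_TAGS):
    """Replace all tags on one content item, enforcing the limit up front."""
    unique_ids = sorted(set(tag_ids))  # duplicates do not count toward the limit
    if len(unique_ids) > max_tags:
        raise TooManyTags(
            f"at most {max_tags} tags allowed, got {len(unique_ids)}"
        )
    with conn:  # one transaction: commits on success, rolls back on error
        conn.execute("DELETE FROM ContentTag WHERE content_id = ?", (content_id,))
        conn.executemany(
            "INSERT INTO ContentTag (content_id, tag_id) VALUES (?, ?)",
            [(content_id, tag_id) for tag_id in unique_ids],
        )
    return unique_ids
```

Because the limit is checked against the submitted list rather than a count read from the table, the validation never depends on what another request is writing at the same moment.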

Question 5

How do you efficiently paginate through all content with a specific tag?

Accepted Answer

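Index ContentTag on (tag_id, content_id, content_type). Use cursor pagination with the content's created_at as the cursor: WHERE tag_id=%(tid)s AND content_type=%(ctype)s AND c.created_at < %(cursor)s ORDER BY c.created_at DESC LIMIT 20. Because several rows can share a created_at value, add content_id as a tie-breaker in both the cursor and the ORDER BY so no row is skipped or repeated. A keyset seek costs O(log N) to find the start of a page no matter how deep the page is, but only when the sort key is in an index the planner can walk; since created_at lives on the content table here, that usually means indexing it there or denormalizing it onto ContentTag. For "top" sorting (by upvote or view count), the cursor becomes a compound (score, content_id) tuple. Avoid OFFSET: at page 101, OFFSET 2000 scans and discards 2000 rows even though only 20 are returned.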
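The cursor logic can be sketched as follows, again on SQLite so the example is self-contained. The Content table, its integer created_at column, and the fetch_page helper are assumptions for illustration; the content_type filter is omitted for brevity, and the cursor is the compound (created_at, id) pair so that ties on created_at are handled.

```python
import sqlite3

SCHEMA = """
CREATE TABLE Content (id INTEGER PRIMARY KEY, created_at INTEGER NOT NULL);
CREATE TABLE ContentTag (
    content_id INTEGER NOT NULL REFERENCES Content(id),
    tag_id INTEGER NOT NULL,
    PRIMARY KEY (tag_id, content_id)
);
"""

PAGE_SQL = """
SELECT c.id, c.created_at
FROM ContentTag ct
JOIN Content c ON c.id = ct.content_id
WHERE ct.tag_id = :tid
  AND (:cursor_ts IS NULL
       OR c.created_at < :cursor_ts
       OR (c.created_at = :cursor_ts AND c.id < :cursor_id))
ORDER BY c.created_at DESC, c.id DESC
LIMIT :limit
"""


def fetch_page(conn, tag_id, cursor=None, limit=20):
    """Return (rows, next_cursor); pass next_cursor back in to get the next page."""
    cursor_ts, cursor_id = cursor if cursor else (None, None)
    rows = conn.execute(
        PAGE_SQL,
        {"tid": tag_id, "cursor_ts": cursor_ts, "cursor_id": cursor_id, "limit": limit},
    ).fetchall()
    # A short page means there is nothing left; otherwise resume after the last row.
    next_cursor = (rows[-1][1], rows[-1][0]) if len(rows) == limit else None
    return rows, next_cursor
```

A client keeps calling fetch_page with the returned cursor until it comes back as None. When the total is an exact multiple of the page size, the last call returns an empty page, which is harmless.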

Tagging System Low-Level Design

Core Data Model

Tag Normalization

Adding and Removing Tags

Tag Autocomplete

Browsing Content by Tag

Key Interview Points