Enterprise search powers knowledge discovery within organizations — searching across documents, messages, code, and wikis. Unlike web search, which ranks largely by popularity and link structure, enterprise search must respect permissions, understand organizational context, and handle diverse content types. This guide covers the architecture of enterprise search systems like Slack Search, Confluence Search, and Glean — a system design question increasingly asked at productivity and SaaS companies.
Indexing Pipeline
Enterprise search indexes content from multiple sources: messages (Slack, Teams), documents (Google Drive, Confluence, Notion), code (GitHub, GitLab), tickets (Jira, Linear), and email (Gmail, Outlook). Ingestion: (1) Connectors — one per source. Each connector authenticates via OAuth, crawls content (messages, documents, pages), and extracts text content, metadata (author, timestamp, channel/space, tags), and permissions (who can access the content). Incremental sync: after the initial full crawl, listen for change events (webhooks, polling) to index new or modified content within seconds. (2) Processing — for each document: extract text from various formats (HTML, PDF, DOCX, Markdown), run NLP enrichment (entity extraction, topic classification, language detection), compute embeddings (for semantic search), and store the processed document in the search index. (3) Permissions indexing — for each document, index the ACL (Access Control List): which users, groups, or roles can read it. This is the hardest part of enterprise search: permissions are complex (nested groups, inherited permissions, per-document overrides) and change frequently. A permission change must propagate to the search index immediately (a user removed from a channel should not see its messages in search results).
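The processing step above can be sketched as a small function that turns a connector's raw payload into an indexable record with a flattened ACL. This is a minimal illustration, not any vendor's actual pipeline; the field names (`allowed_users`, `allowed_groups`, `accessible_by`) are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class IndexRecord:
    """A processed document ready for the search index."""
    doc_id: str
    title: str
    body: str
    author: str
    updated_at: datetime
    accessible_by: list[str] = field(default_factory=list)  # flattened ACL principals

def process_document(raw: dict) -> IndexRecord:
    """Turn a connector's raw payload into an index record.

    The ACL is flattened into principal strings ("user:123", "group:eng")
    so the index can filter on a single accessible_by field at query time.
    Hypothetical payload keys: allowed_users, allowed_groups, updated.
    """
    principals = [f"user:{u}" for u in raw.get("allowed_users", [])]
    principals += [f"group:{g}" for g in raw.get("allowed_groups", [])]
    return IndexRecord(
        doc_id=raw["id"],
        title=raw.get("title", ""),
        body=raw.get("text", ""),
        author=raw.get("author", "unknown"),
        updated_at=datetime.fromtimestamp(raw["updated"], tz=timezone.utc),
        accessible_by=principals,
    )
```

Flattening nested groups into a single principal list at index time is what makes the later permission filter a simple terms match rather than a recursive group lookup on every query.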
Query Understanding
Enterprise search queries are often imprecise: “that architecture doc from last month” or “John quarterly report.” Query understanding transforms the raw query into a structured search: (1) Intent classification — is the user looking for a specific document (navigational: “onboarding checklist”), information (informational: “how do I request PTO”), or a person/team (entity: “who owns the billing service”)? (2) Entity recognition — identify people (“John” -> John Smith, Engineering), dates (“last month” -> March 2026), projects (“Project Atlas”), and channels/spaces. (3) Query expansion — add synonyms and related terms. “PTO” expands to “PTO OR paid time off OR vacation.” Learn expansions from click data (queries that lead to clicks on the same document are related). (4) Spelling correction — “kuberntes” -> “kubernetes.” Use an edit-distance model trained on the organization's vocabulary. (5) Personal context — the same query means different things to different people. “standup notes” for an engineer returns their team's standup channel. For a PM, it returns the product standup. Use the searcher's team, recent activity, and role to personalize results. Autocomplete: as the user types, suggest recent searches, document titles, people names, and channel names. Prioritize by recency (recently accessed content), frequency (frequently searched terms), and relevance (matching the user's context).
Ranking and Relevance
Enterprise search ranking differs from web search: no PageRank (internal documents do not have link structure), freshness matters more (recent messages are more relevant), and organizational context is critical (my team's documents are more relevant than another team's). Ranking signals: (1) Text relevance — BM25 score on title, body, and metadata fields. Title matches weighted 3x. (2) Freshness — exponential decay: recent documents score higher. A message from today is more relevant than one from 6 months ago (all else equal). (3) Organizational proximity — documents from the searcher's team, channels they are in, and people they interact with frequently score higher. (4) Engagement — documents frequently viewed, edited, or shared are likely important. Click-through data from search results improves ranking over time. (5) Document quality — longer, well-structured documents (with headings, images, proper formatting) score higher than one-line messages. (6) Semantic similarity — embedding-based similarity between the query and document content. Captures meaning beyond keyword matching. “How do I deploy?” matches a document titled “Deployment Guide” even without the word “deploy” in the query. Hybrid search: combine BM25 (keyword) with vector similarity (semantic). Reciprocal Rank Fusion (RRF) merges the two ranked lists. This captures both exact keyword matches and semantic relevance.
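Reciprocal Rank Fusion is simple enough to show in full. Each document's fused score is the sum of 1/(k + rank) across the ranked lists it appears in; k (conventionally 60) damps the influence of top ranks so that a document ranked well in both lists beats one ranked first in only one.

```python
def rrf_merge(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank(d))."""
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

With a BM25 list `["a", "b", "c"]` and a vector-similarity list `["b", "c", "a"]`, document "b" (ranked 2nd and 1st) fuses ahead of "a" (1st and 3rd) — consistent appearance near the top wins.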
Permission-Aware Search
The core constraint: a user must only see search results they have permission to access. A search result showing the title of a confidential document (even without the content) is a permission violation. Implementation approaches: (1) Post-filtering — execute the search without permission checks, then filter out results the user cannot access. Problem: if the top 100 results are all inaccessible, the user sees 0 results despite there being relevant accessible content further down. The system must fetch extra results to compensate. Wasteful for restricted content. (2) Pre-filtering — include the user's accessible document IDs (or group memberships) as a filter in the search query. The index only returns accessible documents. Problem: a user in 100 groups with access to 1M documents creates a large filter clause. Elasticsearch performance degrades with very large filter lists. (3) Security field indexing — for each document, index the list of users/groups that can access it. The search query includes: AND (accessible_by:user_id OR accessible_by:group1 OR accessible_by:group2 …). The index evaluates this filter efficiently using an inverted index on the accessible_by field. This is the standard approach. Challenge: permission changes must propagate instantly. If a user is removed from a channel at 3:00 PM, by 3:01 PM they should not see that channel's messages in search. Implementation: on permission change events (user removed from channel, document sharing revoked), immediately update the accessible_by field in the search index. For high-frequency permission changes: batch updates every 10 seconds rather than per-event to reduce index pressure.
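The security-field approach maps directly onto an Elasticsearch-style bool query: relevance scoring goes in the `must` clause, and the permission check goes in a `filter` clause (filters are cacheable and do not affect scoring). A sketch, assuming the hypothetical `accessible_by` field from the indexing section and a principal naming scheme of `user:<id>` / `group:<id>`:

```python
def build_search_query(text: str, user_id: str, group_ids: list[str]) -> dict:
    """Build an Elasticsearch-style query: BM25 relevance in `must`,
    permission check as a `terms` filter on the accessible_by field.
    """
    principals = [f"user:{user_id}"] + [f"group:{g}" for g in group_ids]
    return {
        "query": {
            "bool": {
                # Relevance: match title (weighted 3x, per the ranking section) and body.
                "must": [
                    {"multi_match": {"query": text, "fields": ["title^3", "body"]}}
                ],
                # Permission: document is visible if ANY of its accessible_by
                # principals matches the searcher's user ID or groups.
                "filter": [
                    {"terms": {"accessible_by": principals}}
                ],
            }
        }
    }
```

Because `terms` filters are evaluated against the inverted index and cached, this stays cheap even for users in many groups — which is why security field indexing is preferred over passing raw document-ID lists.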
Federated Search and Result Presentation
Enterprise search spans multiple content types. The results page shows a unified view: (1) Federated search — the query is executed against multiple indexes simultaneously: messages index, documents index, code index, people index. Each returns its top-K results. The frontend merges and groups results by type. (2) Result grouping — show results grouped by category: “Messages (12 results)” with the top 3, “Documents (5 results)” with the top 3, “People (2 matches).” The user can expand each category or see all results in a unified ranked list. (3) Snippet generation — for each result, show a relevant excerpt. Highlight the query terms within the excerpt. For long documents: show the paragraph most relevant to the query (not just the beginning). Use BM25 on paragraphs within the document to select the best snippet. (4) Filters and facets — allow refinement by content type (messages, documents, code), author, date range, channel/space, and file type. Facet counts show how many results match each filter value. (5) Instant answers — for simple questions (“What is the Wi-Fi password?”, “When is the company holiday party?”), extract the answer directly from the source document and show it above the results. Use an extractive QA model (BERT on the top-ranked document) or RAG (LLM generates an answer from retrieved context). This provides instant value without requiring the user to click into a document.