AI-powered knowledge bases (Notion AI, Confluence AI, Glean) answer questions by searching and synthesizing information from organizational documents. Designing a knowledge base AI tests your understanding of: document chunking and embedding, semantic search, retrieval-augmented generation (RAG), prompt management, and the challenges of answering questions from diverse document types. This is a timely system design question as every productivity tool adds AI features.
Document Ingestion and Chunking
The knowledge base contains thousands of documents: wiki pages, meeting notes, project docs, and policies. Each must be processed for AI search. The ingestion pipeline:

(1) Extract text — handle multiple formats: Markdown (Notion pages), HTML (Confluence), PDF (uploaded documents), Google Docs (via API), and Slack messages. Preserve document structure: headings, lists, tables, and code blocks.

(2) Chunk — split long documents into search-friendly chunks at semantic boundaries (headings, paragraphs) rather than at fixed character counts. Target chunk size: 500-1000 tokens, with a 100-token overlap between consecutive chunks. Why overlap: a question may span a chunk boundary; overlap ensures relevant information is not split across two chunks that each miss the point.

(3) Enrich metadata — each chunk stores: source_document_id, document_title, chunk_position (first/middle/last), headings_hierarchy (which section is this chunk in?), author, last_modified, and access_permissions.

(4) Embed — generate a vector embedding for each chunk using a text embedding model (OpenAI text-embedding-3-small at 1536 dimensions, or open-source models such as E5-large or BGE-large). The embedding captures the chunk's semantic meaning.

(5) Store — insert the chunk text, metadata, and embedding into a vector database (pgvector, Pinecone, or Qdrant), indexed for fast ANN retrieval.

Incremental updates: when a document is edited, re-chunk and re-embed only the changed sections. Detect changes via the document's last_modified timestamp or via edit webhooks from the source platform.
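The chunking step can be sketched in a few lines. This is a minimal illustration, not a production implementation: it approximates tokens by whitespace-separated words (a real pipeline would use the embedding model's tokenizer) and assumes Markdown headings mark the semantic boundaries.

```python
import re

def chunk_document(text, max_tokens=800, overlap=100):
    """Split a document at heading boundaries; apply a sliding window
    with `overlap` shared tokens to sections longer than max_tokens.
    Tokens are approximated by whitespace-separated words here."""
    # Split before each Markdown heading so chunks align with sections.
    sections = [s for s in re.split(r"(?m)^(?=#{1,6} )", text) if s.strip()]
    chunks = []
    for section in sections:
        words = section.split()
        if len(words) <= max_tokens:
            chunks.append(" ".join(words))
            continue
        # Long section: overlapping windows so a fact near a boundary
        # appears whole in at least one chunk.
        step = max_tokens - overlap
        for start in range(0, len(words), step):
            chunks.append(" ".join(words[start:start + max_tokens]))
            if start + max_tokens >= len(words):
                break
    return chunks
```

With real settings (max_tokens=800, overlap=100) each consecutive pair of chunks from a long section shares 100 words of context.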
Question Answering Pipeline
When a user asks: “What is our refund policy for enterprise customers?”

(1) Query embedding — embed the question with the same embedding model used for the chunks. This places the question in the same vector space as the document chunks.

(2) Retrieval — ANN search in the vector database: find the K most similar chunks (K=5-10). These are the candidate context chunks. Filter by access permissions (the user can only see chunks from documents they have access to) and, optionally, by document type or recency.

(3) Re-ranking — the initial retrieval uses bi-encoder similarity (fast but coarse). A cross-encoder re-ranker processes each (question, chunk) pair together for more accurate relevance scoring. Re-rank the top-K candidates and select the top-N (N=3-5) for the final context.

(4) Prompt construction — assemble the LLM prompt: a system instruction (“Answer based on the provided context. If the answer is not in the context, say you do not know.”) + the retrieved chunks (with source citations) + the user question.

(5) Generation — the LLM generates an answer grounded in the retrieved context, including citations: “According to the Enterprise Refund Policy document (last updated March 2026), enterprise customers are eligible for…”

(6) Post-processing — verify the answer references the provided context (a basic faithfulness check). Format citations as clickable links to the source documents. Display confidence: if the retrieval scores are low, show a disclaimer (“I found limited information on this topic”).
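The core of this pipeline can be compressed into a sketch. Assumptions: `embed` and `generate` are injected stand-ins for the embedding model and the LLM, brute-force cosine similarity replaces the vector database's ANN search, and the cross-encoder re-ranking step is elided.

```python
import math

def answer_question(question, chunks, embed, generate, k=10, n=3):
    """Minimal RAG sketch. `chunks` is a list of dicts with
    'text', 'source', and 'embedding' keys."""
    q = embed(question)

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
        return dot / norm

    # Retrieval: rank every chunk by similarity (a vector DB would use
    # ANN search), keep the top-K candidates, take the top-N as context.
    candidates = sorted(chunks, key=lambda c: cosine(q, c["embedding"]),
                        reverse=True)[:k]
    top = candidates[:n]  # a cross-encoder re-ranker would refine this cut

    # Prompt construction: instruction + cited context + question.
    context = "\n\n".join(f"[{c['source']}]\n{c['text']}" for c in top)
    prompt = ("Answer based on the provided context. If the answer is not "
              "in the context, say you do not know.\n\n"
              f"Context:\n{context}\n\nQuestion: {question}")
    return generate(prompt), [c["source"] for c in top]
```

Injecting `embed` and `generate` keeps the retrieval logic testable without calling external model APIs.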
Handling Complex Queries
Simple factual questions (“What is the refund policy?”) are straightforward retrieval. Complex queries require more sophistication:

(1) Multi-hop questions — “Who approved the budget increase for Project Atlas last quarter?” requires finding Project Atlas documents, finding budget-related sections, identifying the approval, and finding the approver. Agentic RAG: the system performs multiple retrieval steps, refining its search based on intermediate findings.

(2) Comparative questions — “How does our vacation policy differ between US and EU offices?” requires retrieving BOTH the US and EU policy documents, extracting the relevant sections from each, and synthesizing a comparison. Multi-retrieval: search with multiple queries (“US vacation policy” AND “EU vacation policy”) and combine the results.

(3) Temporal questions — “What changed in the security policy since January?” requires retrieving both the current version and the January version and computing the diff. Document versioning: store historical versions and enable time-based retrieval.

(4) Aggregation questions — “How many open incidents do we have?” cannot be answered from documents; it requires a live data query. Recognize the intent and route to the appropriate system (database query, API call) rather than document retrieval.

Hybrid approach: combine document-based RAG with tool use (query databases, call APIs) for comprehensive answers.
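The routing decision these query types imply can be sketched as a classifier. The keyword rules below are purely illustrative; a production system would use an LLM-based intent classifier rather than substring matching.

```python
def route_query(question):
    """Toy intent router: map a question to one of four handling
    strategies. Keyword heuristics stand in for an LLM classifier."""
    q = question.lower()
    if any(p in q for p in ("how many", "count of", "number of")):
        return "tool_call"        # aggregation: query a live system, not docs
    if any(p in q for p in ("differ", "compare", " vs ", "versus")):
        return "multi_retrieval"  # comparative: one search per entity
    if any(p in q for p in ("changed", "since", "before", "after")):
        return "temporal"         # needs versioned documents and a diff
    return "single_retrieval"     # simple factual lookup
```

The point is the routing structure, not the rules: aggregation questions bypass document retrieval entirely, comparative questions fan out into multiple searches, and temporal questions hit the version store.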
Evaluation and Quality
Measuring RAG quality:

(1) Retrieval quality — are the right chunks retrieved? Metrics: recall@K (what fraction of the relevant chunks appear in the top-K?) and MRR (how high is the first relevant chunk ranked?). Evaluate against a test set of questions with known-relevant documents.

(2) Answer quality — is the generated answer correct and complete? Metrics: faithfulness (does the answer only contain claims supported by the context?), relevance (does the answer address the question?), and completeness (does it cover all aspects?). Use LLM-as-judge: a stronger model evaluates the answer on these criteria.

(3) User feedback — thumbs up/down on answers. Track: helpful rate (thumbs up / total answers), citation click rate (users find the sources useful), and reformulation rate (a follow-up that rephrases the question signals the first answer was insufficient).

Improving quality:

(1) Better chunking — experiment with chunk sizes and overlap. Smaller chunks (300 tokens) are more precise; larger chunks (1000 tokens) provide more context.

(2) Hybrid search — combine vector similarity with BM25 keyword matching. Some questions are best answered by keyword match (“error code 4532”) rather than semantic similarity.

(3) Query expansion — the LLM reformulates the user question into a better search query: “refund policy enterprise” becomes “enterprise customer refund eligibility policy terms conditions.”

(4) Fine-tuned embeddings — fine-tune the embedding model on your domain data (query-document pairs from user feedback). Domain-specific embeddings outperform generic models.
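The retrieval metrics are simple to compute from a labeled test set. A minimal sketch, assuming retrieved results are ordered lists of chunk ids and the relevant sets come from human labels:

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant chunk ids that appear in the top-k results."""
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

def mrr(queries):
    """Mean reciprocal rank over a test set. `queries` is a list of
    (retrieved_ids, relevant_ids) pairs; each query contributes
    1/rank of its first relevant result, or 0 if none is retrieved."""
    total = 0.0
    for retrieved, relevant in queries:
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(queries)
```

Running these on every retrieval change (new chunk size, hybrid search, fine-tuned embeddings) gives a regression signal before any answer-quality evaluation is needed.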
Permissions and Data Freshness
Permission enforcement: the AI must only answer from documents the user has access to. Implementation:

(1) At retrieval time, filter by user permissions. The vector database query includes a metadata filter: accessible_by contains user_id OR user_groups.

(2) At ingestion time, store the document ACL with each chunk. When permissions change (a user is removed from a Notion workspace, document sharing is revoked), immediately update the chunk metadata in the vector database. A user removed at 3 PM should not see answers from that document at 3:01 PM.

(3) For shared AI (a team-wide assistant), the answer must not reveal information from a document that any member of the audience cannot access. In a channel-wide AI response, only use documents accessible to ALL channel members.

Data freshness: documents change frequently, and a policy updated yesterday should be reflected in today's answers. Freshness strategy:

(1) Webhook-based — when a document is edited (Notion webhook, Confluence event), re-process the changed document within minutes.

(2) Periodic full re-index — every 24 hours, re-crawl all documents and detect changes. This catches edits missed by webhooks and deleted documents (which are removed from the index).

(3) Source attribution with date — every answer includes “Source: Security Policy (last updated April 18, 2026).” Users can verify the answer is based on current information. If the source is stale (> 90 days old), show a warning (“This information may be outdated”).
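The retrieval-time permission check can be sketched as a post-filter over candidate chunks; in production the same condition is pushed down into the vector database query as a metadata filter so inaccessible chunks are never retrieved at all. The ACL field names here (`acl_users`, `acl_groups`) are illustrative, not a specific product's schema.

```python
def permission_filter(chunks, user_id, user_groups):
    """Keep only chunks whose source document grants access either to
    the user directly or to one of the user's groups. Each chunk is
    assumed to carry its document's ACL as 'acl_users'/'acl_groups' sets."""
    groups = set(user_groups)
    return [c for c in chunks
            if user_id in c["acl_users"] or groups & set(c["acl_groups"])]
```

For a channel-wide assistant, the same filter is applied once per audience member and only chunks that survive every pass are used.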