What Is a Web Crawler and Search Indexer?
A web crawler systematically downloads web pages. A search indexer processes those pages, extracts text, and builds an inverted index that maps keywords to the pages containing them. Together, they form the data collection pipeline behind every search engine. Google’s crawler (Googlebot) visits hundreds of billions of pages; the index supports billions of queries per day.
System Requirements
Functional
- Crawl the web starting from seed URLs
- Extract text content, links, and metadata from HTML pages
- Build and maintain an inverted index
- Respect robots.txt and crawl-delay directives
- Handle duplicate content across URLs
- Recrawl pages to update the index as content changes
Non-Functional
- Scale to billions of pages
- Crawl politely — don’t overwhelm any single server
- Index freshness: news pages refresh in minutes, static pages in months
Crawler Architecture
URL Frontier
The URL frontier is the queue of URLs to crawl. Requirements: (1) prioritize important URLs (high PageRank, frequently updated), (2) politeness — don’t crawl the same host too fast, (3) deduplication — don’t crawl the same URL twice.
Implementation: a two-level priority queue. Front queues (1 per priority tier) feed into back queues (1 per hostname). A crawler worker picks a back queue, waits for the crawl-delay of that hostname to expire, then fetches the URL.
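A minimal in-memory sketch of this two-level frontier, assuming Python; class and method names are illustrative, and a production frontier would bound back-queue sizes and persist state to disk:

```python
import time
from collections import defaultdict, deque
from urllib.parse import urlparse

class URLFrontier:
    """Two-level frontier: priority tiers feed per-host FIFO back queues."""

    def __init__(self, default_delay=1.0):
        self.front = {0: deque(), 1: deque(), 2: deque()}  # tier -> URLs (0 = highest)
        self.back = defaultdict(deque)                     # hostname -> URLs
        self.next_allowed = defaultdict(float)             # hostname -> earliest fetch time
        self.delay = default_delay

    def add(self, url, tier=1):
        self.front[tier].append(url)

    def _refill_back(self):
        # Drain front queues (highest tier first) into per-host back queues.
        for tier in sorted(self.front):
            while self.front[tier]:
                url = self.front[tier].popleft()
                self.back[urlparse(url).hostname].append(url)

    def next_url(self):
        """Return a URL whose host's crawl delay has expired, or None."""
        self._refill_back()
        now = time.time()
        for host, queue in self.back.items():
            if queue and now >= self.next_allowed[host]:
                self.next_allowed[host] = now + self.delay
                return queue.popleft()
        return None  # every host with pending URLs is in its politeness window
```

Because high tiers are drained first, their URLs reach the back queues sooner; the per-host timestamp enforces politeness regardless of priority.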
Deduplication
Use a Bloom filter (fast, memory-efficient, probabilistic) as the first layer. If the filter reports the URL as definitely not seen, add it to the frontier; if it reports probably seen, check the database to confirm. This reduces database lookups by ~99% while accepting a small false-positive rate (a genuinely new URL is occasionally skipped). For content deduplication: compute a SimHash of the page content. Pages with a SimHash Hamming distance below 3 are near-duplicates; index only one.
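As a sketch of the first layer, here is a small Bloom filter that derives its k positions from one SHA-256 digest (the double-hashing trick); the bit-array size and hash count are illustrative and would normally be tuned from the expected URL volume:

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1 << 20, num_hashes=5):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive k positions from two 64-bit halves of one SHA-256 digest.
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.size for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # False means definitely unseen; True means "probably seen".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

A False answer is guaranteed correct (no false negatives), so the URL can go straight to the frontier; only True answers trigger the database confirmation.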
HTML Parsing and Link Extraction
Parse HTML to extract: (1) visible text content for indexing, (2) outgoing links for further crawling, (3) canonical URL (handles URL normalization — trailing slashes, query parameter order), (4) noindex/nofollow directives, (5) last-modified header for recrawl scheduling.
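Using only Python's stdlib html.parser, items (1)-(4) might be sketched as below; a production crawler would use a more forgiving parser and handle multi-token rel attributes (e.g. "nofollow noopener"), which this sketch does not:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class PageExtractor(HTMLParser):
    """Collect visible text, outgoing links, and the canonical URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base = base_url
        self.text, self.links, self.canonical = [], [], None
        self._skip = 0  # nesting depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a" and attrs.get("href") and attrs.get("rel") != "nofollow":
            # Resolve relative links against the page URL (normalization step).
            self.links.append(urljoin(self.base, attrs["href"]))
        elif tag == "link" and attrs.get("rel") == "canonical":
            self.canonical = attrs.get("href")

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.text.append(data.strip())
```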
Politeness and robots.txt
Before crawling any host, fetch and parse robots.txt. Cache per host. Respect Disallow rules (don’t crawl blocked paths), Crawl-delay directive (wait N seconds between requests to the same host), and User-agent directives (specific rules for named crawlers). Rate limit per host: configurable, typically 1 request/second to avoid overloading small servers.
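Python's stdlib urllib.robotparser already implements Disallow and Crawl-delay handling; a small offline sketch (the inline robots.txt and the bot name are made up):

```python
from urllib.robotparser import RobotFileParser

# In production you would fetch https://<host>/robots.txt once per host
# and cache the parsed result; here we parse inline text to stay offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

allowed = rp.can_fetch("mybot", "https://example.com/public/page")
blocked = rp.can_fetch("mybot", "https://example.com/private/secret")
delay = rp.crawl_delay("mybot")  # seconds to wait between requests to this host
```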
Indexer Architecture
Text Processing Pipeline
- Tokenization: split text into words, handle punctuation
- Normalization: lowercase, remove accents
- Stop word removal: remove “the”, “a”, “is” — too common to be useful for ranking
- Stemming/Lemmatization: reduce “running”, “runs”, “ran” to “run”
- Position tracking: record which position each term appears at (needed for phrase queries)
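A toy version of the pipeline steps above, in Python; the stop-word list and the suffix-stripping "stemmer" are deliberate simplifications (real indexers use Porter/Snowball stemmers or lemmatizers):

```python
import re
import unicodedata

STOP_WORDS = {"the", "a", "an", "is", "of", "to", "and"}  # tiny illustrative set

def analyze(text):
    """Return (term, position) pairs after normalization and stop-word removal."""
    tokens = re.findall(r"\w+", text)  # tokenize: split on punctuation/whitespace
    out = []
    for pos, tok in enumerate(tokens):
        term = unicodedata.normalize("NFKD", tok.lower())
        term = "".join(c for c in term if not unicodedata.combining(c))  # strip accents
        if term in STOP_WORDS:
            continue  # position still advances, preserving phrase distances
        for suffix in ("ing", "s"):  # crude stand-in for a real stemmer
            if term.endswith(suffix) and len(term) > len(suffix) + 2:
                term = term[: -len(suffix)]
                break
        out.append((term, pos))
    return out
```

Note that positions count all tokens, including removed stop words, so phrase-query distances stay meaningful.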
Inverted Index Structure
Maps term → posting list. Each posting: (doc_id, term_frequency, positions[]).
Index structure:
"python" → [(doc_1, tf=5, [10,45,102]), (doc_3, tf=2, [7,88]), ...]
"interview" → [(doc_1, tf=3, [11,50,200]), (doc_2, tf=8, [1,3,5,...]), ...]
Posting lists are sorted by doc_id (enables efficient set intersection for AND queries) or by TF-IDF score (enables top-K retrieval without full list traversal).
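With doc_id-sorted posting lists, an AND query is a linear two-pointer merge; a sketch using the (doc_id, tf, positions) postings from the example above:

```python
def intersect(postings_a, postings_b):
    """Sorted-merge intersection of two doc_id-sorted posting lists.
    Runs in O(|A| + |B|); each posting is (doc_id, tf, positions)."""
    i = j = 0
    result = []
    while i < len(postings_a) and j < len(postings_b):
        da, db = postings_a[i][0], postings_b[j][0]
        if da == db:
            result.append(da)
            i += 1
            j += 1
        elif da < db:
            i += 1  # advance whichever list has the smaller doc_id
        else:
            j += 1
    return result
```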
Index Storage at Scale
The full web index is terabytes — cannot fit in RAM. Google uses a custom distributed file system (GFS) and BigTable. Open-source equivalent: build with Lucene segments (the underlying engine of Elasticsearch). Each Lucene segment is an immutable inverted index. New documents go into a fresh segment. Segments are merged periodically (smaller → larger). Merging is background and doesn’t block queries. Deleted documents are marked with a deletion bitmap; physically removed during merges.
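The segment lifecycle (immutable documents, a tombstone set, a merge that physically drops deletions) can be sketched in a few lines; all names here are illustrative, not the actual Lucene API:

```python
class Segment:
    """Immutable mini-segment with a deletion set (Lucene-style sketch)."""

    def __init__(self, docs):
        self.docs = dict(docs)   # doc_id -> content, frozen at creation
        self.deleted = set()     # deletions are only marked, never applied in place

    def delete(self, doc_id):
        self.deleted.add(doc_id)  # tombstone; physical removal waits for a merge

def merge(segments):
    """Build one larger segment, dropping tombstoned docs along the way."""
    live = {}
    for seg in segments:
        live.update((d, c) for d, c in seg.docs.items() if d not in seg.deleted)
    return Segment(live)
```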
Distributed Indexing
Partition the index by URL hash (document-partitioned index): each shard holds the full index for a subset of documents. A query fans out to all shards, each returns its top-K results, the coordinator merges and re-ranks. Alternatively: term-partitioned index (each shard holds postings for a subset of terms). Document-partitioned is preferred in practice — simpler sharding, easier to add/remove documents.
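The coordinator's merge step for a document-partitioned index reduces to a top-K selection over the shards' partial results; a sketch, assuming each shard has already scored and truncated its own hits:

```python
import heapq
from itertools import chain

def merge_shard_results(shard_results, k):
    """Coordinator-side merge: each shard returns at most k (score, doc_id)
    pairs; keep the global top-k ordered by descending score."""
    return heapq.nlargest(k, chain.from_iterable(shard_results))
```

In a real engine the coordinator would typically re-rank this merged candidate set with a heavier scoring model before returning results.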
Recrawl Strategy
Not all pages change at the same rate. Classify by freshness category:
- News pages: recrawl every 15 minutes (detect via news sitemaps or ping APIs)
- High-traffic pages: recrawl every 1–7 days (use HTTP If-Modified-Since; 304 Not Modified skips reindexing)
- Static pages: recrawl monthly
- Dead links: exponential backoff — crawl less frequently until confirmed dead
Track historical change rate per URL to predict optimal next crawl time.
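One simple way to use the change history is a multiplicative-update rule on the crawl interval; the halving/backoff factors and bounds below are illustrative, not taken from any particular crawler:

```python
def next_crawl_interval(prev_interval, changed,
                        min_interval=900, max_interval=30 * 24 * 3600):
    """Adapt the recrawl interval (seconds) from the last observation:
    shrink it when the page changed since the previous crawl, otherwise
    back off, clamped to [15 minutes, 30 days]."""
    if changed:
        interval = prev_interval / 2    # page is livelier than we thought
    else:
        interval = prev_interval * 1.5  # page is more static than we thought
    return max(min_interval, min(interval, max_interval))
```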
Interview Tips
- The URL frontier design (priority + politeness queues) is often the most interesting part — explain it carefully.
- Bloom filter for URL deduplication is a classic application — mention the false-positive rate and how you handle it.
- Distinguish the crawler (fetches pages) from the indexer (builds the inverted index) — they’re separate systems with separate scaling concerns.
- Mention sitemaps: websites publish sitemap.xml listing all their pages and last-modified dates — crawlers prioritize these over link discovery for fresh content.
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "How does the URL frontier ensure polite crawling without overloading servers?",
      "acceptedAnswer": { "@type": "Answer", "text": "The URL frontier uses a two-level queue architecture. Front queues are priority-based (high-priority URLs like news sites go in high-priority queues). Back queues (one per hostname) ensure politeness: a crawler worker selects a back queue, checks if enough time has passed since the last request to that host (respecting robots.txt Crawl-delay), then fetches the URL. If the crawl delay hasn't expired, the worker picks a different host's back queue. This guarantees: (1) no single host is flooded, (2) high-priority URLs are crawled before low-priority ones, (3) each hostname is crawled at a configurable rate (default: 1 req/sec for small sites). Robots.txt is fetched and cached per domain before crawling any pages on that domain." }
    },
    {
      "@type": "Question",
      "name": "How do you detect and handle duplicate web pages during crawling?",
      "acceptedAnswer": { "@type": "Answer", "text": "Two types of duplicates: URL duplicates (same page, different URL) and content duplicates (different URL, same/similar content). For URL duplicates: use a Bloom filter as the first check (fast, memory-efficient, probabilistic). If the URL is \"definitely not seen,\" add to crawl frontier. If \"possibly seen,\" check a persistent store (Redis set or database) to confirm. Bloom filters reduce database lookups by 99%+ at the cost of a small false-positive rate (URL accidentally skipped). For content duplicates: compute SimHash, a locality-sensitive hash where similar documents have similar hash values. Pages with Hamming distance below 3 are near-duplicates. Index only the canonical version (prefer HTTPS, shorter URL, more links pointing to it). SimHash is used by Google to detect near-duplicate web pages at scale." }
    },
    {
      "@type": "Question",
      "name": "How is an inverted index structured and how does it support boolean queries?",
      "acceptedAnswer": { "@type": "Answer", "text": "An inverted index maps each term to a posting list: term -> [(doc_id, tf, positions[]), ...]. Posting lists are sorted by doc_id. For a boolean AND query (\"python\" AND \"interview\"): fetch both posting lists and perform a sorted merge intersection, advancing the pointer of whichever list has the smaller doc_id until both lists are exhausted or a common doc_id is found. O(|A| + |B|) time. For OR: sorted merge union. For phrase queries (\"python interview\"): after finding common doc_ids, check position lists to confirm \"interview\" appears exactly one position after \"python.\" Posting lists are stored in compressed format (delta encoding: store differences between consecutive doc_ids, which are small, then varint-encode), reducing index size by 5-10x." }
    }
  ]
}