What Is a Web Scraper Service?
A Web Scraper Service is the component responsible for fetching raw HTML from target URLs, executing JavaScript if needed, extracting structured data, and persisting results for downstream consumers. It differs from a general crawler in that it targets specific schemas on known sites rather than discovering the open web. The design must handle rate limiting, dynamic rendering, anti-bot measures, and schema evolution across many target sites simultaneously.
Data Model / Schema
```sql
-- Scrape targets configuration
CREATE TABLE scrape_targets (
    target_id    INT PRIMARY KEY AUTO_INCREMENT,
    name         VARCHAR(128) NOT NULL,
    base_url     TEXT NOT NULL,
    extractor_fn VARCHAR(128) NOT NULL, -- name of parser class/function
    rate_limit   INT DEFAULT 1,         -- requests per second
    js_render    BOOLEAN DEFAULT FALSE,
    active       BOOLEAN DEFAULT TRUE
);

-- Scrape job queue
CREATE TABLE scrape_jobs (
    job_id     BIGINT PRIMARY KEY AUTO_INCREMENT,
    target_id  INT REFERENCES scrape_targets(target_id),
    url        TEXT NOT NULL,
    status     ENUM('queued', 'running', 'done', 'failed'),
    attempt    SMALLINT DEFAULT 0,
    worker_id  VARCHAR(64),
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    -- auto-touched on every write so the crash watchdog can detect stale jobs
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP
);

-- Extracted structured results
CREATE TABLE scrape_results (
    result_id  BIGINT PRIMARY KEY AUTO_INCREMENT,
    job_id     BIGINT REFERENCES scrape_jobs(job_id),
    target_id  INT,
    url        TEXT,
    payload    JSON,
    schema_ver SMALLINT,
    scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
```
Core Algorithm and Workflow
The service is organized into three layers:
- Fetch Layer: Workers dequeue jobs from `scrape_jobs` (or a Kafka topic). Each worker maintains a per-target token bucket for rate limiting. Static pages are fetched via an HTTP client pool. JavaScript-heavy pages are sent to a headless browser pool (Playwright or Puppeteer workers) that returns rendered HTML. Responses are streamed to avoid holding large payloads in memory.
- Extract Layer: Raw HTML is passed to the target-specific extractor function identified by `extractor_fn`. Extractors use CSS selectors or XPath patterns to pull structured fields. Results are validated against a JSON Schema before being accepted. Invalid results trigger a retry with a flag for manual review.
- Persist and Emit Layer: Validated payloads are written to `scrape_results` and also published to a downstream Kafka topic partitioned by `target_id`. Consumers (data warehouse loaders, API caches) subscribe independently. Deduplication is enforced by hashing the payload and checking for existing records with the same URL and hash within a 24-hour window.
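The per-target token bucket in the fetch layer can be sketched as follows. This is a minimal illustration, not the service's actual implementation; the `rate` parameter maps to `scrape_targets.rate_limit`, while the burst size and class shape are assumptions:

```python
import time

class TokenBucket:
    """Per-target rate limiter: allows `rate` requests/second with a small burst."""

    def __init__(self, rate: float, burst: int = 5):
        self.rate = rate           # tokens refilled per second (scrape_targets.rate_limit)
        self.capacity = burst      # maximum tokens that can accumulate
        self.tokens = float(burst)
        self.last = time.monotonic()

    def try_acquire(self) -> bool:
        """Consume one token if available; False means the caller must wait."""
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A worker would keep one bucket per `target_id` and call `try_acquire()` before each fetch, requeueing or sleeping when it returns `False`.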
Failure Handling
- Network errors and timeouts: Jobs retry with exponential backoff up to a configurable maximum (e.g., 5 attempts). After exhausting retries, jobs move to `failed` and trigger an alert if the failure rate for a target exceeds a threshold.
- Anti-bot detection: If a response returns a CAPTCHA page or a known block pattern, the worker marks the job as blocked and pauses all jobs for that target. A human operator or an automatic IP rotation mechanism intervenes before resuming.
- Schema drift: Extractor functions are versioned. If the extracted field count drops below an expected minimum, the result is rejected and a schema-drift alert fires, prompting an extractor update. Old results remain queryable via `schema_ver`.
- Worker crash: Jobs assigned to a crashed worker are left with a stale `updated_at` timestamp; a watchdog resets them to `queued` after a heartbeat timeout.
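The retry schedule for network errors can be sketched with capped exponential backoff plus full jitter. The 5-attempt cap comes from the text; the base delay, ceiling, and jitter strategy are illustrative assumptions:

```python
import random
from typing import Optional

MAX_ATTEMPTS = 5     # configurable maximum from the text
BASE_DELAY_S = 2.0   # assumed first-retry delay
MAX_DELAY_S = 300.0  # assumed ceiling so delays stay bounded

def next_retry_delay(attempt: int) -> Optional[float]:
    """Seconds to wait before retry number `attempt` (1-based), or None when
    the job should move to 'failed'. Full jitter spreads retries across
    workers so a flapping target is not hammered in lockstep."""
    if attempt > MAX_ATTEMPTS:
        return None
    backoff = min(MAX_DELAY_S, BASE_DELAY_S * (2 ** (attempt - 1)))
    return random.uniform(0.0, backoff)
```

The scheduler would write the computed delay into the job's next-eligible time and bump `attempt`; a `None` result is the signal to mark the job `failed` and feed the per-target failure-rate alert.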
Scalability Considerations
- Horizontal worker scaling: Fetch workers are stateless and scale independently. The job queue acts as the buffer. Workers for JS rendering are more expensive; scale those separately based on queue depth for JS-flagged jobs.
- Proxy rotation: Route outbound requests through a proxy pool partitioned by target domain to distribute egress IPs and reduce block rates. Track ban events per proxy-domain pair and evict banned proxies automatically.
- Extractor isolation: Run extractor functions in sandboxed subprocesses to prevent a buggy or malicious extractor from crashing the worker. Use resource limits (CPU, memory, timeout) enforced by the OS.
- Result storage tiering: Keep recent results (last 30 days) in a hot relational store for fast lookup. Archive older results to columnar storage (Parquet on S3) for analytics queries.
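The proxy-rotation bullet above (per-domain partitioning plus eviction of banned proxy-domain pairs) can be sketched like this. The threshold and data structures are assumptions for illustration:

```python
from collections import defaultdict
from typing import Optional

BAN_THRESHOLD = 3  # assumed: evict a proxy for a domain after this many ban events

class ProxyPool:
    """Round-robin proxy selection per target domain, with automatic
    eviction of proxies that accumulate ban events for that domain."""

    def __init__(self, proxies):
        self.proxies = list(proxies)
        self.bans = defaultdict(int)    # (proxy, domain) -> ban event count
        self.cursor = defaultdict(int)  # domain -> round-robin position

    def pick(self, domain: str) -> Optional[str]:
        candidates = [p for p in self.proxies
                      if self.bans[(p, domain)] < BAN_THRESHOLD]
        if not candidates:
            return None  # every proxy banned for this domain; page an operator
        proxy = candidates[self.cursor[domain] % len(candidates)]
        self.cursor[domain] += 1
        return proxy

    def record_ban(self, proxy: str, domain: str) -> None:
        """Called when a fetch through `proxy` hits a block pattern on `domain`."""
        self.bans[(proxy, domain)] += 1
```

Keeping ban counts keyed by (proxy, domain) rather than by proxy alone means a proxy blocked on one aggressive site stays usable for the rest of the pool's targets.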
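The extractor-isolation bullet can be sketched with a subprocess whose CPU and memory are capped via `resource.setrlimit` (Unix only). The budgets, the stdin/stdout protocol, and the `run_extractor` helper are all assumptions; a real worker would build the command line from `extractor_fn`:

```python
import resource
import subprocess
import sys

CPU_SECONDS = 10            # assumed per-extractor CPU budget
MEMORY_BYTES = 512 * 2**20  # assumed 512 MiB address-space cap
WALL_TIMEOUT_S = 30         # assumed wall-clock timeout

def _apply_limits():
    # Runs in the child just before exec (Unix only); the OS enforces the caps.
    resource.setrlimit(resource.RLIMIT_CPU, (CPU_SECONDS, CPU_SECONDS))
    resource.setrlimit(resource.RLIMIT_AS, (MEMORY_BYTES, MEMORY_BYTES))

def run_extractor(argv: list, html: str) -> str:
    """Run an extractor command with HTML on stdin, payload on stdout.
    A crashing or runaway extractor kills only the child, never the worker."""
    proc = subprocess.run(
        argv, input=html, capture_output=True, text=True,
        timeout=WALL_TIMEOUT_S, preexec_fn=_apply_limits,
    )
    if proc.returncode != 0:
        raise RuntimeError(f"extractor failed: {proc.stderr[:200]}")
    return proc.stdout
```

`subprocess.TimeoutExpired` and the `RuntimeError` both map naturally onto the job-retry path, so a misbehaving extractor degrades into an ordinary failed attempt.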
Summary
A production Web Scraper Service is more than an HTTP client with regex. The key design decisions are separating fetch, extract, and persist into independent layers; versioning extractors to handle schema drift; enforcing per-target rate limiting via token buckets; and building robust failure handling for blocks, timeouts, and validation errors. The stateless worker model makes horizontal scaling straightforward, with the job queue and proxy pool as the primary operational concerns.
Frequently Asked Questions

What are the key components of a scalable web scraper system design?
A scalable web scraper includes a URL frontier (queue of pages to fetch), a downloader layer that makes HTTP requests, an HTML parser that extracts structured data, a storage layer for raw pages and parsed results, and a deduplication mechanism to avoid processing the same page twice. Rate limiting, proxy rotation, and robots.txt compliance are also essential components.

How do you handle JavaScript-rendered pages in a web scraper?
JavaScript-rendered pages require a headless browser such as Puppeteer or Playwright instead of a plain HTTP fetcher. These tools execute JavaScript in a real browser engine and return the fully rendered DOM. Because headless browsers are resource-intensive, a common pattern is to use a fast HTML-only fetcher first and fall back to the headless browser only when the response is detected to be a JS-heavy SPA or when required content is missing from the raw HTML.

How do you design a web scraper to handle anti-scraping measures?
Anti-scraping countermeasures include IP-based rate limiting, CAPTCHAs, user-agent fingerprinting, and honeypot links. Mitigations include rotating residential or datacenter proxies, randomizing request headers and timing, using browser fingerprint spoofing, and integrating CAPTCHA-solving services when necessary. For sites like Airbnb or Amazon, scraping is often against the Terms of Service, so legitimate use cases rely on official APIs or data partnerships instead.

How do you store and process the data extracted by a web scraper at scale?
Raw HTML is typically stored in an object store (e.g., S3 or GCS) keyed by URL hash and crawl timestamp, enabling reprocessing without re-fetching. Parsed structured data flows into a streaming pipeline (e.g., Kafka) for downstream consumers. A columnar format like Parquet in a data lake supports analytical queries over large scrape datasets. Change detection diffs can be computed between successive crawls to identify updated content efficiently.
See also: Scale AI Interview Guide 2026: Data Infrastructure, RLHF Pipelines, and ML Engineering
See also: Airbnb Interview Guide 2026: Search Systems, Trust and Safety, and Full-Stack Engineering