What Is a Web Scraper Service?
A Web Scraper Service is the component responsible for fetching raw HTML from target URLs, executing JavaScript if needed, extracting structured data, and persisting results for downstream consumers. It differs from a general crawler in that it targets specific schemas on known sites rather than discovering the open web. The design must handle rate limiting, dynamic rendering, anti-bot measures, and schema evolution across many target sites simultaneously.
Data Model / Schema
-- Scrape targets configuration
CREATE TABLE scrape_targets (
target_id INT PRIMARY KEY AUTO_INCREMENT,
name VARCHAR(128) NOT NULL,
base_url TEXT NOT NULL,
extractor_fn VARCHAR(128) NOT NULL, -- name of parser class/function
rate_limit INT DEFAULT 1, -- requests per second
js_render BOOLEAN DEFAULT FALSE,
active BOOLEAN DEFAULT TRUE
);
-- Scrape job queue
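CREATE TABLE scrape_jobs (
job_id BIGINT PRIMARY KEY AUTO_INCREMENT,
target_id INT,
url TEXT NOT NULL,
status ENUM('queued', 'running', 'done', 'failed', 'blocked') NOT NULL DEFAULT 'queued',
attempt SMALLINT DEFAULT 0,
worker_id VARCHAR(64),
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP, -- doubles as the worker heartbeat
-- MySQL parses but ignores inline column-level REFERENCES, so the key is declared explicitly
FOREIGN KEY (target_id) REFERENCES scrape_targets(target_id)
);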
-- Extracted structured results
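CREATE TABLE scrape_results (
result_id BIGINT PRIMARY KEY AUTO_INCREMENT,
job_id BIGINT,
target_id INT,
url TEXT,
payload JSON,
schema_ver SMALLINT, -- extractor/schema version that produced the payload
scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
-- MySQL parses but ignores inline column-level REFERENCES, so the key is declared explicitly
FOREIGN KEY (job_id) REFERENCES scrape_jobs(job_id)
);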
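To make the role of the status, attempt, and worker_id columns concrete, here is a minimal sketch of how a worker might claim a job. It uses Python's built-in sqlite3 module as a stand-in for the real database, and the function name claim_next_job is hypothetical. On MySQL 8.0 the same idea is usually expressed with SELECT ... FOR UPDATE SKIP LOCKED so that concurrent workers do not block each other.

```python
import sqlite3
from typing import Optional, Tuple


def claim_next_job(conn: sqlite3.Connection, worker_id: str) -> Optional[Tuple[int, str]]:
    """Claim the oldest queued job for worker_id; return (job_id, url) or None.

    Hypothetical helper: assumes a scrape_jobs table with the columns
    job_id, url, status, attempt, worker_id, and updated_at.
    """
    with conn:  # commits on success, rolls back if an exception is raised
        row = conn.execute(
            "SELECT job_id, url FROM scrape_jobs "
            "WHERE status = 'queued' ORDER BY job_id LIMIT 1"
        ).fetchone()
        if row is None:
            return None  # queue is empty
        job_id, url = row
        # The status guard turns the claim into a no-op if another worker
        # changed the row between the SELECT and the UPDATE.
        cur = conn.execute(
            "UPDATE scrape_jobs "
            "SET status = 'running', worker_id = ?, attempt = attempt + 1, "
            "    updated_at = CURRENT_TIMESTAMP "
            "WHERE job_id = ? AND status = 'queued'",
            (worker_id, job_id),
        )
        return (job_id, url) if cur.rowcount == 1 else None
```

A worker would call this in a loop and sleep briefly whenever it returns None.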
Core Algorithm and Workflow
The service is organized into three layers:
- Fetch Layer: Workers dequeue jobs from scrape_jobs (or a Kafka topic). Each worker maintains a per-target token bucket for rate limiting. Static pages are fetched via an HTTP client pool. JavaScript-heavy pages are sent to a headless browser pool (Playwright or Puppeteer workers) that return rendered HTML. Responses are streamed to avoid holding large payloads in memory.
- Extract Layer: Raw HTML is passed to the target-specific extractor function identified by extractor_fn. Extractors use CSS selectors or XPath patterns to pull structured fields. Results are validated against a JSON Schema before being accepted. Invalid results trigger a retry with a flag for manual review.
- Persist and Emit Layer: Validated payloads are written to scrape_results and also published to a downstream Kafka topic partitioned by target_id. Consumers (data warehouse loaders, API caches) subscribe independently. Deduplication is enforced by hashing the payload and checking for existing records with the same URL and hash within a 24-hour window.
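Two pieces of this workflow are small enough to sketch directly: the per-target token bucket from the fetch layer and the payload hash used for deduplication in the persist layer. The class and function names below (TokenBucket, payload_hash) are illustrative, and the choice of SHA-256 over canonical JSON is an assumption, since the article does not name a hash function.

```python
import hashlib
import json
import time
from typing import Any, Callable


class TokenBucket:
    """Allow `rate` requests per second on average, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock          # injectable so the bucket can be tested
        self.tokens = capacity      # start full
        self.last = clock()

    def try_acquire(self) -> bool:
        """Take one token if available; return False if the caller must wait."""
        now = self.clock()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def payload_hash(payload: Any) -> str:
    """Hash a JSON-serializable payload so that key order does not matter."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

A worker would keep one bucket per target_id, sized from the rate_limit column, and skip a write when a row with the same URL and payload_hash value already exists inside the deduplication window.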
Failure Handling
- Network errors and timeouts: Jobs retry with exponential backoff up to a configurable maximum (e.g., 5 attempts). After exhausting retries, jobs move to failed and trigger an alert if the failure rate for a target exceeds a threshold.
- Anti-bot detection: If a response returns a CAPTCHA page or a known block pattern, the worker marks the job as blocked and pauses all jobs for that target. A human operator or an automatic IP rotation mechanism intervenes before resuming.
- Schema drift: Extractor functions are versioned. If the extracted field count drops below an expected minimum, the result is rejected and a schema-drift alert fires, prompting an extractor update. Old results remain queryable via schema_ver.
- Worker crash: Jobs assigned to a crashed worker are left with a stale updated_at timestamp; a watchdog resets them to queued after a heartbeat timeout.
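The retry and watchdog rules above reduce to a few small decisions. The sketch below shows one way to express them; the function names and the default values (1-second base delay, 60-second cap) are assumptions for illustration, not values taken from the design. In practice a random jitter is often added to the delay so that retries from many workers do not line up.

```python
from datetime import datetime, timedelta


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Seconds to wait before the next retry: base * 2**attempt, capped at `cap`.

    `attempt` is the number of attempts already made (0 for the first retry).
    """
    return min(cap, base * (2 ** attempt))


def next_status(attempt: int, max_attempts: int = 5) -> str:
    """Requeue the job until the attempt budget is spent, then mark it failed."""
    return "queued" if attempt < max_attempts else "failed"


def is_stale(updated_at: datetime, now: datetime,
             heartbeat_timeout: timedelta) -> bool:
    """True if a running job has not been touched within the heartbeat timeout."""
    return now - updated_at > heartbeat_timeout
```

A watchdog process would periodically scan running jobs and reset to queued every job for which is_stale returns True.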
Scalability Considerations
- Horizontal worker scaling: Fetch workers are stateless and scale independently. The job queue acts as the buffer. Workers for JS rendering are more expensive; scale those separately based on queue depth for JS-flagged jobs.
- Proxy rotation: Route outbound requests through a proxy pool partitioned by target domain to distribute egress IPs and reduce block rates. Track ban events per proxy-domain pair and evict banned proxies automatically.
- Extractor isolation: Run extractor functions in sandboxed subprocesses to prevent a buggy or malicious extractor from crashing the worker. Use resource limits (CPU, memory, timeout) enforced by the OS.
- Result storage tiering: Keep recent results (last 30 days) in a hot relational store for fast lookup. Archive older results to columnar storage (Parquet on S3) for analytics queries.
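As a rough illustration of the proxy rotation point, the sketch below tracks ban events per proxy-domain pair and stops handing out a proxy for a domain once it crosses a threshold. The ProxyPool class, the round-robin selection, and the ban_threshold default are all assumptions; a real pool would also need health checks and a way to readmit proxies after a cooldown.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


class ProxyPool:
    """Round-robin proxy selection per domain, with per-domain eviction."""

    def __init__(self, proxies: List[str], ban_threshold: int = 3):
        self.proxies = list(proxies)
        self.ban_threshold = ban_threshold
        # Ban events counted per (proxy, domain) pair.
        self.ban_events: Dict[Tuple[str, str], int] = defaultdict(int)
        # Proxies no longer used for a given domain.
        self.evicted: Dict[str, Set[str]] = defaultdict(set)
        # Independent rotation position for each domain.
        self._cursor: Dict[str, int] = defaultdict(int)

    def pick(self, domain: str) -> str:
        """Return the next usable proxy for this domain."""
        usable = [p for p in self.proxies if p not in self.evicted[domain]]
        if not usable:
            raise RuntimeError(f"no usable proxies left for {domain}")
        proxy = usable[self._cursor[domain] % len(usable)]
        self._cursor[domain] += 1
        return proxy

    def report_ban(self, proxy: str, domain: str) -> None:
        """Record a ban event; evict the proxy for this domain at the threshold."""
        self.ban_events[(proxy, domain)] += 1
        if self.ban_events[(proxy, domain)] >= self.ban_threshold:
            self.evicted[domain].add(proxy)
```

Evicting per domain rather than globally keeps a proxy available for targets that have not blocked it.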
Summary
A production Web Scraper Service is more than an HTTP client with regex. The key design decisions are separating fetch, extract, and persist into independent layers; versioning extractors to handle schema drift; enforcing per-target rate limiting via token buckets; and building robust failure handling for blocks, timeouts, and validation errors. The stateless worker model makes horizontal scaling straightforward, with the job queue and proxy pool as the primary operational concerns.