Low Level Design: Web Scraper Service

What Is a Web Scraper Service?

A Web Scraper Service is the component responsible for fetching raw HTML from target URLs, executing JavaScript if needed, extracting structured data, and persisting results for downstream consumers. It differs from a general crawler in that it targets specific schemas on known sites rather than discovering the open web. The design must handle rate limiting, dynamic rendering, anti-bot measures, and schema evolution across many target sites simultaneously.

Data Model / Schema

-- Scrape targets configuration
CREATE TABLE scrape_targets (
    target_id    INT PRIMARY KEY AUTO_INCREMENT,
    name         VARCHAR(128) NOT NULL,
    base_url     TEXT NOT NULL,
    extractor_fn VARCHAR(128) NOT NULL,  -- name of parser class/function
    rate_limit   INT DEFAULT 1,          -- requests per second
    js_render    BOOLEAN DEFAULT FALSE,
    active       BOOLEAN DEFAULT TRUE
);

-- Scrape job queue
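CREATE TABLE scrape_jobs (
    job_id       BIGINT PRIMARY KEY AUTO_INCREMENT,
    target_id    INT,
    url          TEXT NOT NULL,
    status       ENUM('queued', 'running', 'done', 'failed', 'blocked') NOT NULL DEFAULT 'queued',
    attempt      SMALLINT DEFAULT 0,
    worker_id    VARCHAR(64),
    created_at   TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    -- refreshed on every write to the row; the watchdog reads it as a heartbeat
    updated_at   TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
    -- explicit constraint: some MySQL versions parse but ignore inline REFERENCES
    FOREIGN KEY (target_id) REFERENCES scrape_targets(target_id)
);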

-- Extracted structured results
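CREATE TABLE scrape_results (
    result_id    BIGINT PRIMARY KEY AUTO_INCREMENT,
    job_id       BIGINT,
    target_id    INT,
    url          TEXT,
    payload      JSON,
    schema_ver   SMALLINT,                -- extractor/schema version that produced the payload
    scraped_at   TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    -- explicit constraint: some MySQL versions parse but ignore inline REFERENCES
    FOREIGN KEY (job_id) REFERENCES scrape_jobs(job_id)
);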
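
To illustrate how a worker might take ownership of a row in scrape_jobs, here is a minimal sketch of an atomic claim. It uses Python's built-in sqlite3 module with a simplified stand-in table so that it runs anywhere; the SCHEMA constant and the claim_next_job helper are assumptions for illustration, not part of the design above.

```python
import sqlite3

# Simplified SQLite stand-in for scrape_jobs (assumption: the real table is MySQL).
SCHEMA = """
CREATE TABLE scrape_jobs (
    job_id     INTEGER PRIMARY KEY AUTOINCREMENT,
    url        TEXT NOT NULL,
    status     TEXT NOT NULL DEFAULT 'queued',
    attempt    INTEGER DEFAULT 0,
    worker_id  TEXT,
    updated_at TIMESTAMP
);
"""


def claim_next_job(conn, worker_id):
    """Claim the oldest queued job for worker_id; return (job_id, url) or None."""
    with conn:  # commits on success, rolls back if an exception is raised
        row = conn.execute(
            "SELECT job_id, url FROM scrape_jobs "
            "WHERE status = 'queued' ORDER BY job_id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        # Guard on status so a concurrent claimer cannot take the same row.
        cur = conn.execute(
            "UPDATE scrape_jobs SET status = 'running', worker_id = ?, "
            "attempt = attempt + 1, updated_at = CURRENT_TIMESTAMP "
            "WHERE job_id = ? AND status = 'queued'",
            (worker_id, row[0]),
        )
        return tuple(row) if cur.rowcount == 1 else None
```

The status guard in the UPDATE is what makes the claim safe: if another worker got there first, no row is updated and the caller simply tries again. On MySQL 8.0 or later, SELECT ... FOR UPDATE SKIP LOCKED is a common alternative for the same purpose.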

Core Algorithm and Workflow

The service is organized into three layers:

  1. Fetch Layer: Workers dequeue jobs from scrape_jobs (or a Kafka topic). Each worker maintains a per-target token bucket for rate limiting. Static pages are fetched via an HTTP client pool. JavaScript-heavy pages are sent to a headless browser pool (Playwright or Puppeteer workers) that return rendered HTML. Responses are streamed to avoid holding large payloads in memory.
  2. Extract Layer: Raw HTML is passed to the target-specific extractor function identified by extractor_fn. Extractors use CSS selectors or XPath patterns to pull structured fields. Results are validated against a JSON Schema before being accepted. A result that fails validation is rejected, and the job is retried and flagged for manual review.
  3. Persist and Emit Layer: Validated payloads are written to scrape_results and also published to a downstream Kafka topic partitioned by target_id. Consumers (data warehouse loaders, API caches) subscribe independently. Deduplication is enforced by hashing the payload and checking for existing records with the same URL and hash within a 24-hour window.
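
The per-target token bucket used by the fetch layer can be sketched as follows. This is a minimal, single-process illustration: the class names TokenBucket and PerTargetLimiter, the default burst capacity, and the injectable clock are assumptions made for the example. If the limit must hold across many workers, the bucket state would have to live in a shared store instead of in worker memory.

```python
import time


class TokenBucket:
    """Refills at `rate` tokens per second, up to `capacity` tokens."""

    def __init__(self, rate, capacity=None, clock=time.monotonic):
        self.rate = float(rate)
        # Assumption: allow a burst of up to one second of traffic by default.
        self.capacity = float(capacity) if capacity is not None else max(1.0, self.rate)
        self.clock = clock
        self.tokens = self.capacity
        self.last = clock()

    def try_acquire(self, n=1.0):
        """Take n tokens if available; return False without blocking otherwise."""
        now = self.clock()
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False


class PerTargetLimiter:
    """Keeps one bucket per target_id, created lazily from the target's rate_limit."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.buckets = {}

    def allow(self, target_id, rate_limit):
        bucket = self.buckets.get(target_id)
        if bucket is None:
            bucket = TokenBucket(rate_limit, clock=self.clock)
            self.buckets[target_id] = bucket
        return bucket.try_acquire()
```

A worker would call allow(target_id, rate_limit) before each fetch and requeue or briefly sleep when it returns False. Passing a fake clock makes the refill behaviour easy to unit test.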

Failure Handling

  • Network errors and timeouts: Jobs retry with exponential backoff up to a configurable maximum (e.g., 5 attempts). After exhausting its retries, a job moves to failed; an alert fires if the failure rate for its target exceeds a threshold.
  • Anti-bot detection: If a response returns a CAPTCHA page or a known block pattern, the worker marks the job as blocked and pauses all jobs for that target. A human operator or an automatic IP rotation mechanism must intervene before the target's jobs resume.
  • Schema drift: Extractor functions are versioned. If the extracted field count drops below an expected minimum, the result is rejected and a schema-drift alert fires, prompting an extractor update. Old results remain queryable via schema_ver.
  • Worker crash: Jobs held by a crashed worker stay in running with a stale updated_at timestamp; a watchdog resets them to queued once updated_at is older than the heartbeat timeout.
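
The retry and watchdog rules above can be sketched in a few small functions. This is an illustration only: the function names, the base and cap values, the use of random jitter, and the dict-shaped job rows are assumptions; the text specifies only exponential backoff, a configurable maximum, and a heartbeat timeout.

```python
import random
from datetime import datetime, timedelta

MAX_ATTEMPTS = 5  # the example maximum from the text


def backoff_delay(attempt, base=1.0, cap=60.0, rng=random.random):
    """Seconds to wait before the next try: a random value below
    min(cap, base * 2**attempt). The jitter is an assumption; it spreads
    retries out so that failed jobs do not all retry at the same moment."""
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling


def next_status(attempt, max_attempts=MAX_ATTEMPTS):
    """After `attempt` failed tries, requeue the job or give up."""
    return "queued" if attempt < max_attempts else "failed"


def stale_job_ids(jobs, now, heartbeat_timeout):
    """IDs of running jobs whose updated_at is older than the timeout.
    The watchdog would reset these to 'queued'."""
    return [
        job["job_id"]
        for job in jobs
        if job["status"] == "running" and now - job["updated_at"] > heartbeat_timeout
    ]
```

For example, stale_job_ids(rows, datetime.now(), timedelta(minutes=5)) would return the jobs a watchdog should requeue, assuming workers touch updated_at at least once every five minutes while a job is running.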

Scalability Considerations

  • Horizontal worker scaling: Fetch workers are stateless and scale independently. The job queue acts as the buffer. Workers for JS rendering are more expensive; scale those separately based on queue depth for JS-flagged jobs.
  • Proxy rotation: Route outbound requests through a proxy pool partitioned by target domain to distribute egress IPs and reduce block rates. Track ban events per proxy-domain pair and evict banned proxies automatically.
  • Extractor isolation: Run extractor functions in sandboxed subprocesses to prevent a buggy or malicious extractor from crashing the worker. Use resource limits (CPU, memory, timeout) enforced by the OS.
  • Result storage tiering: Keep recent results (last 30 days) in a hot relational store for fast lookup. Archive older results to columnar storage (Parquet on S3) for analytics queries.
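
As one possible shape for the proxy rotation described above, the sketch below rotates proxies per domain and tracks ban events per proxy-domain pair. The ProxyPool class, its method names, and the ban_threshold parameter are hypothetical, and a real pool would probably also expire old ban events over time.

```python
class ProxyPool:
    """Round-robin proxy selection per domain, with per-(proxy, domain) ban counts."""

    def __init__(self, proxies, ban_threshold=3):
        self.proxies = list(proxies)
        self.ban_threshold = ban_threshold
        self.ban_events = {}  # (proxy, domain) -> number of observed bans
        self.cursor = {}      # domain -> index of the next proxy to try

    def is_banned(self, proxy, domain):
        return self.ban_events.get((proxy, domain), 0) >= self.ban_threshold

    def record_ban(self, proxy, domain):
        """Call when a response through `proxy` looks like a block for `domain`."""
        key = (proxy, domain)
        self.ban_events[key] = self.ban_events.get(key, 0) + 1

    def pick(self, domain):
        """Return the next proxy not banned for this domain, or None if all are banned."""
        n = len(self.proxies)
        start = self.cursor.get(domain, 0)
        for offset in range(n):
            idx = (start + offset) % n
            proxy = self.proxies[idx]
            if not self.is_banned(proxy, domain):
                self.cursor[domain] = (idx + 1) % n
                return proxy
        return None
```

Because bans are keyed by the proxy-domain pair, a proxy evicted for one target stays usable for the others, which keeps the pool from shrinking faster than necessary.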

Summary

A production Web Scraper Service is more than an HTTP client with regex. The key design decisions are separating fetch, extract, and persist into independent layers; versioning extractors to handle schema drift; enforcing per-target rate limiting via token buckets; and building robust failure handling for blocks, timeouts, and validation errors. The stateless worker model makes horizontal scaling straightforward, with the job queue and proxy pool as the primary operational concerns.
