Batch Processor Low-Level Design: Chunk Iteration, Checkpointing, and Retry Semantics

What Is a Batch Processor?

A batch processor reads a large dataset in discrete chunks, applies a transformation or computation to each chunk, and durably records progress so that failures can be resumed without reprocessing completed work. Use cases include nightly billing runs, ETL pipelines, report generation, and bulk notification sends. The core design challenge is balancing throughput against fault tolerance: large chunks maximize I/O efficiency but amplify the cost of a retry after failure.

Requirements

Functional Requirements

  • Process arbitrarily large datasets by reading configurable chunk sizes from the source.
  • Persist a checkpoint after each successful chunk so that a restart resumes from the last committed position.
  • Support at-least-once retry semantics: a failed chunk is retried up to a configurable maximum before being sent to a dead-letter store.
  • Report progress (chunks done, records processed, errors) in near-real time.
  • Support graceful shutdown that finishes the current chunk and flushes the checkpoint before exiting.

Non-Functional Requirements

  • Checkpoint writes must be durable; a crash after a commit must never lose that progress.
  • Chunk size should be tunable at runtime without code changes.
  • The processor should saturate I/O without overwhelming downstream systems.

Data Model

Job Record

  • job_id — UUID, primary key.
  • source_ref — pointer to input dataset (S3 URI, table name, Kafka topic + partition).
  • chunk_size — configurable integer, default 1000 records.
  • status — ENUM: PENDING, RUNNING, PAUSED, COMPLETE, FAILED.
  • created_at, updated_at.

Checkpoint Record

  • job_id — foreign key.
  • cursor — opaque string representing position in source (byte offset, primary key value, Kafka offset).
  • chunks_done, records_done, records_failed.
  • committed_at — timestamp of last durable write.
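
As a concrete sketch, the two records map naturally onto Python dataclasses. Field names follow the model above; the JobStatus enum and the defaults shown are illustrative, and updated_at is omitted for brevity:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Optional
import uuid

class JobStatus(Enum):
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    PAUSED = "PAUSED"
    COMPLETE = "COMPLETE"
    FAILED = "FAILED"

@dataclass
class JobRecord:
    source_ref: str                   # S3 URI, table name, or topic + partition
    chunk_size: int = 1000            # records per chunk, tunable per job
    status: JobStatus = JobStatus.PENDING
    job_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

@dataclass
class CheckpointRecord:
    job_id: str
    cursor: str = ""                  # opaque: byte offset, PK value, Kafka offset
    chunks_done: int = 0
    records_done: int = 0
    records_failed: int = 0
    committed_at: Optional[datetime] = None  # set on each durable commit
```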

Core Algorithm: Chunk Iteration Loop

Initialization

Load the job record. If a checkpoint exists, restore the cursor; otherwise start from the source beginning. Acquire an advisory lock on job_id to prevent concurrent runs.
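
A minimal sketch of the single-runner guarantee, using an in-process registry as a stand-in for a real advisory lock; in production this would be, for example, a PostgreSQL advisory lock (pg_try_advisory_lock) or a lease in a shared store:

```python
import threading

class JobLockRegistry:
    """In-process stand-in for an advisory lock: at most one runner per job_id."""

    def __init__(self):
        self._held = set()
        self._mu = threading.Lock()

    def try_acquire(self, job_id):
        """Return True if this runner now holds the job, False if already held."""
        with self._mu:
            if job_id in self._held:
                return False
            self._held.add(job_id)
            return True

    def release(self, job_id):
        """Release the job so another runner may pick it up."""
        with self._mu:
            self._held.discard(job_id)
```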

Chunk Fetch

Read up to chunk_size records starting at the current cursor. If the source returns fewer than chunk_size records, mark it as the final chunk. Record the next cursor position before processing begins.
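
The fetch step can be sketched with keyset-style pagination. Here source_rows is an in-memory stand-in for the source, assumed sorted by an integer id; a SQL source would instead run WHERE id > :cursor ORDER BY id LIMIT :chunk_size:

```python
def fetch_chunk(source_rows, cursor, chunk_size):
    """Return (records, next_cursor, is_final) for the chunk after `cursor`."""
    # Keyset pagination: take records strictly past the cursor, up to chunk_size.
    chunk = [r for r in source_rows if r["id"] > cursor][:chunk_size]
    # A short read means the source is exhausted: this is the final chunk.
    is_final = len(chunk) < chunk_size
    # The next cursor is the last key we will have processed.
    next_cursor = chunk[-1]["id"] if chunk else cursor
    return chunk, next_cursor, is_final
```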

Chunk Processing

Apply the business transformation to each record in the chunk. Collect per-record outcomes: success, transient error (retriable), or permanent error (dead-letter). Do not commit the checkpoint until the entire chunk is processed.
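
A sketch of per-record outcome bucketing; TransientError and PermanentError are illustrative exception types, not part of any fixed API:

```python
from enum import Enum

class TransientError(Exception):
    """Illustrative: a retriable failure (timeout, throttling)."""

class PermanentError(Exception):
    """Illustrative: an unrecoverable failure (malformed record)."""

class Outcome(Enum):
    SUCCESS = "success"
    TRANSIENT = "transient"   # retriable
    PERMANENT = "permanent"   # dead-letter

def process_chunk(records, transform):
    """Apply `transform` to every record and bucket each outcome.

    The caller commits the checkpoint only after the whole chunk
    has been walked.
    """
    outcomes = {o: [] for o in Outcome}
    for record in records:
        try:
            transform(record)
            outcomes[Outcome.SUCCESS].append(record)
        except TransientError:
            outcomes[Outcome.TRANSIENT].append(record)
        except PermanentError:
            outcomes[Outcome.PERMANENT].append(record)
    return outcomes
```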

Checkpoint Write

Write the new cursor, updated counters, and timestamp in a single atomic transaction. Only advance the cursor after a successful commit. This guarantees at-least-once processing: if the process crashes after processing a chunk but before committing, the chunk is retried on restart.
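
Putting fetch, process, and commit together, the loop might look like the following sketch. All four parameter names are illustrative and injected so the loop stays testable:

```python
def run_job(fetch, process, commit_checkpoint, start_cursor):
    """One pass of the chunk loop.

    fetch(cursor) -> (records, next_cursor, is_final); process(records)
    raises on chunk failure; commit_checkpoint(cursor) durably writes
    progress. The cursor only advances after a successful commit, so a
    crash between process() and commit replays the chunk (at-least-once).
    """
    cursor = start_cursor
    while True:
        records, next_cursor, is_final = fetch(cursor)
        if records:
            process(records)              # may raise; retry handled by caller
            commit_checkpoint(next_cursor)
            cursor = next_cursor          # advance only after durable commit
        if is_final:
            return cursor
```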

Retry Logic

  • On chunk failure, increment a retry counter stored alongside the checkpoint.
  • Apply exponential backoff: base delay doubled per attempt, capped at a maximum (e.g., 30 minutes).
  • After max retries, move the chunk range to the dead-letter table and advance the cursor past it.
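
A sketch of the backoff schedule and retry wrapper. For brevity the wrapper treats every exception as transient; real code would retry only retriable errors, and adding random jitter to each delay is a common refinement to keep workers from retrying in lockstep:

```python
def backoff_delay(attempt, base=1.0, cap=1800.0):
    """Delay in seconds before retry `attempt` (0-based): the base delay
    doubles per attempt, capped at 30 minutes by default."""
    return min(cap, base * (2 ** attempt))

def run_chunk_with_retry(process, chunk, max_attempts, dead_letter, sleep):
    """Retry a chunk up to max_attempts, then hand it to the dead-letter
    store so the cursor can advance past it.

    `sleep` is injected for testability; pass time.sleep in production.
    Returns True on success, False if the chunk was dead-lettered.
    """
    for attempt in range(max_attempts):
        try:
            process(chunk)
            return True
        except Exception:
            sleep(backoff_delay(attempt))
    dead_letter(chunk)
    return False
```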

API Design

  • POST /jobs — create a new batch job, returns job_id.
  • POST /jobs/{id}/start — enqueue for execution.
  • GET /jobs/{id}/status — returns current status, cursor, progress counters.
  • POST /jobs/{id}/pause — signals the worker to stop after the current chunk.
  • GET /jobs/{id}/dead-letter — paginated list of failed record ranges.

Scalability Considerations

For parallel batch processing, partition the source dataset into N non-overlapping ranges at job creation time. Each range becomes an independent sub-job with its own checkpoint. A coordinator tracks sub-job completion and marks the parent complete when all ranges finish. Use a work-stealing queue so faster workers pick up remaining ranges when slow workers fall behind. Limit parallelism via a semaphore to avoid overwhelming the downstream sink. Tune chunk size empirically: start at 1000, profile I/O wait vs. CPU time, and increase until throughput plateaus.
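
The range-partitioning step can be sketched as follows, assuming an integer key space with known minimum and maximum ids:

```python
def partition_ranges(min_id, max_id, n):
    """Split the inclusive key range [min_id, max_id] into n contiguous,
    non-overlapping sub-ranges; each becomes an independent sub-job with
    its own checkpoint. Earlier ranges absorb any remainder."""
    total = max_id - min_id + 1
    size, rem = divmod(total, n)
    ranges, lo = [], min_id
    for i in range(n):
        hi = lo + size - 1 + (1 if i < rem else 0)
        ranges.append((lo, hi))
        lo = hi + 1
    return ranges
```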

Progress Reporting and Observability

  • Emit a metric (Prometheus counter) on every chunk commit: records processed, duration, error count.
  • Write structured logs per chunk with job_id, chunk index, cursor, and latency.
  • Expose an estimated time remaining field on the status endpoint derived from throughput rate and remaining record count.
  • Alert when retry rate exceeds a threshold or when a job stalls without checkpoint progress for N minutes.
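
The estimated-time-remaining derivation is simple enough to show directly; this sketch assumes the total record count is known up front:

```python
def eta_seconds(records_done, records_total, elapsed_seconds):
    """Estimated time remaining from observed throughput.

    Returns None until at least one record has been processed, since
    there is no rate to extrapolate from yet.
    """
    if records_done <= 0 or elapsed_seconds <= 0:
        return None
    rate = records_done / elapsed_seconds          # records per second so far
    return (records_total - records_done) / rate   # seconds for the remainder
```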

Summary

A well-designed batch processor separates the concerns of iteration, checkpointing, retry, and reporting. Committing the checkpoint only after a successful chunk write ensures at-least-once semantics with bounded re-processing. Configurable chunk sizes, dead-letter handling, and partitioned parallelism make the processor adaptable to datasets of any scale.

