Async Job System Low-Level Design: Queue Selection, Worker Lifecycle, and Observability

What Is an Async Job System?

An async job system decouples work submission from work execution. A caller enqueues a job and receives an acknowledgment immediately; one or more workers pick up that job and execute it in the background. This pattern is fundamental to web services that need to offload slow operations — image processing, email delivery, report generation, third-party API calls — without blocking the request thread. The design must address queue selection, worker lifecycle, job state management, and operational visibility.

Requirements

Functional Requirements

  • Callers submit jobs with a type, payload, and optional scheduling parameters (delay, priority).
  • Workers claim jobs, execute them, and report outcomes back to the system.
  • Failed jobs are retried with configurable backoff up to a maximum attempt count.
  • Callers can query job status and retrieve results or error details.
  • Workers support graceful shutdown: finish the current job before exiting.

Non-Functional Requirements

  • At-least-once delivery: a job must never be silently dropped.
  • Job throughput must scale horizontally by adding workers without system changes.
  • Observability: p95 job latency, queue depth, and error rate must be monitorable.

Data Model

Job Record

  • job_id — UUID v4, primary key.
  • job_type — string identifier mapping to a registered handler.
  • payload — JSON blob, max 64 KB; large inputs stored in object storage with a reference.
  • status — ENUM: PENDING, RUNNING, COMPLETE, FAILED, DEAD.
  • attempt_count, max_attempts.
  • visible_at — future timestamp for delayed/retried jobs; workers only claim jobs where visible_at <= NOW().
  • locked_until — heartbeat-based visibility timeout; prevents two workers from executing the same job.
  • result, error_message, created_at, updated_at.

Queue Selection

The choice of backing queue depends on throughput, ordering, and durability requirements:

  • Database-backed queue (PostgreSQL SKIP LOCKED): simplest operational model, ACID durability, suits up to ~1000 jobs/sec. Use SELECT ... FOR UPDATE SKIP LOCKED to atomically claim jobs.
  • Redis (BRPOPLPUSH / streams): sub-millisecond enqueue latency, suitable for high-frequency lightweight jobs. Use Redis Streams with consumer groups for at-least-once delivery.
  • Message broker (SQS, RabbitMQ, Kafka): purpose-built for async messaging, provides visibility timeout, dead-letter queues, and fan-out natively. Best for cross-service job dispatch.

Worker Lifecycle

Startup

Register the worker instance (ID, hostname, supported job types) in a workers table. Begin polling the queue for claimable jobs matching registered types.

Job Claim

Atomically set job status to RUNNING, record worker_id, and set locked_until = NOW() + visibility_timeout. Only one worker can claim a given job due to the atomic update.

Heartbeat

While executing, the worker periodically extends locked_until. If the worker crashes, the lock expires and a supervisor or another worker can reclaim the job after the timeout elapses.

Completion

On success: set status to COMPLETE, write result. On retriable failure: increment attempt_count, set visible_at to NOW() + backoff, reset status to PENDING. On permanent failure (max attempts reached): set status to DEAD and write to dead-letter log.

Graceful Shutdown

On SIGTERM, stop claiming new jobs. Finish the current job. Deregister from the workers table. Exit cleanly. Kubernetes readiness probes should fail immediately on SIGTERM receipt to stop new traffic.

API Design

  • POST /jobs — enqueue a job; returns job_id and estimated start time.
  • GET /jobs/{id} — returns status, attempt_count, result or error.
  • DELETE /jobs/{id} — cancel a PENDING job.
  • GET /jobs?type=X&status=FAILED — filtered listing for operators.
  • POST /jobs/{id}/retry — manually re-enqueue a DEAD job.

Observability

  • Emit counters: jobs enqueued, jobs completed, jobs failed, jobs dead-lettered — labelled by job_type.
  • Emit histograms: job execution duration, queue wait time (time from PENDING to RUNNING).
  • Alert on queue depth exceeding a threshold for more than N minutes, indicating worker starvation.
  • Structured logs per job execution with job_id, type, attempt, duration, and outcome enable per-job audit trails.

Scalability Considerations

Scale workers horizontally; the queue acts as a natural load balancer. Use priority queues or separate queues per job type to prevent low-priority bulk jobs from starving latency-sensitive ones. Auto-scale worker pools based on queue depth metrics via KEDA or a similar operator. For very high throughput, shard the job table by job_type and use connection pooling (PgBouncer) to prevent database saturation from polling workers.

Summary

An async job system requires careful attention to atomicity in job claiming, heartbeat-based lock extension for crash recovery, and structured retry with backoff. Queue selection should match the operational complexity budget and throughput target. Rich observability — counters, histograms, and depth alerts — makes the system maintainable as job volume grows.

See also: Netflix Interview Guide 2026: Streaming Architecture, Recommendation Systems, and Engineering Excellence

See also: Scale AI Interview Guide 2026: Data Infrastructure, RLHF Pipelines, and ML Engineering

See also: Atlassian Interview Guide

Scroll to Top