Question 1

How does a job queue differ from a messaging queue?

Accepted Answer

A messaging queue (SQS, RabbitMQ) models delivery: a message is enqueued, delivered to a consumer, and the consumer acknowledges receipt. The queue tracks delivery state but not execution state — once a message is acknowledged, the queue doesn't know whether processing succeeded. A job queue models execution: a job is enqueued, claimed by a worker, executed, and the result (success or failure) is recorded. The queue tracks the full lifecycle — scheduled, running, completed, failed, retried. Job queues also provide: job deduplication (idempotency keys), result storage, job chaining (job B starts after job A completes), scheduling (run this job at 3 AM), and observability (how many jobs failed in the last hour, average execution time per job type). Use a messaging queue when you need maximum throughput and don't need execution tracking; use a job queue when you need retry logic, result storage, or scheduling.
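The lifecycle tracking described above can be pictured as a small state machine. This is only a sketch: the state names follow the list in the answer, while the allowed transitions (in particular, treating a retry as a move from failed back to scheduled) and the `transition` helper are assumptions made for illustration.

```python
from enum import Enum


class Status(Enum):
    SCHEDULED = "scheduled"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"


# Hypothetical transition table: which states a job may move to next.
ALLOWED = {
    Status.SCHEDULED: {Status.RUNNING},
    Status.RUNNING: {Status.COMPLETED, Status.FAILED},
    Status.FAILED: {Status.SCHEDULED},  # a retry puts the job back in line
    Status.COMPLETED: set(),            # terminal state
}


def transition(current: Status, new: Status) -> Status:
    """Return the new status, or raise if the move is not allowed."""
    if new not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {new.value}")
    return new
```

A messaging queue has no equivalent of this table: after the acknowledgement there is no further state to move through.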

Question 2

How does the heartbeat pattern prevent ghost jobs without a fixed execution timeout?

Accepted Answer

A fixed execution timeout (say 30 minutes) breaks jobs that legitimately run longer: a nightly report that takes 45 minutes would be killed and retried on every attempt and would never finish. The heartbeat pattern replaces the runtime cap with a liveness check. While a job runs, its worker sends a keep-alive every N seconds (UPDATE Job SET started_at = NOW() WHERE id = :job_id). The cleanup job treats a job as a ghost when status = 'running' AND started_at < NOW() - 2 * heartbeat_interval. If the worker is alive, started_at is refreshed on every heartbeat and never crosses this threshold. If the worker crashes, heartbeats stop, started_at falls behind, and the cleanup job resets the job to pending. With a 10-second heartbeat the threshold is crossed 20 seconds after the last heartbeat, so a crashed worker's job is re-queued within about 30 seconds, assuming the cleanup job itself runs at least every 10 seconds. That is fast enough for most use cases without a fixed runtime cap. Note that each heartbeat overwrites started_at, so the column ends up meaning "last seen alive"; keep a separate heartbeat column if the original start time is also needed.
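A minimal sketch of the heartbeat and ghost-detection logic, using an in-memory SQLite table so it runs on its own. Several things here are assumptions rather than part of the original design: the `heartbeat_at` column (the answer refreshes `started_at` instead), the helper names `make_db`, `heartbeat`, and `reap_ghosts`, and passing `now` explicitly as a number of seconds so the behaviour is easy to test.

```python
import sqlite3

HEARTBEAT_INTERVAL = 10                   # seconds between keep-alive updates
GHOST_THRESHOLD = 2 * HEARTBEAT_INTERVAL  # silence longer than this means "ghost"


def make_db() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE Job ("
        " id INTEGER PRIMARY KEY,"
        " status TEXT NOT NULL,"
        " heartbeat_at REAL)"  # assumed column: last keep-alive, in seconds
    )
    return db


def heartbeat(db: sqlite3.Connection, job_id: int, now: float) -> None:
    """Called by the worker every HEARTBEAT_INTERVAL seconds while the job runs."""
    db.execute(
        "UPDATE Job SET heartbeat_at = ? WHERE id = ? AND status = 'running'",
        (now, job_id),
    )


def reap_ghosts(db: sqlite3.Connection, now: float) -> int:
    """Reset running jobs whose last heartbeat is too old; return how many."""
    cur = db.execute(
        "UPDATE Job SET status = 'pending', heartbeat_at = NULL "
        "WHERE status = 'running' AND heartbeat_at < ?",
        (now - GHOST_THRESHOLD,),
    )
    return cur.rowcount
```

A job that keeps calling `heartbeat` can run for hours without being touched, while a job whose worker died is returned to pending on the first `reap_ghosts` call after the threshold passes.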

Question 3

How do you implement job priorities without starvation of low-priority jobs?

Accepted Answer

A pure priority queue always processes the highest-priority job first. If high-priority jobs arrive continuously, low-priority jobs starve: they are never executed. Solutions: (1) Aging: a background job increments a waiting_boost column every 5 minutes for pending jobs, and the worker orders by (priority + waiting_boost) DESC. At one point per sweep, a low-priority job that has waited 4 hours has gained 48 points, enough to outrank newly arrived medium-priority jobs on most priority scales. (2) Dedicated queues: separate queues per priority level (queue_name: "critical", "normal", "batch"), with worker capacity split between them (for example 50% critical, 40% normal, 10% batch). Each worker polls its own queue first and helps the other queues only when its own is empty. If every worker instead polled "critical" first, then "normal", then "batch", a busy critical queue would still starve the rest; reserving workers is what guarantees batch jobs roughly 10% of capacity regardless of critical queue depth. (3) Weighted fair queuing, in its simple weighted round-robin form: process up to K high-priority jobs, then up to N normal-priority jobs, then up to M low-priority jobs, and repeat, skipping any level whose queue is empty. This lives in the worker's poll logic.
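Two of these approaches are small enough to sketch. The block below is illustrative only and rests on assumptions: the aging function derives the boost from how long the job has waited instead of having a background job increment a stored waiting_boost column (same ordering, computed at read time), and the names `Job`, `effective_priority`, `next_job`, and `weighted_poll` are hypothetical.

```python
from collections import deque
from dataclasses import dataclass

BOOST_PERIOD = 300  # seconds: one boost point per 5 minutes of waiting


@dataclass
class Job:
    id: int
    priority: int       # higher runs first
    enqueued_at: float  # seconds


def effective_priority(job: Job, now: float) -> int:
    """Aging: base priority plus one point per BOOST_PERIOD spent waiting."""
    waiting_boost = int((now - job.enqueued_at) // BOOST_PERIOD)
    return job.priority + waiting_boost


def next_job(pending: list, now: float):
    """Pick the job with the highest effective priority; oldest wins ties."""
    return max(
        pending,
        key=lambda j: (effective_priority(j, now), -j.enqueued_at),
        default=None,
    )


def weighted_poll(queues: dict, weights: dict):
    """Weighted round-robin: per cycle, take up to weights[name] jobs per queue."""
    while any(queues.values()):
        for name, quota in weights.items():
            q = queues[name]
            for _ in range(quota):
                if not q:
                    break  # this level is empty, move on to the next one
                yield q.popleft()
```

With weights such as `{"critical": 2, "normal": 1, "batch": 1}`, a batch job is served at least once per cycle no matter how deep the critical queue is, which is the property that prevents starvation.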

Question 4

How do you implement job chaining and fan-out patterns?

Accepted Answer

Job chaining (job B starts after job A): store parent_job_id on job B. Job B is enqueued with parent_job_id = A_id and run_at far in the future, so no worker picks it up until job A releases it. When job A completes, run UPDATE Job SET run_at = NOW() WHERE parent_job_id = A_id AND status = 'pending' to make its dependents eligible. Fan-out (one job spawns many child jobs): job A runs, creates 100 child jobs (each processing one chunk), and creates a "sentinel" job that depends on all 100. Each time the sentinel runs it checks SELECT COUNT(*) FROM Job WHERE id IN (child_ids) AND status != 'completed'. If the count is above zero, the sentinel reschedules itself for a later run_at; once it reaches zero, the sentinel aggregates the results. A child that fails permanently never reaches completed, so the sentinel also needs a rule for that case, such as failing the whole fan-out. For complex fan-out, use a workflow orchestrator pattern instead: a WorkflowJob table tracks the overall state, and each step transition is managed by the orchestrator rather than by polling parent_job_id.
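The chaining release and the sentinel check can be sketched against an in-memory SQLite table. This is an illustration under assumptions, not the article's schema: the columns are reduced to the four the logic needs, timestamps are plain seconds, and the helpers `make_db`, `enqueue`, `complete`, and `unfinished_children` are hypothetical. The sentinel check matches children by id, since the children are the rows listed in child_ids.

```python
import sqlite3

FAR_FUTURE = 4102444800.0  # placeholder run_at around the year 2100: "not released yet"


def make_db() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE Job ("
        " id INTEGER PRIMARY KEY,"
        " parent_job_id INTEGER,"
        " status TEXT NOT NULL DEFAULT 'pending',"
        " run_at REAL NOT NULL)"
    )
    return db


def enqueue(db, run_at, parent_job_id=None) -> int:
    cur = db.execute(
        "INSERT INTO Job (parent_job_id, run_at) VALUES (?, ?)",
        (parent_job_id, run_at),
    )
    return cur.lastrowid


def complete(db, job_id, now) -> None:
    """Mark a job completed and release the pending jobs chained behind it."""
    db.execute("UPDATE Job SET status = 'completed' WHERE id = ?", (job_id,))
    db.execute(
        "UPDATE Job SET run_at = ? WHERE parent_job_id = ? AND status = 'pending'",
        (now, job_id),
    )


def unfinished_children(db, child_ids) -> int:
    """Sentinel check: how many fan-out children are not completed yet."""
    placeholders = ",".join("?" * len(child_ids))
    row = db.execute(
        f"SELECT COUNT(*) FROM Job WHERE id IN ({placeholders}) "
        "AND status != 'completed'",
        list(child_ids),
    ).fetchone()
    return row[0]
```

In a real system the two statements in `complete` would run in one transaction, so a crash between them cannot leave a dependent job parked in the far future.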

Question 5

What is the difference between run_at scheduling and recurring jobs?

Accepted Answer

run_at scheduling: a one-time execution at a specific future time. enqueue(..., run_at=tomorrow_9am) creates one job row; it executes once, when a worker polls and finds run_at <= NOW(). Use for: sending a reminder email in 3 days, retrying a failed payment tomorrow, generating a one-off report on the 1st of next month. Recurring jobs: defined by a cron expression; a scheduler mints a new job instance on each tick. The RecurringJob table stores the template, and the scheduler calls enqueue() at each scheduled time, creating a new Job row each time. Use for: nightly database cleanup, hourly data sync, daily digest email. Key difference: a run_at job is finite (one execution), while a recurring job is indefinite (it keeps running until RecurringJob.is_active is set to FALSE). Anti-pattern: implementing recurring behavior by having a job re-enqueue itself on completion. This creates a chain of single-use jobs with no central control point, and if one run dies before it re-enqueues, the chain ends silently. Use RecurringJob instead for cleaner observability and easier pause/resume control.
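A rough sketch of the scheduler tick that turns RecurringJob templates into one-time jobs. To stay dependency-free it uses a fixed interval in seconds as a stand-in for the cron expression; that substitution, the `next_run_at` field, the `tick` function, and the shape of the `enqueue` callback are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RecurringJob:
    name: str
    interval_seconds: int  # stand-in for a cron expression
    next_run_at: float     # seconds
    is_active: bool = True


def tick(recurring_jobs: list, now: float, enqueue: Callable) -> int:
    """Mint one Job per active template that is due; return how many were minted."""
    minted = 0
    for rj in recurring_jobs:
        if not rj.is_active or rj.next_run_at > now:
            continue
        # Each minted instance is an ordinary one-time run_at job.
        enqueue(rj.name, run_at=rj.next_run_at)
        minted += 1
        # Advance past `now` so a late scheduler does not mint a backlog.
        while rj.next_run_at <= now:
            rj.next_run_at += rj.interval_seconds
    return minted
```

Pausing a recurring job is a single flag flip on the template (`is_active = False`); nothing has to be hunted down in the Job table, which is the control point the self-re-enqueueing anti-pattern lacks.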

Job Queue System Low-Level Design: Worker Pool, Retry Backoff, Heartbeat, and Recurring Jobs

Job Queue System: Low-Level Design

Core Data Model

Job Enqueue and Dedup

Worker: Claim and Execute

Recurring Job Scheduler

Key Design Decisions