Question 1

How does a Redis sorted set work as a task scheduler queue?

Accepted Answer

A Redis sorted set (ZADD) stores tasks with their scheduled unix timestamp as the score. The scheduler polls with ZRANGEBYSCORE queue 0 {now} LIMIT 0 10 to fetch tasks due now or in the past. These are atomically removed with ZREM after pickup (use a Lua script for atomicity). This is called a time-wheel pattern: O(log N) insert, O(log N + K) fetch for K tasks. It handles millions of scheduled tasks efficiently. The scheduler runs in a leader node (chosen via distributed lock) to prevent double-execution. Tasks not yet due remain in the sorted set with future timestamps. For recurring tasks, after executing, the next occurrence is re-inserted with the next timestamp computed from the cron expression.

Question 2

How do you guarantee exactly-once execution in a distributed task scheduler?

Accepted Answer

Exactly-once requires two components: (1) Leader election via distributed lock — only one scheduler node dequeues tasks at a time, preventing multiple nodes from picking up the same task. Use Redis SET NX EX or etcd leader lease. If the leader crashes, a new leader acquires the lock and resumes. (2) Idempotent task execution — even with exactly-once dequeue, a worker crash after dequeue but before completion can cause re-execution on retry. Tag each task with a unique execution_id. Workers write to an execution_log table with a UNIQUE constraint on execution_id. If the task is retried, the second INSERT fails (ON CONFLICT DO NOTHING), and the task is skipped. True exactly-once delivery is impossible across distributed systems without idempotent consumers; the combination of dedup + unique execution_id gives effectively-once semantics.

Question 3

How do you implement task dependency DAGs in a scheduler like Airflow?

Accepted Answer

Task dependencies form a Directed Acyclic Graph (DAG). Each task has a list of upstream dependencies. A task is eligible to run only when all upstream tasks have completed successfully. Implementation: maintain a task_executions table with status (pending/running/success/failed) and a dependencies table linking child_task_id to parent_task_id. A scheduler loop queries: SELECT task_id FROM task_executions WHERE status=pending AND NOT EXISTS (SELECT 1 FROM dependencies d JOIN task_executions p ON d.parent_id=p.task_id WHERE d.child_id=task_id AND p.status != success). When a task completes, the scheduler checks which downstream tasks are now unblocked. For fan-out parallelism: once all predecessors succeed, all eligible successors can be enqueued simultaneously. Cycle detection at DAG definition time (topological sort) prevents deadlocks at runtime.

System Design Interview: Design a Task Scheduling System (Cron/Airflow)

What Is a Task Scheduling System?

System Requirements

Functional

Non-Functional

Core Data Model

Scheduling Architecture

Polling vs Event-Driven Trigger

Exactly-Once Execution

Cron Expression Parsing

DAG Dependencies (Airflow-style)

Retry and Backoff

Interview Tips