Kubernetes Operators extend Kubernetes to manage complex applications (databases, message queues, ML platforms) using custom resources and controllers. Designing an operator tests your understanding of the Kubernetes control loop pattern, custom resource definitions, idempotent reconciliation, and the challenges of encoding operational knowledge into software. This guide covers operator architecture for infrastructure and platform engineering interviews.
What is a Kubernetes Operator?
An operator is a custom controller that watches for changes to custom resources (defined by CRDs) and takes action to make the actual state match the desired state. It encodes human operational knowledge into software: "when scaling up a PostgreSQL cluster, first add a replica, wait for it to sync, then update the load balancer." Without an operator, a human reads runbooks and manually performs these steps. With an operator, the human changes the desired state (spec.replicas: 3 -> spec.replicas: 5) and the operator handles the complex multi-step process automatically. Examples: (1) PostgreSQL Operator (Zalando) — manages PostgreSQL clusters: provisioning, replication, failover, backup, and scaling. (2) Strimzi Kafka Operator — manages Kafka clusters: broker deployment, topic creation, user management, and rolling upgrades. (3) Prometheus Operator — manages Prometheus instances, ServiceMonitors, and alerting rules. (4) cert-manager — manages TLS certificates: issuance, renewal, and DNS challenge handling. The operator pattern is Kubernetes' most powerful extensibility mechanism — it turns Kubernetes into a platform for managing anything, not just containers.
Custom Resource Definition (CRD)
A CRD defines a new resource type in the Kubernetes API. Example — a PostgresCluster CRD: apiVersion: postgres.example.com/v1, kind: PostgresCluster, spec: version: "16", replicas: 3, storage: {size: "100Gi", storageClass: "ssd"}, backup: {schedule: "0 2 * * *", retention: "7d"}. The CRD registers this resource type with the Kubernetes API server. Users create PostgresCluster resources using kubectl apply, just like native Kubernetes resources. The API server validates the resource against the CRD schema (OpenAPI v3 validation), stores it in etcd, and notifies watching controllers. CRD design best practices: (1) Spec = desired state (what the user wants). Status = actual state (what the operator observes). Separate concerns: users write spec, the operator writes status. (2) Sensible defaults — omitted fields should have reasonable defaults (backup enabled by default, monitoring enabled). (3) Validation — use CRD validation to reject invalid configurations before the operator sees them (replicas must be >= 1, version must be a supported version). (4) Status conditions — standardize status reporting: type (Ready, Synced, Degraded), status (True/False/Unknown), reason, message, lastTransitionTime.
Reconciliation Loop
The reconciliation loop is the heart of an operator. It runs continuously: (1) Watch — the controller watches for changes to the custom resource and related resources (Pods, Services, ConfigMaps created by the operator). (2) Reconcile — when a change is detected (or periodically), the reconcile function is called with the resource key. It: reads the current desired state (the spec), observes the actual state (query Kubernetes for the pods, services, etc.), computes the diff (what needs to change), and takes action to converge (create pods, update configs, delete old resources). (3) Requeue — if the reconciliation is not complete (e.g., waiting for a pod to become ready), requeue with a delay (e.g., 30 seconds) and check again. Idempotency: the reconcile function must be idempotent — calling it multiple times with the same input produces the same result. This is critical because: the reconcile function may be called multiple times for the same event (retries on error), events may arrive out of order, and multiple events may be coalesced into one reconcile call. Pattern: always compare desired vs actual state and take only the necessary actions. Do not assume the previous state — always observe. Example: “if replicas desired = 3 and actual = 2, create 1 pod” (not “create a pod because we received a scale event”).
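The observe-diff-act pattern can be sketched without any Kubernetes machinery. A toy reconcile over an in-memory "cluster" (fakeCluster and the function names are invented for illustration; a real controller would query the API server through the controller-runtime client):

```go
package main

import "fmt"

// fakeCluster stands in for the Kubernetes API server: the set of pods
// that currently exist, keyed by name.
type fakeCluster struct {
	pods map[string]bool
}

// reconcile converges actual state toward the desired replica count,
// StatefulSet style (ordinal pod names). It is idempotent: it only
// observes current state and applies the diff, never assuming what
// event triggered it or what a previous call did.
func reconcile(c *fakeCluster, name string, desiredReplicas int) (created, deleted []string) {
	// Desired state: name-0 .. name-(replicas-1).
	desired := map[string]bool{}
	for i := 0; i < desiredReplicas; i++ {
		desired[fmt.Sprintf("%s-%d", name, i)] = true
	}
	// Diff and act: create only the pods that are missing...
	for p := range desired {
		if !c.pods[p] {
			c.pods[p] = true
			created = append(created, p)
		}
	}
	// ...and delete only the pods that should no longer exist.
	for p := range c.pods {
		if !desired[p] {
			delete(c.pods, p)
			deleted = append(deleted, p)
		}
	}
	return created, deleted
}

func main() {
	// Desired = 3, actual = 2: exactly one pod (db-2) is created.
	c := &fakeCluster{pods: map[string]bool{"db-0": true, "db-1": true}}
	created, _ := reconcile(c, "db", 3)
	fmt.Println(created)
	// A duplicate or coalesced event is harmless: the second call is a no-op.
	created, deleted := reconcile(c, "db", 3)
	fmt.Println(len(created), len(deleted))
}
```

Because the function derives everything from desired vs observed state, retried, reordered, or coalesced events all converge to the same result.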
State Machine for Complex Operations
Complex operations (scaling a database, performing a rolling upgrade, running a backup) have multiple steps that must execute in order. Model these as a state machine in the operator. Example — scaling up a PostgreSQL cluster from 2 to 3 replicas: State 1: SCALING_UP — create the new Pod and PVC. Wait for the Pod to be Running. State 2: SYNCING — the new replica syncs data from the primary (streaming replication). Monitor the replication lag. Wait until lag < 1 second. State 3: UPDATING_LB — add the new replica to the read load balancer (update the Service endpoints or a custom routing resource). State 4: READY — all replicas are healthy and serving traffic. Update the CRD status: replicas: 3, readyReplicas: 3, conditions: [{type: Ready, status: True}]. Each reconciliation checks the current state and advances to the next when conditions are met. If a state fails (Pod crashes during sync), the operator retries or rolls back. The state is stored in the CRD status (status.phase: "Syncing") so it persists across operator restarts. Error handling: (1) Transient errors (API server timeout) — requeue with exponential backoff. (2) Permanent errors (invalid configuration) — set a condition: {type: Error, reason: "InvalidVersion"} and stop retrying. The user must fix the spec. (3) Degraded state (one replica is unhealthy) — set condition: {type: Degraded, reason: "ReplicaUnhealthy"} and attempt automatic recovery.
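The scale-up flow above can be sketched as a small phase-advancement function. This is an illustrative toy: the observed struct and its fields are invented stand-ins for what the operator would actually measure (pod status, replication lag, load balancer membership), and the current phase would be read from and written back to status.phase:

```go
package main

import "fmt"

// Phase is persisted in the CRD status (status.phase) so the state
// machine survives operator restarts.
type Phase string

const (
	ScalingUp  Phase = "ScalingUp"
	Syncing    Phase = "Syncing"
	UpdatingLB Phase = "UpdatingLB"
	Ready      Phase = "Ready"
)

// observed is what the operator measures on this reconcile pass
// (field names invented for the sketch).
type observed struct {
	newPodRunning  bool
	replicationLag float64 // seconds
	lbHasReplica   bool
}

// advance checks the current phase's exit condition and moves forward
// at most one step per reconcile pass. If the condition is not yet met,
// the phase is unchanged and the controller requeues to check again.
func advance(p Phase, o observed) Phase {
	switch p {
	case ScalingUp:
		if o.newPodRunning {
			return Syncing
		}
	case Syncing:
		if o.replicationLag < 1.0 { // wait until lag < 1 second
			return UpdatingLB
		}
	case UpdatingLB:
		if o.lbHasReplica {
			return Ready
		}
	}
	return p // exit condition not met: requeue and re-observe later
}

func main() {
	p := ScalingUp
	p = advance(p, observed{newPodRunning: true})
	fmt.Println(p) // Syncing
	p = advance(p, observed{newPodRunning: true, replicationLag: 0.2})
	fmt.Println(p) // UpdatingLB
}
```

Keeping each transition gated on observed conditions (rather than elapsed time or remembered events) means a crashed-and-restarted operator resumes from status.phase and simply re-checks the same condition.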
Operator SDK and Frameworks
Building an operator from scratch requires significant boilerplate: setting up informers, work queues, leader election, and metrics. Frameworks handle this: (1) Operator SDK (Red Hat) — scaffolding and libraries for Go, Ansible, and Helm-based operators. Generates the project structure, CRD, and reconcile skeleton. Integrates with OLM (Operator Lifecycle Manager) for installation and upgrades. (2) Kubebuilder — the Go framework underlying Operator SDK. Generates controllers, webhooks, and tests. Uses controller-runtime library. Most production Go operators use Kubebuilder or Operator SDK. (3) Kopf (Kubernetes Operator Pythonic Framework) — for Python operators. Simpler than Go-based frameworks. Good for operators that primarily orchestrate external APIs (cloud resources, SaaS integrations). (4) Metacontroller — a framework where you write reconciliation logic as a webhook (any language). Metacontroller handles the Kubernetes machinery; your webhook receives the desired state and returns the desired child resources. Leader election: in an HA deployment, multiple operator replicas run for redundancy. Only one should be active (the leader). The others wait. Kubernetes lease-based leader election ensures only one replica reconciles at a time. If the leader crashes, another replica takes over within seconds.