System Design: Infrastructure as Code — Terraform, Pulumi, CloudFormation, GitOps, Drift Detection, State Management

Infrastructure as Code (IaC) manages cloud infrastructure through versioned definition files rather than manual console clicks. The definitions are usually declarative (Terraform HCL, CloudFormation YAML), although tools such as Pulumi let you express the same desired state in a general-purpose language. IaC enables version control, code review, automated testing, and reproducible environments for your infrastructure. This guide covers Terraform, Pulumi, CloudFormation, and GitOps practices, all of which are essential knowledge for DevOps, platform engineering, and SRE interviews.

Why Infrastructure as Code

Manual infrastructure management (clicking through the AWS console) is error-prone, unreproducible, and unauditable. IaC solves these problems: (1) Version control: infrastructure configuration is stored in git alongside application code. Every change is tracked and reviewed, and most changes can be reversed by reverting the commit and re-applying. (2) Reproducibility: create near-identical environments (staging, production) from the same code. Differences between environments are limited to explicit parameters such as instance sizes, not forgotten manual tweaks. (3) Code review: infrastructure changes go through pull requests with peer review before applying. A teammate catches the misconfigured security group before it reaches production. (4) Automation: CI/CD pipelines automatically validate and apply infrastructure changes. Routine changes no longer require anyone to log into the console. (5) Documentation: the code is the documentation, provided every change goes through it. Want to know what infrastructure exists? Read the Terraform files. (6) Blast radius control: test infrastructure changes in a staging environment before production, and roll back by reverting a commit.

Without IaC, teams accumulate “snowflake” servers: manually configured, undocumented, impossible to reproduce. When that server fails at 3 AM, rebuilding it from memory is a nightmare.

Terraform: The Multi-Cloud Standard

Terraform (HashiCorp) is one of the most widely used IaC tools and the usual default for multi-cloud setups. It uses HCL (HashiCorp Configuration Language) to define infrastructure declaratively. You describe the desired state: “I want a VPC with 3 subnets, an RDS instance, and an ECS cluster.” Terraform compares that desired state with the current state (its state file, refreshed from the provider APIs), presents the difference as a plan, and applies only the necessary changes.

Core workflow: (1) terraform init: initialize the working directory and download provider plugins (AWS, GCP, Azure). (2) terraform plan: show what changes will be made without applying them. Review the plan to verify it matches expectations. (3) terraform apply: apply the changes to create, modify, or destroy resources.

State management: Terraform stores the current state of managed resources in a state file (terraform.tfstate). This file maps resource configurations to real infrastructure IDs. Store the state in a remote backend (S3 with DynamoDB for locking, or Terraform Cloud, now called HCP Terraform) and never commit it to git, because it records resource attributes, including sensitive values such as database passwords, in plain text. State locking prevents two engineers from applying changes simultaneously and corrupting the state.

Modules: reusable packages of Terraform configuration. A VPC module encapsulates the VPC, subnets, route tables, and NAT gateways. Teams use modules to enforce standards and reduce duplication.
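To make the plan step concrete, here is a minimal Python sketch of the idea behind it: diff the desired configuration against the recorded state and emit create, update, and delete actions. This is a toy model, not Terraform's implementation. The names `Action` and `compute_plan` and the sample resources are assumptions for illustration, and real Terraform additionally handles dependency ordering, forced replacement, and provider-specific diffing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    kind: str     # "create", "update", or "delete"
    address: str  # resource address, e.g. "aws_vpc.main"


def compute_plan(desired: dict[str, dict], current: dict[str, dict]) -> list[Action]:
    """Diff desired configuration against recorded state; return actions sorted by address."""
    actions = []
    for address, config in desired.items():
        if address not in current:
            actions.append(Action("create", address))   # declared but not yet built
        elif current[address] != config:
            actions.append(Action("update", address))   # exists, attributes differ
    for address in current:
        if address not in desired:
            actions.append(Action("delete", address))   # built but no longer declared
    return sorted(actions, key=lambda a: a.address)


# Hypothetical desired configuration (what the code says).
desired = {
    "aws_vpc.main": {"cidr_block": "10.0.0.0/16"},
    "aws_db_instance.app": {"instance_class": "db.t3.medium"},
}
# Hypothetical recorded state (what currently exists).
current = {
    "aws_vpc.main": {"cidr_block": "10.0.0.0/16"},
    "aws_db_instance.app": {"instance_class": "db.t3.small"},
    "aws_instance.legacy": {"instance_type": "t3.micro"},
}

plan = compute_plan(desired, current)
for action in plan:
    print(f"{action.kind:>6}  {action.address}")
```

The unchanged VPC produces no action, which is the property that makes repeated applies safe: applying the same configuration twice yields an empty plan the second time.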

Pulumi vs Terraform

Pulumi uses general-purpose programming languages (TypeScript, Python, Go, C#, Java) instead of a domain-specific language. This enables loops, conditionals, functions, classes, and abstractions that are awkward in HCL. Example: create 10 identical VMs with a for loop, or define a reusable component as a class with typed inputs and outputs. Pulumi advantages: (1) Full programming language: use the language your team already knows, with no new DSL to learn. (2) Testing: write unit tests for infrastructure using standard test frameworks (Jest, pytest, Go testing). Mock cloud resources and verify the configuration programmatically. (3) IDE support: autocomplete, type checking, refactoring, and documentation from your IDE.

Terraform advantages: (1) Larger ecosystem: more published modules and community examples, and most providers appear for Terraform first. (2) HCL is purpose-built for infrastructure: its limitations (no general-purpose programming) discourage over-engineering, so simple infrastructure stays simple. (3) Wider adoption: more engineers know Terraform, which makes hiring easier.

CloudFormation (AWS-only): tightly integrated with AWS, with AWS managing the state for you. New AWS services often gain CloudFormation support at or soon after launch, but coverage is not guaranteed, and the Terraform AWS provider sometimes supports a feature first. JSON or YAML syntax. Best for: AWS-only shops that want native integration. Worst for: multi-cloud or complex infrastructure (verbose, limited logic).
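The loop-and-component point made for Pulumi can be sketched in plain Python. The block below deliberately does not import the Pulumi SDK, so that it runs anywhere. `VmSpec` and `WebFleet` are hypothetical names, and in a real Pulumi program the `VmSpec(...)` call would be replaced by a provider resource constructor. What it shows is the shape of the idea: typed inputs, an ordinary loop and conditional, and outputs that a standard test can assert on.

```python
from dataclasses import dataclass, field


@dataclass
class VmSpec:
    """Stand-in for a cloud VM resource (hypothetical, not a Pulumi type)."""
    name: str
    size: str
    tags: dict[str, str] = field(default_factory=dict)


class WebFleet:
    """Hypothetical reusable component: typed inputs in, a list of VM specs out."""

    def __init__(self, prefix: str, count: int, size: str = "small", env: str = "staging"):
        if count < 1:
            raise ValueError("count must be at least 1")
        # An ordinary conditional: production always gets the larger size.
        effective_size = "large" if env == "production" else size
        # An ordinary loop: one VM per index, named consistently.
        self.vms = [
            VmSpec(name=f"{prefix}-{i}", size=effective_size, tags={"env": env})
            for i in range(count)
        ]


fleet = WebFleet(prefix="web", count=10, env="production")
print(len(fleet.vms), fleet.vms[0].name, fleet.vms[-1].name)
```

Because the component is just a class, a unit test can construct it and check names, sizes, and tags without touching a cloud account, which is the testing advantage described above.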

GitOps for Infrastructure

GitOps extends IaC by making git the single source of truth for both application and infrastructure configuration. Principles: (1) Declarative configuration: the entire system state is described declaratively in git. (2) Version controlled: git history is the audit log. Every change is a commit with an author, timestamp, and message. (3) Automated reconciliation: an agent continuously compares the actual state with the desired state in git and corrects any drift. (4) Pull-based deployment: the agent pulls changes from git, rather than CI pushing changes to the cluster. This is more secure because cluster credentials stay inside the cluster with the agent, and the CI system needs no deploy access.

Tools: ArgoCD and Flux are the standard GitOps controllers for Kubernetes. ArgoCD watches a git repository containing Kubernetes manifests (or Helm charts, Kustomize). When a change is merged, ArgoCD detects the difference and, if automated sync is enabled, applies it to the cluster. If someone manually changes a resource (kubectl edit), ArgoCD marks the application OutOfSync, and it reverts the change to match git only when self-healing is enabled in the sync policy. Otherwise it waits for someone to trigger a sync.

Workflow: developer commits infrastructure change -> PR review -> merge to main -> ArgoCD detects change -> applies to cluster -> verifies health. Rollback: revert the git commit, and ArgoCD applies the previous state.
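The automated reconciliation principle can be illustrated with a single pass of a control loop. This is a simplified sketch, not how ArgoCD or Flux are implemented: the dictionaries stand in for git and the cluster, and the `reconcile` function and its `self_heal` flag are hypothetical names chosen for this example.

```python
import copy


def reconcile(desired: dict[str, dict], cluster: dict[str, dict], self_heal: bool = True) -> list[str]:
    """One reconciliation pass.

    Returns the names of resources that were out of sync. When self_heal is
    True, the cluster dict is mutated to match desired; otherwise drift is
    only reported.
    """
    out_of_sync = [
        name
        for name in sorted(set(desired) | set(cluster))
        if desired.get(name) != cluster.get(name)
    ]
    if self_heal:
        for name in out_of_sync:
            if name in desired:
                cluster[name] = copy.deepcopy(desired[name])  # create or overwrite
            else:
                del cluster[name]  # prune resources that git no longer declares
    return out_of_sync


# Desired state as declared in git (hypothetical manifest reduced to a dict).
git_state = {"deployment/api": {"replicas": 3, "image": "api:1.4.2"}}
# Live state after someone scaled the deployment by hand.
cluster_state = {"deployment/api": {"replicas": 5, "image": "api:1.4.2"}}

drifted = reconcile(git_state, cluster_state)
print(drifted)        # resources that were out of sync on this pass
print(cluster_state)  # live state after the pass
```

A real controller runs this comparison continuously (or on a webhook) and a second pass over an already-synced cluster finds nothing to do. Calling `reconcile(..., self_heal=False)` corresponds to report-only mode, where drift is surfaced but left for a human to resolve.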

Drift Detection and Remediation

Infrastructure drift occurs when the actual state of resources diverges from the declared configuration. Causes: manual console changes (an engineer clicks “modify” in the AWS console), external processes (auto-scaling changes instance counts), and resource dependencies (an upstream change affects downstream resources). Drift is dangerous: the infrastructure code says one thing, reality is different. The next terraform apply may undo manual changes that were critical fixes. Expected changes, such as a capacity value owned by an autoscaler, should be excluded explicitly (for example with a lifecycle ignore_changes rule) so that they do not show up as noise.

Detection: (1) Terraform plan as drift detection: run terraform plan regularly (scheduled CI job) against the last applied commit. If the plan is not empty, the infrastructure no longer matches the code, most often because someone changed it outside of Terraform. The -detailed-exitcode flag makes this scriptable (exit code 2 means changes are pending), and terraform plan -refresh-only shows only what changed in the real infrastructure. (2) AWS Config: continuously monitors AWS resource configurations and alerts on changes that deviate from rules. (3) ArgoCD sync status: ArgoCD shows “OutOfSync” when the cluster state differs from git.

Remediation: (1) Prevent drift: restrict console access. Use IAM policies that allow read-only console access, so that all changes go through Terraform/git. (2) Detect and alert: run terraform plan in CI daily and alert on drift. (3) Auto-remediate: ArgoCD can automatically sync with self-healing (reverting manual changes to match git). Enable with caution, because auto-remediation may revert an emergency manual fix.
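A scheduled drift check built on terraform plan might look like the following sketch. It relies on the `-detailed-exitcode` flag, under which `terraform plan` exits with 0 when there are no changes, 1 on error, and 2 when changes are pending. The function names, the status strings, and the `infra/` directory are assumptions for illustration, and the demo at the bottom uses a stub runner so the block runs without Terraform installed.

```python
import subprocess
from typing import Callable

# -input=false keeps the command from prompting in a non-interactive CI job.
PLAN_CMD = ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"]


def run_plan(workdir: str) -> int:
    """Run terraform plan in workdir and return its exit code."""
    return subprocess.run(PLAN_CMD, cwd=workdir).returncode


def classify_exit_code(code: int) -> str:
    """Map a -detailed-exitcode result to a status label (labels are our own)."""
    if code == 0:
        return "in_sync"  # plan succeeded, nothing to change
    if code == 2:
        return "drift"    # plan succeeded, changes are pending
    return "error"        # plan itself failed (exit code 1 or anything unexpected)


def check_drift(workdir: str, runner: Callable[[str], int] = run_plan) -> str:
    """Return "in_sync", "drift", or "error" for the configuration in workdir."""
    return classify_exit_code(runner(workdir))


# Demo with a stub runner that pretends terraform reported pending changes.
status = check_drift("infra/", runner=lambda _workdir: 2)
print(status)
```

In a CI job you would call `check_drift("infra/")` with the default runner and send an alert when the result is `"drift"`. Treating `"error"` separately matters: a failed plan (expired credentials, unreachable backend) says nothing about drift and should page a different way.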
