System Design: Infrastructure as Code — Terraform, Pulumi, CloudFormation, GitOps, Drift Detection, State Management

Infrastructure as Code (IaC) manages cloud infrastructure through declarative configuration files rather than manual console clicks. IaC enables version control, code review, automated testing, and reproducible environments for your infrastructure. This guide covers Terraform, Pulumi, CloudFormation, and GitOps practices — essential knowledge for DevOps, platform engineering, and SRE interviews.

Why Infrastructure as Code

Manual infrastructure management (clicking through the AWS console) is error-prone, hard to reproduce, and hard to audit. IaC addresses these problems:

(1) Version control: infrastructure configuration is stored in git alongside application code. Every change is tracked, reviewed, and reversible.
(2) Reproducibility: create near-identical environments (staging, production) from the same code, which minimizes configuration drift between them.
(3) Code review: infrastructure changes go through pull requests with peer review before they are applied. A teammate catches the misconfigured security group before it reaches production.
(4) Automation: CI/CD pipelines automatically validate and apply infrastructure changes. Nobody needs to log into the console for routine changes.
(5) Documentation: the code is the documentation. Want to know what infrastructure exists? Read the Terraform files.
(6) Blast radius control: test infrastructure changes in a staging environment before production. Roll back by reverting a commit and re-applying.

Without IaC, teams accumulate “snowflake” servers: manually configured, undocumented, and impossible to reproduce. When that server fails at 3 AM, rebuilding it from memory is a nightmare.

Terraform: The Multi-Cloud Standard

Terraform (HashiCorp) is one of the most widely used IaC tools. It uses HCL (HashiCorp Configuration Language) to define infrastructure declaratively. You describe the desired state: “I want a VPC with 3 subnets, an RDS instance, and an ECS cluster.” Terraform computes the difference between the desired state and the current state (the plan) and applies only the necessary changes.

Core workflow:

(1) terraform init: initialize the working directory and download provider plugins (AWS, GCP, Azure).
(2) terraform plan: show what changes will be made without applying them. Review the plan to verify it matches expectations.
(3) terraform apply: apply the changes to create, modify, or destroy resources.

State management: Terraform stores the current state of managed resources in a state file (terraform.tfstate). This file maps resource configurations to real infrastructure IDs. Store the state in a remote backend (for example S3 with DynamoDB for locking, or Terraform Cloud) and never commit it to git, because state can contain secrets in plain text. Recent Terraform versions can also lock state with a lock file in the S3 bucket itself. State locking prevents two engineers from applying changes simultaneously.

Modules: reusable packages of Terraform configuration. A VPC module encapsulates the VPC, subnets, route tables, and NAT gateways. Teams use modules to enforce standards and reduce duplication.
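The plan step described above can be sketched in a few lines of Python. This is a simplified illustration of diffing desired configuration against recorded state, not Terraform's actual algorithm (which also handles dependencies, provider refresh, and replacement). The `Action` type, `compute_plan` function, and the resource addresses are all hypothetical names chosen for the example.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    kind: str     # "create", "update", or "delete"
    address: str  # resource address, e.g. "aws_vpc.main"


def compute_plan(desired: dict[str, dict], current: dict[str, dict]) -> list[Action]:
    """Diff desired configuration against recorded state (simplified plan)."""
    actions = []
    for address in sorted(desired.keys() | current.keys()):
        if address not in current:
            actions.append(Action("create", address))   # declared but not yet built
        elif address not in desired:
            actions.append(Action("delete", address))   # built but no longer declared
        elif desired[address] != current[address]:
            actions.append(Action("update", address))   # attributes differ
        # Identical entries produce no action: only necessary changes are applied.
    return actions


# Hypothetical resources, for illustration only.
desired = {
    "aws_vpc.main": {"cidr_block": "10.0.0.0/16"},
    "aws_db_instance.app": {"instance_class": "db.t3.medium"},
}
current = {
    "aws_vpc.main": {"cidr_block": "10.0.0.0/16"},
    "aws_db_instance.app": {"instance_class": "db.t3.small"},
    "aws_ecs_cluster.old": {"name": "legacy"},
}

plan = compute_plan(desired, current)
for action in plan:
    print(f"{action.kind:>6}  {action.address}")
```

The unchanged VPC produces no action, which is the property that makes a plan safe to review: the output lists only what would actually change.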

Pulumi vs Terraform

Pulumi uses general-purpose programming languages (TypeScript, Python, Go, Java, C#) instead of a domain-specific language. This enables loops, conditionals, functions, classes, and abstractions that are awkward in HCL. HCL does offer count and for_each for simple repetition, but more involved logic quickly becomes hard to read. Example: create 10 identical VMs with a for loop, or define a reusable component as a class with typed inputs and outputs.

Pulumi advantages:

(1) Full programming language: use the language your team already knows. No new DSL to learn.
(2) Testing: write unit tests for infrastructure using standard test frameworks (Jest, pytest, Go testing). Mock cloud resources and verify the configuration programmatically.
(3) IDE support: autocomplete, type checking, refactoring, and documentation from your IDE.

Terraform advantages:

(1) Larger ecosystem: more providers, more modules, more community examples.
(2) HCL is purpose-built for infrastructure: its limitations (no general-purpose programming) discourage over-engineering. Simple infrastructure stays simple.
(3) Wider adoption: more engineers know Terraform, so hiring is easier.

CloudFormation (AWS-only): tightly integrated with AWS, with JSON or YAML syntax. New AWS services usually gain CloudFormation support at or soon after launch, although coverage is not guaranteed and the Terraform AWS provider sometimes supports a feature first. Best for: AWS-only shops that want native integration. Worst for: multi-cloud or complex infrastructure (verbose, limited logic).
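To make the "loop plus reusable class" idea concrete, here is a plain-Python stand-in. It deliberately does not import the Pulumi SDK, so it runs anywhere; in a real Pulumi program the class would typically subclass `pulumi.ComponentResource` and create provider resources instead of `VmSpec` records. `VmSpec`, `WebFleet`, and the zone names are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class VmSpec:
    """Stand-in for a cloud VM resource (hypothetical, not a Pulumi type)."""
    name: str
    size: str
    zone: str


class WebFleet:
    """Hypothetical reusable component: N identical VMs spread across zones."""

    def __init__(self, prefix: str, count: int, size: str, zones: list[str]):
        if count < 1:
            raise ValueError("count must be at least 1")
        if not zones:
            raise ValueError("at least one zone is required")
        # One loop replaces N copy-pasted resource definitions.
        self.vms = [
            VmSpec(name=f"{prefix}-{i}", size=size, zone=zones[i % len(zones)])
            for i in range(count)
        ]


fleet = WebFleet("web", count=10, size="small", zones=["a", "b", "c"])


def test_fleet_is_spread_across_zones() -> None:
    """The kind of unit test a standard framework such as pytest could run."""
    assert len(fleet.vms) == 10
    assert {vm.zone for vm in fleet.vms} == {"a", "b", "c"}


test_fleet_is_spread_across_zones()
```

Because the component is ordinary code, the input validation and the test run without touching any cloud account, which is the testing advantage the comparison above refers to.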

GitOps for Infrastructure

GitOps extends IaC by making git the single source of truth for both application and infrastructure configuration. Principles:

(1) Declarative configuration: the entire system state is described declaratively in git.
(2) Version controlled: git history is the audit log. Every change is a commit with an author, timestamp, and message.
(3) Automated reconciliation: an agent continuously compares the actual state with the desired state in git and corrects any drift.
(4) Pull-based deployment: the agent pulls changes from git, rather than CI pushing changes to the cluster. This narrows the attack surface, because cluster credentials stay with the agent and never need to be given to the CI system.

Tools: ArgoCD and Flux are the most widely used GitOps controllers for Kubernetes. ArgoCD watches a git repository containing Kubernetes manifests (plain YAML, Helm charts, or Kustomize). When a change is merged, ArgoCD detects the difference and, with automated sync enabled, applies it to the cluster. If someone manually changes a resource (kubectl edit), ArgoCD reports the drift, and reverts it to match git when self-healing is enabled.

Workflow: developer commits infrastructure change -> PR review -> merge to main -> ArgoCD detects change -> applies to cluster -> verifies health.

Rollback: revert the git commit, and ArgoCD applies the previous state.
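The reconciliation principle can be sketched as a single comparison step. This is a conceptual model only: it does not use ArgoCD's or Flux's APIs, and the `reconcile` function, the `self_heal` flag, and the manifest contents are hypothetical. The status strings mirror the "Synced" and "OutOfSync" labels ArgoCD displays.

```python
from __future__ import annotations

import copy


def reconcile(desired: dict, actual: dict, self_heal: bool) -> tuple[str, dict]:
    """Compare git (desired) with the cluster (actual); optionally correct drift.

    Returns the resulting sync status and the resulting cluster state.
    """
    if desired == actual:
        return "Synced", actual
    if not self_heal:
        # Drift is reported but left in place for a human to review.
        return "OutOfSync", actual
    # Drift is overwritten with the state declared in git.
    return "Synced", copy.deepcopy(desired)


# Hypothetical manifest: git declares 3 replicas, someone scaled to 5 by hand.
in_git = {"deployment/web": {"replicas": 3, "image": "web:1.4.2"}}
in_cluster = {"deployment/web": {"replicas": 5, "image": "web:1.4.2"}}

status_report_only, state_report_only = reconcile(in_git, in_cluster, self_heal=False)
status_healed, state_healed = reconcile(in_git, in_cluster, self_heal=True)

print(status_report_only, state_report_only["deployment/web"]["replicas"])
print(status_healed, state_healed["deployment/web"]["replicas"])
```

A real controller runs this comparison on a timer or on repository events, and a rollback is simply the same loop observing a reverted commit.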

Drift Detection and Remediation

Infrastructure drift occurs when the actual state of resources diverges from the declared configuration. Causes: manual console changes (an engineer clicks “modify” in the AWS console), external processes (auto-scaling changes instance counts), and resource dependencies (an upstream change affects downstream resources). Drift is dangerous because the infrastructure code says one thing while reality says another: the next terraform apply may undo manual changes that were critical fixes.

Detection:

(1) Terraform plan as drift detection: run terraform plan regularly (scheduled CI job) against configuration that has already been applied. If the plan shows changes, someone or something modified infrastructure outside of Terraform.
(2) AWS Config: continuously monitors AWS resource configurations and alerts on changes that deviate from rules.
(3) ArgoCD sync status: ArgoCD shows “OutOfSync” when the cluster state differs from git.

Remediation:

(1) Prevent drift: restrict console access. Use IAM policies that allow read-only console access. All changes go through Terraform/git.
(2) Detect and alert: run terraform plan in CI daily and alert on drift.
(3) Auto-remediate: ArgoCD can automatically sync (revert manual changes to match git). Enable with caution, because auto-remediation may revert an emergency manual fix.
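A scheduled drift check can be built on `terraform plan -detailed-exitcode`, which exits with 0 when the plan is empty, 1 on error, and 2 when changes are pending. The sketch below is one minimal way to wrap that in a CI job; the function names and the injectable `run` parameter are assumptions made so the logic can be exercised without Terraform installed.

```python
from __future__ import annotations

import subprocess
from typing import Callable

# -detailed-exitcode: 0 = empty plan, 1 = error, 2 = changes pending.
PLAN_CMD = ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"]


def classify_plan_exit(code: int) -> str:
    """Map a `terraform plan -detailed-exitcode` exit code to a drift status."""
    if code == 0:
        return "in_sync"
    if code == 2:
        # Changes pending: drift, if the configuration was already applied.
        return "drift"
    return "error"


def check_drift(
    workdir: str,
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> str:
    """Run the plan in `workdir` and classify the result.

    `run` is injectable (an assumption of this sketch) so tests can avoid
    invoking the real terraform binary.
    """
    result = run(PLAN_CMD, cwd=workdir, capture_output=True, text=True)
    return classify_plan_exit(result.returncode)


def fake_run(cmd, **kwargs):
    """Test double that pretends terraform reported pending changes."""
    return subprocess.CompletedProcess(cmd, returncode=2, stdout="", stderr="")


status = check_drift(".", run=fake_run)
print(status)
```

In a real pipeline the "drift" result would trigger an alert rather than an automatic apply, so that an emergency manual fix is reviewed before it is overwritten.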
