System Design: Event-Driven Architecture — Kafka, Event Sourcing, CQRS, Saga Pattern, Eventual Consistency

Event-driven architecture (EDA) is the backbone of modern distributed systems. Instead of services calling each other synchronously, they communicate by producing and consuming events. This decouples services, improves scalability, and enables powerful patterns like event sourcing and CQRS. This guide covers the practical implementation of event-driven systems using Apache Kafka — essential knowledge for system design interviews at top tech companies.

Events vs Commands vs Queries

Three types of messages in a distributed system: (1) Commands — requests to perform an action: PlaceOrder, CancelSubscription, SendEmail. A command is directed at a specific service and expects a result (success or failure). Commands can be rejected. (2) Events — notifications that something has happened: OrderPlaced, PaymentProcessed, UserRegistered. Events are facts — they describe something that already occurred and cannot be rejected. Events are broadcast to any interested consumer. (3) Queries — requests for information: GetOrderStatus, ListUserOrders. Queries do not change state. The critical distinction: commands are imperative (“do this”), events are declarative (“this happened”). A service that receives an OrderPlaced event decides independently what to do with it — the producing service does not know or care. This is the source of decoupling in event-driven systems.

Apache Kafka as the Event Backbone

Kafka is a distributed, durable, high-throughput event streaming platform. Core concepts: (1) Topics — named categories of events. order-events, payment-events, user-events. (2) Partitions — a topic is split into partitions for parallelism. Each partition is an ordered, immutable append-only log. Events within a partition are strictly ordered; events across partitions have no ordering guarantee. (3) Producers — applications that write events to topics. The producer chooses the partition: by key (all events for the same order_id go to the same partition, preserving per-entity ordering) or round-robin (for maximum throughput without ordering). (4) Consumers — applications that read events from topics. Consumer groups enable parallel consumption: each partition is assigned to exactly one consumer in the group. A 12-partition topic with 4 consumers in a group: each consumer handles 3 partitions. (5) Retention — Kafka retains events for a configurable period (7 days default) or indefinitely with log compaction. Events can be replayed from any offset.

Event Sourcing

Event sourcing stores the state of an entity as a sequence of events rather than the current state. Instead of storing an order with status=SHIPPED, store the sequence: OrderCreated, PaymentReceived, ItemsPacked, OrderShipped. The current state is derived by replaying all events in order. Benefits: (1) Complete audit trail — every state change is recorded. You can reconstruct the state at any point in time by replaying events up to that timestamp. (2) Temporal queries — “what was the order status at 3 PM yesterday?” Replay events up to that time. (3) Event replay — fix a bug in the projection logic, replay all events through the corrected logic, and rebuild the state correctly. (4) Decoupled read models — different consumers can build different views from the same events (an analytics view, a search index, a notification trigger). Implementation: store events in an event store (a database optimized for append-only writes and sequential reads). Each entity has a stream of events identified by stream_id (e.g., order-12345). Read the stream, apply each event to a state object, and return the current state. For performance, periodically create snapshots (the state after N events) to avoid replaying the entire history.

CQRS: Command Query Responsibility Segregation

CQRS separates the write model (commands) from the read model (queries). The write side handles business logic, validation, and state changes. The read side is optimized for queries with denormalized, pre-computed views. In an event-sourced CQRS system: the write side appends events to the event store. The read side consumes these events and updates materialized views (projections). Example: an e-commerce order system. Write side: validates the order, checks inventory, processes payment, emits OrderCreated event. Read side: consumes OrderCreated, updates a denormalized orders_view table with order details, customer info, and item names (pre-joined for fast reads). The read side may have multiple projections: one optimized for the customer order history page, another for the admin dashboard, another feeding an Elasticsearch index for order search. Benefits: independent scaling (reads are typically 10-100x more frequent than writes), optimized data models for each use case, and temporal queries via event replay.

The Saga Pattern for Distributed Transactions

In a microservices architecture, a business operation may span multiple services. Placing an order involves: order service (create order), inventory service (reserve items), payment service (charge card), shipping service (schedule delivery). A traditional distributed transaction (two-phase commit) is impractical across microservices — it requires all participants to be available simultaneously and creates tight coupling. The saga pattern replaces one distributed transaction with a sequence of local transactions, each publishing an event that triggers the next step. Two saga implementations: (1) Choreography — each service listens for events and produces events. Order service emits OrderCreated. Inventory service hears it, reserves items, emits ItemsReserved. Payment service hears it, charges card, emits PaymentProcessed. No central coordinator. Simple for 3-4 steps; becomes hard to follow for complex workflows. (2) Orchestration — a saga orchestrator service coordinates the steps. It sends commands to each service in sequence and handles responses. The orchestrator knows the full workflow, making complex sagas easier to understand and debug.

Compensating Transactions

When a saga step fails, the previously completed steps must be undone. Compensating transactions are the “undo” operations for each step. Example: order saga step 3 (payment) fails. Compensating actions: step 2 compensation — release reserved inventory (inventory service). Step 1 compensation — mark order as cancelled (order service). Compensating transactions are not always a perfect undo — they are semantic reversals. A payment refund is not the same as the payment never happening (the customer credit card was charged and refunded, which may appear on their statement). Design compensations carefully: they must be idempotent (safe to execute multiple times — a network failure during compensation may cause a retry) and must work even if the original operation partially completed. For some operations, compensation is impossible — sending an email cannot be unsent. In these cases, delay the irreversible step to the end of the saga, or send a correction email as the compensation.

Eventual Consistency and Its Implications

Event-driven systems are eventually consistent: after an event is produced, there is a delay before all consumers process it and update their state. The order service creates an order (write side), but the order history page (read side, consuming events via Kafka) may not show it for 100ms-2 seconds. Implications: (1) Read-your-own-writes — after creating an order, the user should see it immediately. Solution: read from the write model for the creating user (bypass the eventually consistent read model for the current session). (2) Stale reads — a dashboard may show slightly outdated data. Acceptable for most use cases; add a “last updated” timestamp to set expectations. (3) Ordering challenges — two events may arrive out of order if they are on different Kafka partitions. Use the same partition key (entity ID) for related events to guarantee ordering within an entity. (4) Idempotent consumers — network issues may cause event redelivery. Consumers must handle duplicate events gracefully. Track processed event IDs in the consumer state to detect and skip duplicates. Eventual consistency is the price of decoupling and scalability — accept it where possible, mitigate it where necessary.

Scroll to Top