What is Event-Driven Architecture?
In event-driven architecture (EDA), services communicate by publishing and consuming events rather than making direct synchronous calls to each other. An event is an immutable record of something that happened: “OrderPlaced”, “PaymentProcessed”, “InventoryReserved”. Services are decoupled – the publisher does not know or care who consumes its events.
The backbone is an event bus or message broker (Kafka, RabbitMQ, AWS EventBridge). Producers write events to topics. Consumers subscribe to topics and process events independently, at their own pace. This enables async workflows, fan-out to multiple consumers, and natural decoupling between teams and services.
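The publish/subscribe mechanics can be sketched with a toy in-process bus (an illustration of the pattern, not a stand-in for a real broker like Kafka or RabbitMQ):

```python
from collections import defaultdict

class InMemoryEventBus:
    """Toy in-process stand-in for a broker: topics map to subscriber callbacks."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, event):
        # Fan-out: every subscriber on the topic sees the event; the
        # publisher never learns who (or how many) consumed it.
        for handler in self._subscribers[topic]:
            handler(event)

bus = InMemoryEventBus()
payments, shipping = [], []
bus.subscribe("orders", payments.append)
bus.subscribe("orders", shipping.append)
bus.publish("orders", {"type": "OrderPlaced", "order_id": 42})
# Both consumers received the event; the publisher knows nothing about either.
```

A real broker adds persistence, ordering, and independent consumer offsets, but the decoupling shown here is the core idea.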
Event Sourcing
Instead of storing current state in a mutable row, event sourcing stores every state change as an immutable event in an append-only log. Current state is derived by replaying all events from the beginning (or from a snapshot).
Example: instead of an “accounts” table with a balance column, you store events: “AccountOpened”, “MoneyDeposited(100)”, “MoneyWithdrawn(30)”. Replay these to get the current balance of 70.
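The account example amounts to a fold over the event log. A minimal sketch, using plain dicts as events:

```python
def replay_balance(events):
    """Derive current state by replaying the log; the log is the source of truth."""
    balance = 0
    for event in events:
        if event["type"] == "MoneyDeposited":
            balance += event["amount"]
        elif event["type"] == "MoneyWithdrawn":
            balance -= event["amount"]
        # "AccountOpened" carries no balance change
    return balance

log = [
    {"type": "AccountOpened"},
    {"type": "MoneyDeposited", "amount": 100},
    {"type": "MoneyWithdrawn", "amount": 30},
]
print(replay_balance(log))  # 70
```

A snapshot is just a saved intermediate `balance` plus the log position it was taken at, so replay only covers events after it.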
Benefits: a complete audit log as a by-product (the event log is the primary store, so no separate audit table is needed), time travel (replay up to any point in time), the ability to rebuild read projections from scratch, and the event log as an integration contract between services.
Costs: querying current state requires replaying or maintaining projections, eventual consistency between write and read models, increased operational complexity, and storage growth over time (mitigated by snapshots).
CQRS – Command Query Responsibility Segregation
CQRS separates the write model (commands) from the read model (queries). The write model handles commands (“PlaceOrder”), validates business rules, and emits domain events. The read model is a set of materialized views – denormalized, query-optimized projections built by consuming those events.
A command goes through: API -> Command Handler -> Domain Model -> Event Store -> Event published -> Read Model Projector updates query-optimized view -> future queries hit the read model directly.
This enables independent scaling: reads are typically 10x-100x more frequent than writes, so you can scale read replicas without touching the write side. You can also have multiple read models optimized for different query patterns (search index, dashboard aggregate, mobile feed).
The tradeoff is eventual consistency: after a write, the read model may lag by milliseconds to seconds. For user-facing operations that need immediate read-your-writes consistency, you need either a synchronous projection update or optimistic UI updates on the client.
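The “Read Model Projector” step in the command flow above might look like the following sketch, where an in-memory dict stands in for a query-optimized store and the event shapes are illustrative:

```python
class OrderSummaryProjector:
    """Builds a denormalized read model by consuming domain events."""

    def __init__(self):
        # order_id -> summary; stands in for a materialized view or cache.
        self.view = {}

    def handle(self, event):
        if event["type"] == "OrderPlaced":
            self.view[event["order_id"]] = {"status": "PENDING", "total": event["total"]}
        elif event["type"] == "PaymentProcessed":
            self.view[event["order_id"]]["status"] = "PAID"

projector = OrderSummaryProjector()
projector.handle({"type": "OrderPlaced", "order_id": 1, "total": 99})
projector.handle({"type": "PaymentProcessed", "order_id": 1})
print(projector.view[1])  # {'status': 'PAID', 'total': 99}
```

Queries read `view` directly and never touch the write model; the lag between an event being emitted and `handle` running is exactly the eventual-consistency window described above.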
Choreography vs Orchestration
Two approaches to coordinating multi-step workflows in EDA:
Choreography: each service reacts to events and publishes new events. OrderService publishes “OrderPlaced” -> PaymentService listens, charges card, publishes “PaymentProcessed” -> InventoryService listens, reserves stock, publishes “InventoryReserved” -> ShippingService listens and schedules delivery. No central coordinator. Fully decoupled. The downside: the overall business flow is implicit and scattered across services. Debugging and tracing a failed order requires correlating logs across 4 services.
Orchestration: a central saga orchestrator (a dedicated service or workflow engine like AWS Step Functions, Temporal, or Conductor) sends commands to each service and waits for responses. The orchestrator knows the full business flow explicitly. Easier to understand, trace, and modify. The downside: the orchestrator becomes a central dependency. If it fails, workflows stall.
Choose choreography for simple, stable workflows with few steps. Choose orchestration for complex multi-step workflows where observability and error handling are critical.
Saga Pattern for Distributed Transactions
ACID transactions do not span service boundaries in a distributed system (two-phase commit exists, but it couples services and scales poorly). The saga pattern achieves eventual consistency instead, through a sequence of local transactions, each publishing an event that triggers the next step. If any step fails, compensating transactions undo the previous steps.
Example – placing an order:
- OrderService creates order (state: PENDING), publishes “OrderCreated”
- PaymentService charges card, publishes “PaymentProcessed” – or on failure publishes “PaymentFailed”
- InventoryService reserves stock, publishes “InventoryReserved” – or on failure publishes “InventoryFailed”
- ShippingService schedules shipment, publishes “ShipmentScheduled”
- OrderService updates order to CONFIRMED
Rollback chain if InventoryFailed: InventoryService publishes “InventoryFailed” -> PaymentService listens and refunds charge, publishes “PaymentRefunded” -> OrderService listens and marks order CANCELLED.
Compensating transactions must be idempotent and designed to succeed eventually – retried until they do, because there is no rollback for a rollback. If one still cannot complete, route it to a dead-letter queue for manual intervention. Design compensating transactions before designing the forward path.
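The rollback chain can be sketched as a minimal saga runner that records each completed step’s compensation and, on failure, runs them in reverse order (an orchestration-style toy, not a production workflow engine):

```python
class Saga:
    """Each step pairs a forward action with its compensating transaction.
    On failure, completed steps are compensated most-recent-first."""

    def __init__(self):
        self._steps = []  # list of (action, compensation) pairs

    def add_step(self, action, compensation):
        self._steps.append((action, compensation))

    def run(self):
        completed = []
        for action, compensation in self._steps:
            try:
                action()
            except Exception:
                # Undo every completed step, most recent first.
                for undo in reversed(completed):
                    undo()  # must be idempotent; retried until success in real systems
                return False
            completed.append(compensation)
        return True

def reserve_inventory():
    raise RuntimeError("out of stock")  # simulate the failing step

trace = []
saga = Saga()
saga.add_step(lambda: trace.append("order PENDING"),
              lambda: trace.append("order CANCELLED"))
saga.add_step(lambda: trace.append("card charged"),
              lambda: trace.append("charge refunded"))
saga.add_step(reserve_inventory,
              lambda: trace.append("stock released"))
ok = saga.run()
print(ok, trace)
# False ['order PENDING', 'card charged', 'charge refunded', 'order CANCELLED']
```

In a choreographed saga the same undo chain happens implicitly: each service listens for the failure event and publishes its own compensation event, as in the InventoryFailed flow above.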
Event Schema and Versioning
Events are a public contract. Once consumers depend on an event schema, breaking changes cause failures. Versioning strategy:
Backward-compatible changes (safe): add optional fields with defaults. Consumers that do not read the new field are unaffected.
Breaking changes: publish a new event version (“OrderPlaced.v2”) alongside the old one. Update consumers to handle both. Once all consumers are on v2, deprecate v1. Never mutate the schema of a live event version.
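One common way consumers cope during the migration window is upcasting: translate old event versions to the current schema at the edge, so handler logic only ever sees one shape. A sketch, where the `currency` field is a hypothetical v2 addition invented for illustration:

```python
def upcast(event):
    """Lift older event versions to the current schema before handling.
    'currency' is a hypothetical field added in v2 of OrderPlaced."""
    if event.get("version", 1) == 1:
        # v1 events lack 'currency'; apply the documented default.
        return {**event, "version": 2, "currency": "USD"}
    return event

def handle_order_placed(event):
    event = upcast(event)  # handler body only ever deals with the v2 shape
    return (event["order_id"], event["currency"])

print(handle_order_placed({"type": "OrderPlaced", "version": 1, "order_id": 7}))
# (7, 'USD')
```

Once all stored and in-flight v1 events are drained or upcast, the v1 branch can be deleted.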
Use a schema registry (Confluent Schema Registry for Kafka, AWS Glue Schema Registry) to enforce compatibility rules at publish time. Avro or Protobuf provide compact binary encoding and built-in schema evolution rules. JSON is more flexible but offers no enforcement.
Idempotent Event Consumers
Most message brokers guarantee at-least-once delivery. A consumer may receive the same event multiple times (broker retry, consumer crash after processing but before committing the offset). Consumers must be idempotent: processing the same event twice produces the same result as processing it once.
Implementation: store a processed_events table with the event ID. Before processing, check if the ID exists. If yes, skip. If no, process and insert the ID in the same local transaction. This is the deduplication check pattern.
Alternative for naturally idempotent operations: use upserts instead of inserts (“set balance = X” instead of “add Y to balance”). Design operations to be safe to repeat.
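The deduplication check pattern can be sketched with SQLite standing in for the consumer’s local database; the key point is that the dedup insert and the business update share one transaction:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE processed_events (event_id TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE balances (account TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO balances VALUES ('acct-1', 0)")
conn.commit()

def consume(event):
    """Dedup check, business update, and dedup insert in one local transaction."""
    with conn:  # commits on success, rolls back on error
        seen = conn.execute(
            "SELECT 1 FROM processed_events WHERE event_id = ?", (event["id"],)
        ).fetchone()
        if seen:
            return  # duplicate delivery: skip
        conn.execute(
            "UPDATE balances SET balance = balance + ? WHERE account = ?",
            (event["amount"], event["account"]),
        )
        conn.execute("INSERT INTO processed_events VALUES (?)", (event["id"],))

event = {"id": "evt-1", "account": "acct-1", "amount": 100}
consume(event)
consume(event)  # broker redelivery: no double credit
print(conn.execute("SELECT balance FROM balances WHERE account = 'acct-1'").fetchone()[0])  # 100
```

If the process crashes mid-transaction, neither the balance update nor the dedup row is committed, so the redelivered event is processed cleanly from scratch.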
Outbox Pattern for Reliable Event Publishing
A common bug: service updates the database and then publishes an event. If the process crashes between the two steps, the DB is updated but the event is never published – silent data inconsistency.
The outbox pattern fixes this: in the same local DB transaction that updates business data, write the event to an “outbox” table. A separate background process (or change data capture via Debezium) reads unprocessed rows from the outbox table and publishes them to the event bus, then marks them processed. The DB transaction guarantees atomicity between the business update and the outbox write. The background publisher handles retries if the broker is temporarily unavailable.
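A minimal sketch of both halves of the pattern, again with SQLite as the service’s database and a plain callback standing in for the broker client:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
conn.execute(
    "CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " payload TEXT, published INTEGER DEFAULT 0)"
)
conn.commit()

def place_order(order_id):
    # Business update and outbox write commit atomically in one transaction:
    # either both happen or neither does.
    with conn:
        conn.execute("INSERT INTO orders VALUES (?, 'PENDING')", (order_id,))
        conn.execute(
            "INSERT INTO outbox (payload) VALUES (?)",
            (json.dumps({"type": "OrderCreated", "order_id": order_id}),),
        )

def publish_pending(broker_publish):
    # Background relay: push unpublished rows to the broker, then mark them.
    rows = conn.execute("SELECT id, payload FROM outbox WHERE published = 0").fetchall()
    for row_id, payload in rows:
        broker_publish(json.loads(payload))  # a real relay retries on broker errors
        conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    conn.commit()

sent = []
place_order(1)
publish_pending(sent.append)
print(sent)  # [{'type': 'OrderCreated', 'order_id': 1}]
```

Note the relay gives at-least-once publishing (a crash between `broker_publish` and the mark-as-published update causes a resend), which is exactly why consumers must be idempotent as described above.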
When to Use EDA vs Direct Service Calls
Use EDA when: the workflow is async and does not need an immediate response, multiple consumers need to react to the same event (fan-out), services should be decoupled across team boundaries, workload is bursty and consumers need to process at their own pace.
Use direct calls (REST/gRPC) when: the operation is synchronous and user-facing (user clicks “submit” and waits for confirmation), the workflow has 1-2 steps with no fan-out, strong consistency is required, or the overall system is simple enough that the overhead of a message broker is not justified.
EDA adds real complexity: a message broker to operate, event schema management, idempotency requirements, debugging across async boundaries, and eventual consistency tradeoffs. Apply it where the decoupling and async benefits are concrete, not as a default architecture style.