Event-driven architecture (EDA) is the backbone of modern distributed systems. Instead of services calling each other synchronously, they communicate by producing and consuming events. This decouples services, improves scalability, and enables powerful patterns like event sourcing and CQRS. This guide covers the practical implementation of event-driven systems using Apache Kafka — essential knowledge for system design interviews at top tech companies.
Events vs Commands vs Queries
There are three types of messages in a distributed system: (1) Commands — requests to perform an action: PlaceOrder, CancelSubscription, SendEmail. A command is directed at a specific service and expects a result (success or failure). Commands can be rejected. (2) Events — notifications that something has happened: OrderPlaced, PaymentProcessed, UserRegistered. Events are facts — they describe something that already occurred and cannot be rejected. Events are broadcast to any interested consumer. (3) Queries — requests for information: GetOrderStatus, ListUserOrders. Queries do not change state. The critical distinction: commands are imperative (“do this”), events are declarative (“this happened”). A service that receives an OrderPlaced event decides independently what to do with it — the producing service does not know or care. This is the source of decoupling in event-driven systems.
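The command/event distinction can be sketched in a few lines of Python. This is an illustration, not a framework: the names (PlaceOrder, OrderPlaced, handle_place_order) are hypothetical, chosen to mirror the examples above.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Command: imperative, directed at one specific service, can be rejected.
@dataclass
class PlaceOrder:
    order_id: str
    amount: float

# Event: a fact that already happened; consumers cannot reject it.
@dataclass
class OrderPlaced:
    order_id: str
    amount: float
    occurred_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

def handle_place_order(cmd: PlaceOrder) -> OrderPlaced:
    """The order service validates the command; only if validation
    succeeds does it publish the corresponding event."""
    if cmd.amount <= 0:
        raise ValueError("order rejected: amount must be positive")
    return OrderPlaced(order_id=cmd.order_id, amount=cmd.amount)
```

Note the asymmetry: the command handler can raise (reject), but once an OrderPlaced event exists it is a permanent fact that downstream consumers react to independently.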
Apache Kafka as the Event Backbone
Kafka is a distributed, durable, high-throughput event streaming platform. Core concepts: (1) Topics — named categories of events. order-events, payment-events, user-events. (2) Partitions — a topic is split into partitions for parallelism. Each partition is an ordered, immutable append-only log. Events within a partition are strictly ordered; events across partitions have no ordering guarantee. (3) Producers — applications that write events to topics. The producer chooses the partition: by key (all events for the same order_id go to the same partition, preserving per-entity ordering) or round-robin (for maximum throughput without ordering). (4) Consumers — applications that read events from topics. Consumer groups enable parallel consumption: each partition is assigned to exactly one consumer in the group. A 12-partition topic with 4 consumers in a group: each consumer handles 3 partitions. (5) Retention — Kafka retains events for a configurable period (7 days default) or indefinitely with log compaction. Events can be replayed from any offset.
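The key-to-partition mapping can be illustrated with a toy sketch. Kafka's default partitioner actually uses murmur2 hashing; crc32 stands in here only to keep the example dependency-free and deterministic, and the 12-partition count mirrors the example above.

```python
import zlib

NUM_PARTITIONS = 12

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    # Kafka's default partitioner hashes the key with murmur2;
    # crc32 is a stand-in that shows the same idea: same key -> same partition.
    return zlib.crc32(key.encode("utf-8")) % num_partitions

# All events for the same order land in the same partition,
# so their relative order is preserved for that order.
events = ["OrderCreated", "PaymentReceived", "ItemsShipped"]
partitions_used = {partition_for("order-12345") for _ in events}
```

Because every event keyed by `order-12345` hashes to the same partition, `partitions_used` contains exactly one partition number; events for other orders may land elsewhere, which is fine because cross-order ordering is not needed.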
Event Sourcing
Event sourcing stores the state of an entity as a sequence of events rather than the current state. Instead of storing an order with status=SHIPPED, store the sequence: OrderCreated, PaymentReceived, ItemsPacked, OrderShipped. The current state is derived by replaying all events in order. Benefits: (1) Complete audit trail — every state change is recorded. You can reconstruct the state at any point in time by replaying events up to that timestamp. (2) Temporal queries — “what was the order status at 3 PM yesterday?” Replay events up to that time. (3) Event replay — fix a bug in the projection logic, replay all events through the corrected logic, and rebuild the state correctly. (4) Decoupled read models — different consumers can build different views from the same events (an analytics view, a search index, a notification trigger). Implementation: store events in an event store (a database optimized for append-only writes and sequential reads). Each entity has a stream of events identified by stream_id (e.g., order-12345). Read the stream, apply each event to a state object, and return the current state. For performance, periodically create snapshots (the state after N events) to avoid replaying the entire history.
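The replay-and-snapshot mechanics above can be sketched as follows. The state shape and event names are hypothetical, matching the order example; a real event store would persist streams durably rather than hold them in a list.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class OrderState:
    status: str = "NEW"
    events_applied: int = 0

def apply_event(state: OrderState, event: str) -> OrderState:
    # Each event type maps to a state transition.
    transitions = {
        "OrderCreated": "CREATED",
        "PaymentReceived": "PAID",
        "ItemsPacked": "PACKED",
        "OrderShipped": "SHIPPED",
    }
    state.status = transitions[event]
    state.events_applied += 1
    return state

def load(stream: List[str],
         snapshot: Optional[OrderState] = None,
         snapshot_offset: int = 0) -> OrderState:
    """Rebuild current state by replaying events, optionally starting
    from a snapshot instead of the full history."""
    state = snapshot or OrderState()
    for event in stream[snapshot_offset:]:
        state = apply_event(state, event)
    return state

stream = ["OrderCreated", "PaymentReceived", "ItemsPacked", "OrderShipped"]
```

A temporal query is just a truncated replay: `load(stream[:2])` yields the state as it was after the payment. A snapshot after N events lets `load` skip straight to offset N.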
CQRS: Command Query Responsibility Segregation
CQRS separates the write model (commands) from the read model (queries). The write side handles business logic, validation, and state changes. The read side is optimized for queries with denormalized, pre-computed views. In an event-sourced CQRS system: the write side appends events to the event store. The read side consumes these events and updates materialized views (projections). Example: an e-commerce order system. Write side: validates the order, checks inventory, processes payment, emits OrderCreated event. Read side: consumes OrderCreated, updates a denormalized orders_view table with order details, customer info, and item names (pre-joined for fast reads). The read side may have multiple projections: one optimized for the customer order history page, another for the admin dashboard, another feeding an Elasticsearch index for order search. Benefits: independent scaling (reads are typically 10-100x more frequent than writes), optimized data models for each use case, and temporal queries via event replay.
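A minimal projection for the read side might look like the sketch below. The event shape and view fields are illustrative; in practice the projection would consume from Kafka and write to a database or search index rather than an in-memory dict.

```python
# Read side: consumes events emitted by the write side and maintains
# a denormalized view keyed for fast lookup (pre-joined, no queries needed).
orders_view = {}  # order_id -> row for the customer order history page

def project(event: dict) -> None:
    if event["type"] == "OrderCreated":
        orders_view[event["order_id"]] = {
            "customer": event["customer"],
            "items": event["items"],
            "status": "CREATED",
        }
    elif event["type"] == "OrderShipped":
        orders_view[event["order_id"]]["status"] = "SHIPPED"

for e in [
    {"type": "OrderCreated", "order_id": "o-1",
     "customer": "Ada", "items": ["book"]},
    {"type": "OrderShipped", "order_id": "o-1"},
]:
    project(e)
```

Adding a second projection (say, an admin dashboard view) means writing another consumer over the same event stream — the write side is untouched, which is the point of the separation.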
The Saga Pattern for Distributed Transactions
In a microservices architecture, a business operation may span multiple services. Placing an order involves: order service (create order), inventory service (reserve items), payment service (charge card), shipping service (schedule delivery). A traditional distributed transaction (two-phase commit) is impractical across microservices — it requires all participants to be available simultaneously and creates tight coupling. The saga pattern replaces one distributed transaction with a sequence of local transactions, each publishing an event that triggers the next step. Two saga implementations: (1) Choreography — each service listens for events and produces events. Order service emits OrderCreated. Inventory service hears it, reserves items, emits ItemsReserved. Payment service hears it, charges card, emits PaymentProcessed. No central coordinator. Simple for 3-4 steps; becomes hard to follow for complex workflows. (2) Orchestration — a saga orchestrator service coordinates the steps. It sends commands to each service in sequence and handles responses. The orchestrator knows the full workflow, making complex sagas easier to understand and debug.
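The orchestration variant, including rollback, can be sketched as a generic step runner. The step names and lambdas are stand-ins for calls to real services; a production orchestrator would also persist saga state so it survives restarts.

```python
def run_saga(steps):
    """steps: list of (name, action, compensation) tuples.
    Each action returns True on success; on failure the orchestrator
    runs the compensations of completed steps in reverse order."""
    completed = []
    for name, action, compensate in steps:
        if action():
            completed.append((name, compensate))
        else:
            for _done_name, undo in reversed(completed):
                undo()
            return f"rolled back after {name} failed"
    return "completed"

log = []
steps = [
    ("create_order",  lambda: log.append("order created") or True,
                      lambda: log.append("order cancelled")),
    ("reserve_stock", lambda: log.append("stock reserved") or True,
                      lambda: log.append("stock released")),
    ("charge_card",   lambda: False,  # payment fails in this run
                      lambda: log.append("payment refunded")),
]
result = run_saga(steps)
```

When the payment step fails, the log shows the rollback running in reverse: stock is released before the order is cancelled, mirroring the compensating-transaction ordering described below.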
Compensating Transactions
When a saga step fails, the previously completed steps must be undone. Compensating transactions are the “undo” operations for each step. Example: order saga step 3 (payment) fails. Compensating actions: step 2 compensation — release reserved inventory (inventory service). Step 1 compensation — mark order as cancelled (order service). Compensating transactions are not always a perfect undo — they are semantic reversals. A payment refund is not the same as the payment never happening (the customer's credit card was charged and refunded, which may appear on their statement). Design compensations carefully: they must be idempotent (safe to execute multiple times — a network failure during compensation may cause a retry) and must work even if the original operation partially completed. For some operations, compensation is impossible — an email, once sent, cannot be unsent. In these cases, delay the irreversible step to the end of the saga, or send a correction email as the compensation.
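The idempotency requirement can be made concrete with a small sketch of the inventory-release compensation (function and field names are hypothetical): the compensation removes the reservation as it releases it, so a retry finds nothing left to release and becomes a no-op.

```python
reserved = {"order-12345": 3}  # order_id -> units held for that order
stock = {"sku-1": 7}           # sku -> units available

def release_inventory(order_id: str, sku: str) -> None:
    """Compensation for a stock reservation. Idempotent: a retry after
    a network failure finds no reservation and changes nothing."""
    units = reserved.pop(order_id, 0)  # 0 on retry: already released
    stock[sku] += units

release_inventory("order-12345", "sku-1")
release_inventory("order-12345", "sku-1")  # retried delivery: no-op
```

The non-idempotent version — `stock[sku] += 3` unconditionally — would corrupt inventory counts on every retry, which is exactly the failure mode the design rule above guards against.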
Eventual Consistency and Its Implications
Event-driven systems are eventually consistent: after an event is produced, there is a delay before all consumers process it and update their state. The order service creates an order (write side), but the order history page (read side, consuming events via Kafka) may not show it for 100ms-2 seconds. Implications: (1) Read-your-own-writes — after creating an order, the user should see it immediately. Solution: read from the write model for the creating user (bypass the eventually consistent read model for the current session). (2) Stale reads — a dashboard may show slightly outdated data. Acceptable for most use cases; add a “last updated” timestamp to set expectations. (3) Ordering challenges — two events may arrive out of order if they are on different Kafka partitions. Use the same partition key (entity ID) for related events to guarantee ordering within an entity. (4) Idempotent consumers — network issues may cause event redelivery. Consumers must handle duplicate events gracefully. Track processed event IDs in the consumer state to detect and skip duplicates. Eventual consistency is the price of decoupling and scalability — accept it where possible, mitigate it where necessary.
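Point (4), the idempotent consumer, can be sketched by tracking processed event IDs. The event shape is illustrative; in production the processed-ID set would live in the consumer's durable store (often in the same transaction as the state update) rather than in memory.

```python
processed_ids = set()       # durable in a real consumer
balance = {"acct-1": 0}     # consumer-side state being updated

def consume(event: dict) -> bool:
    """Apply the event exactly once; skip duplicates caused by
    broker redelivery or consumer restarts."""
    if event["event_id"] in processed_ids:
        return False  # duplicate: already applied, safe to skip
    balance[event["account"]] += event["amount"]
    processed_ids.add(event["event_id"])
    return True

deposit = {"event_id": "evt-42", "account": "acct-1", "amount": 100}
consume(deposit)
consume(deposit)  # redelivered by the broker; must not double-apply
```

Without the ID check, the redelivered deposit would be applied twice and the balance would read 200 instead of 100 — the classic duplicate-processing bug.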
Frequently Asked Questions
What is the difference between event sourcing and traditional CRUD?
Traditional CRUD stores the current state of an entity. An order record has status=SHIPPED, amount=99.99. When the status changes, the row is updated in place, and the previous state is lost (unless you add audit logging). Event sourcing stores every state change as an immutable event. The order stream contains: OrderCreated(amount=99.99), PaymentReceived, ItemsPacked, OrderShipped. The current state is derived by replaying all events. No data is ever lost — you can reconstruct the state at any point in time. Advantages of event sourcing: complete audit trail (every change is recorded), temporal queries (what was the state at 3 PM?), event replay (fix a bug in projection logic, replay events through corrected code), and multiple read models (different consumers build different views from the same events). Disadvantages: increased complexity (developers must think in events, not state), eventual consistency between the event store and read models, event schema evolution (old events may not match the current schema — versioning must be handled), and harder debugging (the current state is computed, not stored directly). Use event sourcing when audit trails are required (finance, healthcare), when you need temporal queries, or when multiple services need to react to state changes.
How does the saga pattern handle distributed transactions across microservices?
A saga replaces a single distributed transaction with a sequence of local transactions coordinated by events or a central orchestrator. Each local transaction updates one service and publishes an event or notifies the orchestrator. Two implementations: Choreography-based saga — each service listens for events and reacts. Order service creates an order and emits OrderCreated. Inventory service hears it, reserves stock, emits StockReserved. Payment service hears it, charges the card, emits PaymentProcessed. Shipping service hears it, schedules delivery. No central coordinator — each service knows its part. Simple for 3-4 steps but hard to follow for complex workflows. Orchestration-based saga — a central OrderSaga orchestrator drives the process. It sends a ReserveStock command to inventory, waits for the response, sends a ChargeCard command to payment, waits for the response, then sends ScheduleDelivery to shipping. The orchestrator has a clear view of the entire workflow, making it easier to debug and extend. If any step fails, the orchestrator triggers compensating transactions in reverse order: refund payment, release stock, cancel order. Each compensating action must be idempotent because retries may occur.
How does Kafka guarantee message ordering and how do partitions affect it?
Kafka guarantees ordering within a partition but not across partitions. Each partition is an ordered, append-only log. Messages written to a partition are assigned sequential offsets (0, 1, 2, …). A consumer reading from a partition sees messages in exactly the order they were produced. Across partitions there is no ordering guarantee: if a topic has 4 partitions, messages produced at times T1 and T2 may be consumed in either order depending on which partition each was written to. The partition key determines which partition a message goes to: partition = hash(key) % num_partitions. All messages with the same key go to the same partition and are therefore ordered relative to each other. Design principle: choose a partition key that groups related messages. For an order system, use order_id as the key. All events for order-12345 (OrderCreated, PaymentReceived, ItemsShipped) go to the same partition and arrive in order. Different orders may go to different partitions — they do not need relative ordering. If you need global ordering across all messages, use a topic with a single partition. This limits throughput to one consumer per consumer group but guarantees total order.
How do you handle eventual consistency in event-driven systems?
Eventual consistency means that after a write, there is a delay before all read models reflect the change. The write side commits to the event store, but the read side (consuming events via Kafka) may lag by 100ms to several seconds. Mitigation strategies: (1) Read-your-own-writes — after a user creates an order, read from the write model (event store) for that specific user session, bypassing the eventually consistent read model. The user sees their order immediately; other users see it after the read model catches up. (2) Optimistic UI — the frontend assumes the write succeeded and shows the result immediately. If the backend rejects it (detected asynchronously), show an error and revert. Common in modern SPAs. (3) Polling or WebSocket — after a write, the frontend polls the read model or listens via WebSocket for the projected state. Display a loading indicator until the read model catches up. (4) Causal consistency — include a version or sequence number in the response to the write. The client passes this version to the read endpoint, which waits until its projection has reached that version before responding. This guarantees the read reflects at least the last write, at the cost of slightly higher read latency.