Block Storage System Low-Level Design: Volume Management, iSCSI Protocol, Snapshots, and Thin Provisioning

A block storage system exposes raw storage as fixed-size blocks (typically 4 KB or 512 bytes) arranged into logical volumes. The compute layer (virtual machine, container) uses block storage as if it were a local disk, issuing read/write commands to logical block addresses. Block storage is the foundation of cloud VM disks (AWS EBS, GCP Persistent Disk) and enterprise SAN systems.

Volume Abstraction

A logical volume presents a contiguous block address space (e.g., 0 to N-1 blocks) to the compute layer, while the storage system maps logical block addresses to physical locations on storage nodes. This indirection enables:

  • Thin provisioning: allocate physical blocks only on first write, not at volume creation.
  • Snapshots: preserve point-in-time state without duplicating all data.
  • Replication: mirror writes to multiple physical locations transparently.
  • Live migration: move volume data between storage nodes without downtime.

The mapping table (VolumeBlock) records: logical block address → (physical storage node, physical offset). For thin-provisioned volumes, unmapped logical blocks have no entry and are treated as zeros.

Thin Provisioning

At volume creation, only metadata is allocated — no physical storage blocks. On the first write to a logical block:

  1. Storage system checks VolumeBlock: no entry found (unallocated).
  2. A physical block is allocated from the free pool on the target storage node.
  3. VolumeBlock entry is created: logical_block → (physical_node, physical_offset).
  4. Data is written to the allocated physical block.

Subsequent writes to the same logical block update the data in-place; no re-allocation occurs. Reads to unallocated blocks return zeros without physical I/O.
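
The allocate-on-first-write steps above can be sketched as follows. This is a minimal in-memory illustration, not a production allocator: the ThinVolume class, its free_pool list, and the disk dictionary are hypothetical names introduced here, and a real system would persist the mapping and handle concurrency.

```python
BLOCK_SIZE = 4096
ZERO_BLOCK = bytes(BLOCK_SIZE)  # shared all-zero block for unallocated reads


class ThinVolume:
    """Hypothetical thin volume backed by one storage node's block pool."""

    def __init__(self, node: str, free_offsets: list[int]):
        self.node = node
        self.free_pool = list(free_offsets)  # unallocated physical offsets
        self.block_map: dict[int, tuple[str, int]] = {}  # logical -> (node, offset)
        self.disk: dict[int, bytes] = {}  # physical offset -> data (simulated)

    def write(self, logical_block: int, data: bytes) -> None:
        if len(data) != BLOCK_SIZE:
            raise ValueError(f"block must be {BLOCK_SIZE} bytes")
        entry = self.block_map.get(logical_block)
        if entry is None:
            # First write: take a physical block from the free pool.
            if not self.free_pool:
                raise RuntimeError("physical pool exhausted")
            entry = (self.node, self.free_pool.pop())
            self.block_map[logical_block] = entry
        # First and later writes both land on the mapped physical block.
        self.disk[entry[1]] = data

    def read(self, logical_block: int) -> bytes:
        entry = self.block_map.get(logical_block)
        if entry is None:
            return ZERO_BLOCK  # unmapped: zeros, no physical I/O
        return self.disk[entry[1]]
```

Writing the same logical block twice consumes only one entry from free_pool, which is the in-place update behaviour described above.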

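Over-provisioning risk: if many thin-provisioned volumes grow beyond their expected usage, the physical pool can be exhausted, at which point writes to unallocated blocks fail even though the volumes still report free logical space. The storage system must monitor physical utilisation and alert well before exhaustion.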
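
One way to quantify this risk is to track the overcommit ratio and physical utilisation of the pool. The sketch below is illustrative only: the pool_status function and the 80% alert threshold are assumptions made for this example, not values from any particular storage system.

```python
def pool_status(
    provisioned_bytes: list[int],
    allocated_bytes: list[int],
    physical_capacity: int,
    alert_threshold: float = 0.8,  # assumed threshold, tune per deployment
) -> dict:
    """Summarise thin-pool health from per-volume sizes."""
    # Logical space promised to volumes relative to what physically exists.
    overcommit = sum(provisioned_bytes) / physical_capacity
    # Physical space actually consumed by first writes so far.
    utilisation = sum(allocated_bytes) / physical_capacity
    return {
        "overcommit_ratio": overcommit,
        "utilisation": utilisation,
        "alert": utilisation >= alert_threshold,
    }


GIB = 2**30
status = pool_status(
    provisioned_bytes=[100 * GIB] * 5,
    allocated_bytes=[10 * GIB, 20 * GIB, 30 * GIB, 15 * GIB, 10 * GIB],
    physical_capacity=100 * GIB,
)
```

Here five 100 GiB volumes share a 100 GiB pool, giving an overcommit ratio of 5 and a utilisation of 85%, which crosses the assumed threshold.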

Copy-on-Write Snapshots

A snapshot captures the state of a volume at a point in time without copying all data. Instead, unchanged blocks are shared between the volume and the snapshot (copy-on-write semantics):

  1. Snapshot is created: snapshot metadata points to the same block map as the volume.
  2. A write arrives for logical block B:
    • Check if block B has already been CoW-copied for this snapshot. If yes, write in place.
    • If no: read original block B, write it to the snapshot's CoW store, update snapshot block map to point to the saved copy, then write new data to the volume's current block B.
  3. The snapshot now owns an immutable copy of the original block B; the volume has the new data.

Incremental snapshots: each snapshot records only the blocks that changed since the previous snapshot (the diff). Unchanged blocks are referenced via pointer to the parent snapshot. Snapshot chains allow efficient incremental backup but increase restore complexity.
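
Reading from an incremental snapshot means walking the chain until some ancestor holds the block. The following is a minimal sketch under the assumption that each snapshot stores only the blocks that changed since its parent; SnapshotNode and resolve_block are hypothetical names used for illustration.

```python
from typing import Optional


class SnapshotNode:
    """One link in a snapshot chain; holds only blocks changed since parent."""

    def __init__(self, parent: Optional["SnapshotNode"] = None):
        self.parent = parent
        self.blocks: dict[int, bytes] = {}


def resolve_block(snap: SnapshotNode, logical_block: int, block_size: int = 4096) -> bytes:
    """Return the block as seen by `snap`, searching towards the oldest ancestor."""
    node: Optional[SnapshotNode] = snap
    while node is not None:
        if logical_block in node.blocks:
            return node.blocks[logical_block]
        node = node.parent  # unchanged here, so defer to the parent
    return bytes(block_size)  # never written anywhere in the chain


base = SnapshotNode()
base.blocks = {0: b"A" * 4096, 1: b"B" * 4096}
child = SnapshotNode(parent=base)
child.blocks = {1: b"C" * 4096}  # only block 1 changed since base
```

Each lookup costs one map probe per link in the worst case, which is why long chains make restores slower and why some systems periodically merge old snapshots.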

iSCSI Protocol

iSCSI (Internet SCSI) carries SCSI commands inside iSCSI protocol data units (PDUs) over TCP, enabling block storage over standard Ethernet:

  • Initiator: the compute node (iSCSI client); discovers and connects to targets via iSCSI login.
  • Target: the storage node; exposes one or more LUNs (logical volumes) to initiators.
  • SCSI commands: read (READ(10), READ(16)) and write (WRITE(10), WRITE(16)) command descriptor blocks (CDBs) wrapped in iSCSI PDUs and transported over TCP.
  • Multi-path I/O: initiator maintains multiple TCP connections to different target ports; I/O is load-balanced across paths for performance and failover.
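
To make the command layer concrete, the sketch below builds the SCSI command descriptor blocks for READ(10) and READ(16) using the standard field layout (big-endian logical block address and transfer length). It covers only the CDB; the surrounding iSCSI PDU header, login, and session handling are omitted, and the helper names build_read10 and build_read16 are introduced here for illustration.

```python
import struct

READ_10 = 0x28  # SCSI operation code for READ(10)
READ_16 = 0x88  # SCSI operation code for READ(16)


def build_read10(lba: int, num_blocks: int) -> bytes:
    """10-byte CDB: opcode, flags, 32-bit LBA, group, 16-bit length, control."""
    if not 0 <= lba < 2**32:
        raise ValueError("READ(10) addresses at most 2**32 blocks")
    if not 0 <= num_blocks < 2**16:
        raise ValueError("READ(10) transfer length is 16 bits")
    return struct.pack(">BBIBHB", READ_10, 0, lba, 0, num_blocks, 0)


def build_read16(lba: int, num_blocks: int) -> bytes:
    """16-byte CDB: opcode, flags, 64-bit LBA, 32-bit length, group, control."""
    if not 0 <= lba < 2**64:
        raise ValueError("READ(16) LBA is 64 bits")
    if not 0 <= num_blocks < 2**32:
        raise ValueError("READ(16) transfer length is 32 bits")
    return struct.pack(">BBQIBB", READ_16, 0, lba, num_blocks, 0, 0)


cdb = build_read10(lba=2048, num_blocks=8)
```

The 32-bit address field is why READ(10) cannot reach beyond 2 TiB on a volume with 512-byte blocks, and why larger volumes need the 16-byte variants.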

Replication

Synchronous replication: write is committed to the primary and all replicas before ACK is returned. Zero RPO. Every write pays at least one network round trip to the slowest replica, so latency grows with distance. Suitable for same-datacenter or metro-area deployments.

Asynchronous replication: write is committed to the primary, ACK returned, and replication happens in the background. Low write latency but non-zero RPO (data loss up to the replication lag on failure). Suitable for cross-region disaster recovery replicas.
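
The difference between the two modes comes down to when the write returns relative to replica updates. The following in-memory sketch illustrates that ordering only; the ReplicatedVolume class and its drain method are hypothetical, and real systems add networking, ordering guarantees, and failure handling.

```python
from collections import deque


class ReplicatedVolume:
    """Toy primary with N replicas, in synchronous or asynchronous mode."""

    def __init__(self, replica_count: int, synchronous: bool):
        self.primary: dict[int, bytes] = {}
        self.replicas: list[dict[int, bytes]] = [{} for _ in range(replica_count)]
        self.synchronous = synchronous
        self.pending: deque = deque()  # async only: writes not yet replicated

    def write(self, block: int, data: bytes) -> None:
        """Returning from this method stands in for the ACK to the client."""
        self.primary[block] = data
        if self.synchronous:
            # Every replica is updated before the ACK: zero RPO.
            for replica in self.replicas:
                replica[block] = data
        else:
            # ACK immediately; replicas catch up later.
            self.pending.append((block, data))

    def drain(self) -> None:
        """Background replication step for asynchronous mode."""
        while self.pending:
            block, data = self.pending.popleft()
            for replica in self.replicas:
                replica[block] = data

    def unreplicated_writes(self) -> int:
        """Writes that would be lost if the primary failed right now."""
        return len(self.pending)
```

In asynchronous mode, unreplicated_writes() is a direct view of the RPO exposure: anything still in the queue when the primary fails is lost.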

RAID Levels

  • RAID-0 (striping): data striped across N disks; bandwidth scales as N*disk_bandwidth; no redundancy — any disk failure causes data loss. Used for scratch space or in-volume striping above a replicated layer.
  • RAID-1 (mirroring): each block written to 2 disks; 2x write overhead; any single disk failure tolerated; read bandwidth can scale to 2x with parallel reads.
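  • RAID-5 (striping + distributed parity): data and parity striped across N disks; tolerates 1 disk failure; parity consumes one disk's worth of capacity (1/N of the raw total); small writes incur write amplification because updating parity requires a read-modify-write of the old data and old parity.
  • RAID-6: like RAID-5 but with two independent parity blocks per stripe, distributed across the disks; parity consumes two disks' worth of capacity; tolerates 2 simultaneous disk failures.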
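
RAID-5 parity is the bytewise XOR of the data blocks in a stripe, which is what allows any single missing block to be rebuilt from the survivors. The sketch below shows this on tiny two-byte blocks; xor_blocks and raid5_usable_fraction are illustrative helper names, and real arrays also rotate the parity position across stripes.

```python
from functools import reduce


def xor_blocks(blocks: list[bytes]) -> bytes:
    """Bytewise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def raid5_usable_fraction(disk_count: int) -> float:
    """Fraction of raw capacity left for data: one disk's worth goes to parity."""
    return (disk_count - 1) / disk_count


# One stripe across three data disks plus parity.
data = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
parity = xor_blocks(data)

# Disk 1 fails: XOR the surviving data blocks with parity to rebuild it.
rebuilt = xor_blocks([data[0], data[2], parity])
```

The same identity explains the small-write penalty: changing one data block means the new parity is old_parity XOR old_data XOR new_data, so the old data and old parity must be read before the write.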

SQL DDL

CREATE TABLE Volume (
    id                BIGSERIAL PRIMARY KEY,
    name              VARCHAR(255)  NOT NULL UNIQUE,
    size_bytes        BIGINT        NOT NULL,
    provisioned_bytes BIGINT        NOT NULL DEFAULT 0,  -- thin: allocated so far
    snapshot_id       BIGINT,       -- active snapshot (FK to Snapshot)
    status            VARCHAR(32)   NOT NULL DEFAULT 'available',
    created_at        TIMESTAMPTZ   NOT NULL DEFAULT now()
);

-- Block address map (sparse for thin-provisioned volumes)
CREATE TABLE VolumeBlock (
    volume_id       BIGINT        NOT NULL REFERENCES Volume(id),
    logical_block   BIGINT        NOT NULL,
    physical_node   VARCHAR(128)  NOT NULL,
    physical_offset BIGINT        NOT NULL,
    PRIMARY KEY (volume_id, logical_block)
);

CREATE TABLE Snapshot (
    id                BIGSERIAL PRIMARY KEY,
    volume_id         BIGINT        NOT NULL REFERENCES Volume(id),
    parent_snapshot_id BIGINT       REFERENCES Snapshot(id),
    created_at        TIMESTAMPTZ   NOT NULL DEFAULT now(),
    size_bytes        BIGINT        NOT NULL DEFAULT 0
);

Python: Core Operations


BLOCK_SIZE = 4096  # 4 KB blocks

# In-memory simulation: volume_id -> {logical_block -> data}
_volumes: dict[int, dict] = {}
_snapshots: dict[int, dict] = {}  # snapshot_id -> {logical_block -> data}
_snapshot_of: dict[int, int] = {}  # snapshot_id -> volume_id

def create_volume(name: str, size_gb: int, thin: bool = True) -> int:
    """Create a thin or thick provisioned volume."""
    vol_id = len(_volumes) + 1
    if thin:
        _volumes[vol_id] = {}  # empty map; physical blocks allocated on first write
    else:
        block_count = (size_gb * 1024 * 1024 * 1024) // BLOCK_SIZE
        _volumes[vol_id] = {i: b'\x00' * BLOCK_SIZE for i in range(block_count)}
    print(f"Volume {vol_id} ({name}, {size_gb}GB, thin={thin}) created")
    return vol_id

def create_snapshot(volume_id: int) -> int:
    """Create a CoW snapshot of a volume."""
    snap_id = len(_snapshots) + 1
    # Snapshot starts as a shallow copy of the block map (CoW: blocks shared)
    _snapshots[snap_id] = dict(_volumes[volume_id])  # copy references, not data
    _snapshot_of[snap_id] = volume_id
    print(f"Snapshot {snap_id} created for volume {volume_id}")
    return snap_id

def read_block(volume_id: int, logical_block: int) -> bytes:
    """Read a block from a volume. Returns zeros for unallocated thin blocks."""
    return _volumes.get(volume_id, {}).get(logical_block, b'\x00' * BLOCK_SIZE)

def write_block(volume_id: int, logical_block: int, data: bytes) -> None:
    """Write a block, performing CoW for any active snapshot."""
    assert len(data) == BLOCK_SIZE, f"Block must be {BLOCK_SIZE} bytes"
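    # CoW: a snapshot of this volume that has no entry for the block yet
    # (it was unallocated at snapshot time) must record the original
    # contents, which are zeros, before the volume block is overwritten.
    # Blocks that existed at snapshot time are already referenced by the
    # snapshot's shallow copy, and bytes objects are immutable.
    for snap_id, snap_vol_id in _snapshot_of.items():
        if snap_vol_id == volume_id:
            if logical_block not in _snapshots[snap_id]:
                original = _volumes.get(volume_id, {}).get(logical_block, b'\x00' * BLOCK_SIZE)
                _snapshots[snap_id][logical_block] = original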
    # Write new data to volume
    if volume_id not in _volumes:
        _volumes[volume_id] = {}
    _volumes[volume_id][logical_block] = data

Design Considerations Summary

  • Thin vs thick provisioning: thin for flexibility and storage efficiency; thick for guaranteed performance and no allocation latency.
  • CoW snapshot overhead: first write to any snapshotted block incurs extra read + write; deep snapshot chains compound this effect.
  • iSCSI vs NVMe-oF: iSCSI on commodity Ethernet for cost-effective SAN; NVMe-oF with RDMA for near-local NVMe latency in high-performance clusters.
  • Synchronous vs async replication: synchronous for zero RPO in same DC; async for geo-distributed DR with acceptable RPO.
  • RAID: RAID-5/6 for storage efficiency with parity; RAID-1 for simplicity and fast degraded reads; avoid RAID-5 with large HDDs, because a second disk failure or an unrecoverable read error during the long rebuild loses the array.
