Block Storage System Low-Level Design: Volume Management, iSCSI Protocol, Snapshots, and Thin Provisioning

{
"@context": "https://schema.org",
"@type": "FAQPage",
"mainEntity": [
{
"@type": "Question",
"name": "What is the difference between thin provisioning and thick provisioning?",
"acceptedAnswer": {
"@type": "Answer",
"text": "Thick provisioning allocates all physical storage for a volume immediately at creation time. A 1 TB thick-provisioned volume consumes 1 TB of physical storage even if only 10 GB of data has been written. This guarantees performance (no allocation latency on first write) and avoids capacity surprises, but wastes storage for sparsely used volumes. Thin provisioning allocates physical blocks only as data is written for the first time. A 1 TB thin-provisioned volume might initially consume only a few megabytes of physical storage. The trade-off is that a first write to an unallocated block incurs an allocation overhead, and capacity planning is more complex — if many thin-provisioned volumes grow unexpectedly, the physical pool can run out of space (over-provisioning risk)."
}
},
{
"@type": "Question",
"name": "How does copy-on-write snapshot work and what is its performance overhead?",
"acceptedAnswer": {
"@type": "Answer",
"text": "Copy-on-write (CoW) snapshots preserve the original data when a write occurs on a snapshotted volume. When a block is about to be overwritten, the storage system first reads the original block, copies it to the snapshot store, updates the snapshot's block map to point to the saved copy, and then writes new data to the original location. This means the first write to any snapshotted block incurs a read-then-write overhead (the 'write amplification' of CoW). Subsequent writes to already-snapshotted blocks do not incur copy cost because the original has already been saved. Snapshot chains multiply this effect: if a block has been overwritten N times, restoring the oldest snapshot requires reading N CoW copies in order. This is why deep snapshot chains degrade both write performance and restore time."
}
},
{
"@type": "Question",
"name": "How does iSCSI work and how does it compare to NVMe-oF?",
"acceptedAnswer": {
"@type": "Answer",
"text": "iSCSI (Internet Small Computer System Interface) encapsulates SCSI block storage commands in TCP/IP packets. The compute node (initiator) connects to the storage node (target) over a standard TCP connection, discovers LUNs (logical unit numbers), and sends SCSI read/write commands as if talking to a local disk. iSCSI works over standard Ethernet hardware, making it cost-effective but limited by TCP overhead and latency. NVMe-oF (NVMe over Fabrics) exposes NVMe command queues over a high-speed fabric (RDMA via RoCE or InfiniBand, Fibre Channel, or TCP). NVMe-oF achieves near-local NVMe latency (sub-100 microsecond) because RDMA bypasses the OS kernel network stack. The trade-off is that NVMe-oF requires specialized fabric hardware (RDMA NICs), while iSCSI runs on commodity Ethernet."
}
},
{
"@type": "Question",
"name": "When should you use synchronous versus asynchronous replication for block storage?",
"acceptedAnswer": {
"@type": "Answer",
"text": "Synchronous replication waits for the write to be acknowledged by all replicas before returning success to the caller. This guarantees zero RPO (recovery point objective — no data loss) but adds latency equal to the round-trip time to the farthest replica. Synchronous replication is appropriate for databases and financial workloads where data loss is unacceptable and replicas are in the same data center or metro area (low RTT). Asynchronous replication acknowledges the write after the primary stores it and then replicates in the background. This does not block writes and has lower latency, but creates an RPO gap — data written after the last replicated point is lost if the primary fails. Async is appropriate for disaster recovery replicas in geographically distant regions where synchronous latency would be prohibitive."
}
}
]
}

A block storage system exposes raw storage as fixed-size blocks (typically 4 KB or 512 bytes) arranged into logical volumes. The compute layer (virtual machine, container) uses block storage as if it were a local disk, issuing read/write commands to logical block addresses. Block storage is the foundation of cloud VM disks (AWS EBS, GCP Persistent Disk) and enterprise SAN systems.

Volume Abstraction

A logical volume presents a contiguous block address space (e.g., 0 to N-1 blocks) to the compute layer, while the storage system maps logical block addresses to physical locations on storage nodes. This indirection enables:

  • Thin provisioning: allocate physical blocks only on first write, not at volume creation.
  • Snapshots: preserve point-in-time state without duplicating all data.
  • Replication: mirror writes to multiple physical locations transparently.
  • Live migration: move volume data between storage nodes without downtime.

The mapping table (VolumeBlock) records: logical block address → (physical storage node, physical offset). For thin-provisioned volumes, unmapped logical blocks have no entry and are treated as zeros.
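The indirection above can be sketched as a sparse in-memory map. A minimal sketch, assuming illustrative names (`BlockMap`, `PhysicalLocation`) that are not taken from any particular system:

```python
from dataclasses import dataclass
from typing import Optional

BLOCK_SIZE = 4096

@dataclass(frozen=True)
class PhysicalLocation:
    node: str     # storage node identifier
    offset: int   # byte offset on that node

class BlockMap:
    """Sparse logical-to-physical map; absent entries read as zeros (thin)."""
    def __init__(self) -> None:
        self._map: dict[int, PhysicalLocation] = {}

    def lookup(self, logical_block: int) -> Optional[PhysicalLocation]:
        return self._map.get(logical_block)

    def map_block(self, logical_block: int, loc: PhysicalLocation) -> None:
        self._map[logical_block] = loc

bm = BlockMap()
bm.map_block(7, PhysicalLocation("node-a", 7 * BLOCK_SIZE))
print(bm.lookup(7))   # mapped entry
print(bm.lookup(8))   # unmapped -> None, treated as zeros
```

Because the map is sparse, a freshly created thin volume costs only the metadata of an empty dictionary, regardless of its logical size.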

Thin Provisioning

At volume creation, only metadata is allocated — no physical storage blocks. On the first write to a logical block:

  1. Storage system checks VolumeBlock: no entry found (unallocated).
  2. A physical block is allocated from the free pool on the target storage node.
  3. VolumeBlock entry is created: logical_block → (physical_node, physical_offset).
  4. Data is written to the allocated physical block.

Subsequent writes to the same logical block update the data in-place; no re-allocation occurs. Reads to unallocated blocks return zeros without physical I/O.

Over-provisioning risk: if many thin-provisioned volumes grow beyond their expected usage, the physical pool can be exhausted. The storage system must monitor physical utilization and alert well before the pool runs out.
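The four-step first-write path above can be sketched as follows, with a shared free pool of physical offsets. This is a toy model with illustrative names; a real allocator would be persistent, concurrent-safe, and node-aware:

```python
BLOCK_SIZE = 4096

class ThinVolume:
    """Thin-provisioning sketch: physical blocks are taken from a shared
    free pool only on the first write to a logical block."""
    def __init__(self, free_pool: list[int]) -> None:
        self.free_pool = free_pool            # physical offsets still available
        self.block_map: dict[int, int] = {}   # logical block -> physical offset
        self.physical: dict[int, bytes] = {}  # simulated physical store

    def write(self, logical_block: int, data: bytes) -> None:
        if logical_block not in self.block_map:        # 1. unallocated?
            if not self.free_pool:
                raise RuntimeError("physical pool exhausted (over-provisioning)")
            phys = self.free_pool.pop()                # 2. allocate from pool
            self.block_map[logical_block] = phys       # 3. record the mapping
        self.physical[self.block_map[logical_block]] = data  # 4. write data

    def read(self, logical_block: int) -> bytes:
        phys = self.block_map.get(logical_block)
        return self.physical[phys] if phys is not None else b"\x00" * BLOCK_SIZE

vol = ThinVolume(free_pool=[0, BLOCK_SIZE, 2 * BLOCK_SIZE])
vol.write(100, b"\x01" * BLOCK_SIZE)   # first write: allocates a physical block
vol.write(100, b"\x02" * BLOCK_SIZE)   # in-place update: no new allocation
print(len(vol.free_pool))              # prints 2 -- only one block consumed
```

Note that the second write consumes no additional pool capacity, and a read of a never-written logical block returns zeros without touching physical storage.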

Copy-on-Write Snapshots

A snapshot captures the state of a volume at a point in time without copying all data. Instead, unchanged blocks are shared between the volume and the snapshot (copy-on-write semantics):

  1. Snapshot is created: snapshot metadata points to the same block map as the volume.
  2. A write arrives for logical block B:
    • Check if block B has already been CoW-copied for this snapshot. If yes, write in place.
    • If no: read original block B, write it to the snapshot's CoW store, update snapshot block map to point to the saved copy, then write new data to the volume's current block B.
  3. The snapshot now owns an immutable copy of the original block B; the volume has the new data.

Incremental snapshots: each snapshot records only the blocks that changed since the previous snapshot (the diff). Unchanged blocks are referenced via pointer to the parent snapshot. Snapshot chains allow efficient incremental backup but increase restore complexity.
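The incremental model can be sketched minimally: each snapshot stores only its diff plus a parent pointer, and a read walks the chain toward the oldest ancestor until some snapshot owns the block. Names here are illustrative:

```python
from typing import Optional

class Snapshot:
    """Incremental snapshot sketch: each snapshot holds only the blocks that
    changed since its parent; unchanged blocks are reached via the parent."""
    def __init__(self, parent: "Optional[Snapshot]" = None) -> None:
        self.parent = parent
        self.blocks: dict[int, bytes] = {}  # only the diff vs. parent

    def read(self, logical_block: int) -> Optional[bytes]:
        snap: Optional[Snapshot] = self
        while snap is not None:
            if logical_block in snap.blocks:
                return snap.blocks[logical_block]
            snap = snap.parent    # fall through to the older snapshot
        return None               # never captured: zeros / base image

base = Snapshot()
base.blocks[0] = b"v1"
child = Snapshot(parent=base)
child.blocks[1] = b"v2"                 # block 0 unchanged: shared via parent
print(child.read(0), child.read(1))     # prints b'v1' b'v2'
```

The chain walk is what makes deep snapshot chains expensive: a read that misses every recent diff traverses the whole ancestry before resolving.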

iSCSI Protocol

iSCSI (Internet SCSI) encapsulates SCSI commands in TCP packets, enabling block storage over standard Ethernet:

  • Initiator: the compute node (iSCSI client); discovers and connects to targets via iSCSI login.
  • Target: the storage node; exposes one or more LUNs (logical volumes) to initiators.
  • SCSI PDUs: read (READ(10), READ(16)) and write (WRITE(10), WRITE(16)) commands wrapped in iSCSI PDUs transported over TCP.
  • Multi-path I/O: initiator maintains multiple TCP connections to different target ports; I/O is load-balanced across paths for performance and failover.
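Multipath failover can be sketched as round-robin dispatch over a list of paths, skipping any path that errors. This is a toy model (paths are plain callables); a real MPIO layer also tracks path health and inspects SCSI status codes:

```python
class MultipathIO:
    """Round-robin multipath sketch: spread commands across target paths
    and fail over to the next path when one raises a connection error."""
    def __init__(self, paths) -> None:
        self.paths = list(paths)   # each path: callable(cmd) -> result
        self._next = 0

    def submit(self, cmd):
        for _ in range(len(self.paths)):
            path = self.paths[self._next]
            self._next = (self._next + 1) % len(self.paths)
            try:
                return path(cmd)
            except ConnectionError:
                continue           # path down: try the next one
        raise ConnectionError("all paths failed")

def healthy_path(cmd):
    return f"ok:{cmd}"

def dead_path(cmd):
    raise ConnectionError

mp = MultipathIO([dead_path, healthy_path])
print(mp.submit("READ lba=0"))   # prints ok:READ lba=0 (failed over)
```

In the healthy case the same loop simply alternates paths, which is where the load-balancing benefit comes from.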

Replication

Synchronous replication: write is committed to the primary and all replicas before ACK is returned. Zero RPO. Adds latency proportional to round-trip time to the farthest replica. Suitable for same-datacenter or metro-area deployments.

Asynchronous replication: write is committed to the primary, ACK returned, and replication happens in the background. Low write latency but non-zero RPO (data loss up to the replication lag on failure). Suitable for cross-region disaster recovery replicas.
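The latency trade-off between the two modes reduces to a back-of-the-envelope model: sync pays for the slowest replica on every write, async pays nothing up front. The numbers below are hypothetical:

```python
def ack_latency_sync(local_write_ms: float, replica_rtts_ms: list[float]) -> float:
    """Sync: ack only after the farthest replica confirms (zero RPO)."""
    return local_write_ms + max(replica_rtts_ms)

def ack_latency_async(local_write_ms: float, replica_rtts_ms: list[float]) -> float:
    """Async: ack right after the primary commit; replication lags behind."""
    return local_write_ms

# Primary write 0.5 ms; metro replica 2 ms RTT; cross-region replica 60 ms RTT
print(ack_latency_sync(0.5, [2.0, 60.0]))   # prints 60.5 -- far replica dominates
print(ack_latency_async(0.5, [2.0, 60.0]))  # prints 0.5 -- but with an RPO gap
```

The model makes the deployment guidance concrete: dropping the 60 ms cross-region replica from the synchronous set (metro-only sync, cross-region async) brings the ack latency down to 2.5 ms.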

RAID Levels

  • RAID-0 (striping): data striped across N disks; bandwidth scales as N*disk_bandwidth; no redundancy — any disk failure causes data loss. Used for scratch space or in-volume striping above a replicated layer.
  • RAID-1 (mirroring): each block written to 2 disks; 2x write overhead; any single disk failure tolerated; read bandwidth can scale to 2x with parallel reads.
  • RAID-5 (striping + distributed parity): data and parity striped across N disks; tolerates 1 disk failure; parity overhead is one disk's worth of capacity (a fraction 1/N of the array); write amplification for parity updates (read-modify-write per stripe on small writes).
  • RAID-6: like RAID-5 but with 2 parity disks; tolerates 2 simultaneous disk failures.
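The capacity overheads above reduce to a simple usable-fraction calculation. A sketch, assuming N equal-sized disks:

```python
def usable_fraction(raid: str, n_disks: int) -> float:
    """Usable capacity fraction of an array of n_disks equal disks."""
    if raid == "raid0":
        return 1.0                       # all capacity, no redundancy
    if raid == "raid1":
        return 0.5                       # every block mirrored once
    if raid == "raid5":
        return (n_disks - 1) / n_disks   # one disk's worth of parity
    if raid == "raid6":
        return (n_disks - 2) / n_disks   # two disks' worth of parity
    raise ValueError(f"unknown RAID level: {raid}")

print(usable_fraction("raid5", 8))   # prints 0.875
print(usable_fraction("raid6", 8))   # prints 0.75
```

The parity schemes grow more efficient with wider stripes (RAID-6 over 16 disks yields 87.5% usable), at the cost of longer rebuilds when a disk fails.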

SQL DDL

CREATE TABLE Volume (
    id                BIGSERIAL PRIMARY KEY,
    name              VARCHAR(255)  NOT NULL UNIQUE,
    size_bytes        BIGINT        NOT NULL,
    provisioned_bytes BIGINT        NOT NULL DEFAULT 0,  -- thin: allocated so far
    snapshot_id       BIGINT,       -- active snapshot (FK to Snapshot)
    status            VARCHAR(32)   NOT NULL DEFAULT 'available',
    created_at        TIMESTAMPTZ   NOT NULL DEFAULT now()
);

-- Block address map (sparse for thin-provisioned volumes)
CREATE TABLE VolumeBlock (
    volume_id       BIGINT        NOT NULL REFERENCES Volume(id),
    logical_block   BIGINT        NOT NULL,
    physical_node   VARCHAR(128)  NOT NULL,
    physical_offset BIGINT        NOT NULL,
    PRIMARY KEY (volume_id, logical_block)
);

CREATE TABLE Snapshot (
    id                BIGSERIAL PRIMARY KEY,
    volume_id         BIGINT        NOT NULL REFERENCES Volume(id),
    parent_snapshot_id BIGINT       REFERENCES Snapshot(id),
    created_at        TIMESTAMPTZ   NOT NULL DEFAULT now(),
    size_bytes        BIGINT        NOT NULL DEFAULT 0
);

Python: Core Operations


BLOCK_SIZE = 4096  # 4 KB blocks

# In-memory simulation: volume_id -> {logical_block -> data}
_volumes: dict[int, dict] = {}
_snapshots: dict[int, dict] = {}  # snapshot_id -> {logical_block -> data}
_snapshot_of: dict[int, int] = {}  # snapshot_id -> volume_id

def create_volume(name: str, size_gb: int, thin: bool = True) -> int:
    """Create a thin or thick provisioned volume."""
    vol_id = len(_volumes) + 1
    if thin:
        _volumes[vol_id] = {}  # empty map; physical blocks allocated on first write
    else:
        block_count = (size_gb * 1024 * 1024 * 1024) // BLOCK_SIZE
        _volumes[vol_id] = {i: b'\x00' * BLOCK_SIZE for i in range(block_count)}
    print(f"Volume {vol_id} ({name}, {size_gb}GB, thin={thin}) created")
    return vol_id

def create_snapshot(volume_id: int) -> int:
    """Create a CoW snapshot of a volume."""
    snap_id = len(_snapshots) + 1
    # Snapshot starts as a shallow copy of the block map (CoW: blocks shared)
    _snapshots[snap_id] = dict(_volumes[volume_id])  # copy references, not data
    _snapshot_of[snap_id] = volume_id
    print(f"Snapshot {snap_id} created for volume {volume_id}")
    return snap_id

def read_block(volume_id: int, logical_block: int) -> bytes:
    """Read a block from a volume. Returns zeros for unallocated thin blocks."""
    return _volumes.get(volume_id, {}).get(logical_block, b'\x00' * BLOCK_SIZE)

def write_block(volume_id: int, logical_block: int, data: bytes) -> None:
    """Write a block, performing CoW for any active snapshot."""
    assert len(data) == BLOCK_SIZE, f"Block must be {BLOCK_SIZE} bytes"
    # CoW: if any snapshot references the current block, save original first
    for snap_id, snap_vol_id in _snapshot_of.items():
        if snap_vol_id == volume_id:
            if logical_block not in _snapshots[snap_id]:
                # Save original block into snapshot before overwriting
                original = _volumes.get(volume_id, {}).get(logical_block, b'\x00' * BLOCK_SIZE)
                _snapshots[snap_id][logical_block] = original
    # Write new data to volume
    if volume_id not in _volumes:
        _volumes[volume_id] = {}
    _volumes[volume_id][logical_block] = data

Design Considerations Summary

  • Thin vs thick provisioning: thin for flexibility and storage efficiency; thick for guaranteed performance and no allocation latency.
  • CoW snapshot overhead: first write to any snapshotted block incurs extra read + write; deep snapshot chains compound this effect.
  • iSCSI vs NVMe-oF: iSCSI on commodity Ethernet for cost-effective SAN; NVMe-oF with RDMA for near-local NVMe latency in high-performance clusters.
  • Synchronous vs async replication: synchronous for zero RPO in same DC; async for geo-distributed DR with acceptable RPO.
  • RAID: RAID-5/6 for storage efficiency with parity; RAID-1 for simplicity and fast degraded reads; avoid RAID-5 with large HDDs due to high rebuild failure probability.


