A block storage system exposes raw storage as fixed-size blocks (typically 4 KB or 512 bytes) arranged into logical volumes. The compute layer (virtual machine, container) uses block storage as if it were a local disk, issuing read/write commands to logical block addresses. Block storage is the foundation of cloud VM disks (AWS EBS, GCP Persistent Disk) and enterprise SAN systems.
Volume Abstraction
A logical volume presents a contiguous block address space (e.g., 0 to N-1 blocks) to the compute layer, while the storage system maps logical block addresses to physical locations on storage nodes. This indirection enables:
- Thin provisioning: allocate physical blocks only on first write, not at volume creation.
- Snapshots: preserve point-in-time state without duplicating all data.
- Replication: mirror writes to multiple physical locations transparently.
- Live migration: move volume data between storage nodes without downtime.
The mapping table (VolumeBlock) records: logical block address → (physical storage node, physical offset). For thin-provisioned volumes, unmapped logical blocks have no entry and are treated as zeros.
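The lookup described above can be sketched as a sparse dictionary. This is a minimal in-memory illustration, not how any particular storage system implements it; `PhysicalLocation`, `block_map`, `resolve`, and the node name `node-a` are hypothetical names introduced for this example.

```python
from typing import NamedTuple, Optional


class PhysicalLocation(NamedTuple):
    node: str    # hypothetical storage node identifier
    offset: int  # byte offset of the block on that node


# Sparse map: only logical blocks that have been written get an entry.
block_map: dict[tuple[int, int], PhysicalLocation] = {}


def resolve(volume_id: int, logical_block: int) -> Optional[PhysicalLocation]:
    """Return the physical location, or None if the block is unmapped."""
    return block_map.get((volume_id, logical_block))


# Map logical block 42 of volume 1; block 43 stays unmapped.
block_map[(1, 42)] = PhysicalLocation("node-a", 8192)
```

A caller that gets `None` back would treat the block as all zeros, which is what keeps thin-provisioned volumes sparse.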
Thin Provisioning
At volume creation, only metadata is allocated — no physical storage blocks. On the first write to a logical block:
- Storage system checks VolumeBlock: no entry found (unallocated).
- A physical block is allocated from the free pool on the target storage node.
- VolumeBlock entry is created: logical_block → (physical_node, physical_offset).
- Data is written to the allocated physical block.
Subsequent writes to the same logical block update the data in place; no re-allocation occurs. Reads of unallocated blocks return zeros without physical I/O.
Over-provisioning risk: if many thin-provisioned volumes grow beyond their expected usage, the physical pool can be exhausted even though no single volume has exceeded its logical size. The storage system must monitor physical utilisation and alert before exhaustion.
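The allocate-on-first-write steps and the utilisation check might look like the sketch below. It assumes a single shared pool of physical blocks tracked in memory; the `ThinPool` class, its method names, and the 0.8 alert threshold are illustrative assumptions, not values taken from a real system.

```python
BLOCK_SIZE = 4096


class ThinPool:
    """Toy physical pool shared by thin-provisioned volumes (illustrative only)."""

    def __init__(self, physical_blocks: int, alert_threshold: float = 0.8):
        self.total = physical_blocks
        self.free = list(range(physical_blocks))       # free physical block indices
        self.alert_threshold = alert_threshold         # assumed threshold
        self.block_map: dict[tuple[int, int], int] = {}  # (volume, logical) -> physical
        self.data: dict[int, bytes] = {}               # physical index -> contents

    def write(self, volume_id: int, logical_block: int, payload: bytes) -> None:
        key = (volume_id, logical_block)
        if key not in self.block_map:                  # first write: allocate
            if not self.free:
                raise RuntimeError("physical pool exhausted")
            self.block_map[key] = self.free.pop()
        self.data[self.block_map[key]] = payload       # later writes: in place

    def read(self, volume_id: int, logical_block: int) -> bytes:
        phys = self.block_map.get((volume_id, logical_block))
        # Unallocated blocks read as zeros without touching physical storage.
        return self.data[phys] if phys is not None else bytes(BLOCK_SIZE)

    def utilisation(self) -> float:
        return 1 - len(self.free) / self.total

    def needs_alert(self) -> bool:
        return self.utilisation() >= self.alert_threshold
```

In this sketch, exhaustion surfaces as an exception on the write path, which is exactly the situation the alert is meant to prevent.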
Copy-on-Write Snapshots
A snapshot captures the state of a volume at a point in time without copying all data. Instead, unchanged blocks are shared between the volume and the snapshot (copy-on-write semantics):
- Snapshot is created: snapshot metadata points to the same block map as the volume.
- A write arrives for logical block B:
  - Check if block B has already been CoW-copied for this snapshot. If yes, write in place.
  - If no: read original block B, write it to the snapshot's CoW store, update snapshot block map to point to the saved copy, then write new data to the volume's current block B.
- The snapshot now owns an immutable copy of the original block B; the volume has the new data.
Incremental snapshots: each snapshot records only the blocks that changed since the previous snapshot (the diff). Unchanged blocks are referenced via pointer to the parent snapshot. Snapshot chains allow efficient incremental backup but increase restore complexity.
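To make the restore cost of a chain concrete, here is a minimal sketch of reading one block from an incremental snapshot by walking parent pointers. The `Snapshot` class, `read_from_snapshot`, and `chain_depth` are hypothetical names for this example; a real system would store block references rather than raw bytes.

```python
from typing import Optional


class Snapshot:
    """An incremental snapshot: only blocks changed since the parent."""

    def __init__(self, changed: dict[int, bytes],
                 parent: Optional["Snapshot"] = None):
        self.changed = changed
        self.parent = parent


def read_from_snapshot(snap: Snapshot, logical_block: int,
                       block_size: int = 4096) -> bytes:
    """Walk the chain from newest to oldest until the block is found."""
    node: Optional[Snapshot] = snap
    while node is not None:
        if logical_block in node.changed:
            return node.changed[logical_block]
        node = node.parent
    return bytes(block_size)  # never written in any ancestor: zeros


def chain_depth(snap: Snapshot) -> int:
    """Number of snapshots a worst-case read may have to visit."""
    depth = 0
    node: Optional[Snapshot] = snap
    while node is not None:
        depth += 1
        node = node.parent
    return depth


base = Snapshot({0: b"a", 1: b"b"})
incremental = Snapshot({1: b"c"}, parent=base)
```

Reading block 0 from `incremental` falls through to `base`, while block 1 is served from the diff. The worst-case lookup cost grows with `chain_depth`, which is why long chains are often consolidated.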
iSCSI Protocol
iSCSI (Internet SCSI) encapsulates SCSI commands in TCP packets, enabling block storage over standard Ethernet:
- Initiator: the compute node (iSCSI client); discovers and connects to targets via iSCSI login.
- Target: the storage node; exposes one or more LUNs (logical volumes) to initiators.
- SCSI commands: read (READ(10), READ(16)) and write (WRITE(10), WRITE(16)) commands are wrapped in iSCSI PDUs and transported over TCP.
- Multi-path I/O: the initiator maintains multiple paths to the target, typically separate iSCSI sessions over different network interfaces or target ports; I/O is load-balanced across paths for performance and fails over to a surviving path when one drops.
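As an illustration of what gets wrapped in an iSCSI PDU, the sketch below builds the SCSI command descriptor blocks (CDBs) for READ(10) and READ(16) with the standard library's `struct` module. It covers only the CDB bytes, with flag, group, and control fields left at zero; the iSCSI header, login, and TCP transport are out of scope, and the function names are hypothetical.

```python
import struct


def build_read10(lba: int, num_blocks: int) -> bytes:
    """READ(10): opcode 0x28, 32-bit LBA, 16-bit transfer length (10 bytes)."""
    if not 0 <= lba < 2**32 or not 0 <= num_blocks < 2**16:
        raise ValueError("READ(10) needs a 32-bit LBA and 16-bit length")
    # opcode, flags, LBA, group number, transfer length, control (big-endian)
    return struct.pack(">BBIBHB", 0x28, 0, lba, 0, num_blocks, 0)


def build_read16(lba: int, num_blocks: int) -> bytes:
    """READ(16): opcode 0x88, 64-bit LBA, 32-bit transfer length (16 bytes)."""
    if not 0 <= lba < 2**64 or not 0 <= num_blocks < 2**32:
        raise ValueError("READ(16) needs a 64-bit LBA and 32-bit length")
    # opcode, flags, LBA, transfer length, group number, control (big-endian)
    return struct.pack(">BBQIBB", 0x88, 0, lba, num_blocks, 0, 0)
```

The 32-bit LBA in READ(10) limits addressing to 2^32 blocks (2 TiB with 512-byte blocks), which is why larger volumes need the 16-byte variants.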
Replication
Synchronous replication: a write is committed to the primary and all replicas before the ACK is returned. Zero RPO. Every write waits at least one round trip to the slowest replica, so added latency grows with distance. Suitable for same-datacenter or metro-area deployments.
Asynchronous replication: write is committed to the primary, ACK returned, and replication happens in the background. Low write latency but non-zero RPO (data loss up to the replication lag on failure). Suitable for cross-region disaster recovery replicas.
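The latency and RPO trade-off can be put into rough numbers with a simple model. The sketch below assumes replica writes are issued in parallel with the local commit and that every replica takes the same time to commit; the function names and the figures in the usage note are illustrative assumptions, not measurements.

```python
def sync_ack_latency_ms(local_commit_ms: float,
                        replica_rtts_ms: list[float],
                        replica_commit_ms: float) -> float:
    """ACK waits for the local commit and for the slowest replica."""
    slowest_replica = max(
        (rtt + replica_commit_ms for rtt in replica_rtts_ms), default=0.0
    )
    return max(local_commit_ms, slowest_replica)


def async_ack_latency_ms(local_commit_ms: float) -> float:
    """ACK is returned as soon as the primary has committed."""
    return local_commit_ms


def async_worst_case_loss_bytes(write_rate_bytes_per_s: int,
                                replication_lag_s: int) -> int:
    """Data not yet replicated if the primary fails right now."""
    return write_rate_bytes_per_s * replication_lag_s
```

With a 0.5 ms local commit and replicas at 1 ms and 2 ms round-trip time, the synchronous ACK takes 2.5 ms in this model against 0.5 ms asynchronously, while a 3-second replication lag at 10 MB/s leaves up to 30 MB at risk.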
RAID Levels
- RAID-0 (striping): data striped across N disks; bandwidth scales as N*disk_bandwidth; no redundancy — any disk failure causes data loss. Used for scratch space or in-volume striping above a replicated layer.
- RAID-1 (mirroring): each block written to 2 disks; 2x write overhead; any single disk failure tolerated; read bandwidth can scale to 2x with parallel reads.
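- RAID-5 (striping + distributed parity): data and parity striped across N disks; tolerates 1 disk failure; storage overhead is one disk's worth of capacity (1/N of the raw total); small writes incur write amplification because each parity update is a read-modify-write (read old data and old parity, write new data and new parity).
- RAID-6: like RAID-5 but with two independent parity blocks per stripe, distributed across the disks; tolerates 2 simultaneous disk failures at the cost of two disks' worth of capacity.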
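The single-parity idea behind RAID-5 is a bytewise XOR across the blocks of a stripe. The sketch below shows parity computation, reconstruction of a lost block, and the read-modify-write parity update for a small write. It works on tiny in-memory byte strings with hypothetical helper names, and it omits parity rotation across disks and RAID-6's second parity, which needs different arithmetic.

```python
from functools import reduce


def xor_blocks(blocks: list[bytes]) -> bytes:
    """Bytewise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def update_parity(old_parity: bytes, old_data: bytes, new_data: bytes) -> bytes:
    """Small-write path: new parity from old parity, old data, and new data."""
    return xor_blocks([old_parity, old_data, new_data])


# One stripe with three data blocks plus parity (N = 4 disks).
data = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
parity = xor_blocks(data)

# Disk holding data[1] fails: rebuild it from the survivors and parity.
recovered = xor_blocks([data[0], data[2], parity])

# Overwrite data[0] without reading the other data blocks.
new_block = b"\xaa\xbb"
new_parity = update_parity(parity, data[0], new_block)
```

Here `recovered` equals the lost block, and `new_parity` matches a full recomputation over the updated stripe. The update path touches four blocks (two reads, two writes) for one logical write.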
SQL DDL
CREATE TABLE Volume (
    id                BIGSERIAL PRIMARY KEY,
    name              VARCHAR(255) NOT NULL UNIQUE,
    size_bytes        BIGINT NOT NULL,
    provisioned_bytes BIGINT NOT NULL DEFAULT 0,  -- thin: allocated so far
    snapshot_id       BIGINT,                     -- active snapshot (FK to Snapshot)
    status            VARCHAR(32) NOT NULL DEFAULT 'available',
    created_at        TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Block address map (sparse for thin-provisioned volumes)
CREATE TABLE VolumeBlock (
    volume_id       BIGINT NOT NULL REFERENCES Volume(id),
    logical_block   BIGINT NOT NULL,
    physical_node   VARCHAR(128) NOT NULL,
    physical_offset BIGINT NOT NULL,
    PRIMARY KEY (volume_id, logical_block)
);

CREATE TABLE Snapshot (
    id                 BIGSERIAL PRIMARY KEY,
    volume_id          BIGINT NOT NULL REFERENCES Volume(id),
    parent_snapshot_id BIGINT REFERENCES Snapshot(id),
    created_at         TIMESTAMPTZ NOT NULL DEFAULT now(),
    size_bytes         BIGINT NOT NULL DEFAULT 0
);
Python: Core Operations
BLOCK_SIZE = 4096  # 4 KB blocks
ZERO_BLOCK = b'\x00' * BLOCK_SIZE  # returned for unallocated blocks

# In-memory simulation: volume_id -> {logical_block -> data}
_volumes: dict[int, dict[int, bytes]] = {}
_snapshots: dict[int, dict[int, bytes]] = {}  # snapshot_id -> {logical_block -> data}
_snapshot_of: dict[int, int] = {}  # snapshot_id -> volume_id


def create_volume(name: str, size_gb: int, thin: bool = True) -> int:
    """Create a thin or thick provisioned volume."""
    vol_id = len(_volumes) + 1
    if thin:
        _volumes[vol_id] = {}  # empty map; physical blocks allocated on first write
    else:
        # Thick: every logical block gets an entry up front.
        block_count = (size_gb * 1024 * 1024 * 1024) // BLOCK_SIZE
        _volumes[vol_id] = {i: ZERO_BLOCK for i in range(block_count)}
    print(f"Volume {vol_id} ({name}, {size_gb}GB, thin={thin}) created")
    return vol_id


def create_snapshot(volume_id: int) -> int:
    """Create a CoW snapshot of a volume."""
    snap_id = len(_snapshots) + 1
    # Snapshot starts as a shallow copy of the block map (CoW: blocks shared).
    # bytes objects are immutable, so sharing references is safe.
    _snapshots[snap_id] = dict(_volumes[volume_id])  # copy references, not data
    _snapshot_of[snap_id] = volume_id
    print(f"Snapshot {snap_id} created for volume {volume_id}")
    return snap_id


def read_block(volume_id: int, logical_block: int) -> bytes:
    """Read a block from a volume. Returns zeros for unallocated thin blocks."""
    return _volumes.get(volume_id, {}).get(logical_block, ZERO_BLOCK)


def write_block(volume_id: int, logical_block: int, data: bytes) -> None:
    """Write a block, performing CoW for any active snapshot."""
    assert len(data) == BLOCK_SIZE, f"Block must be {BLOCK_SIZE} bytes"
    # CoW: for each snapshot of this volume that has no entry for the block
    # (it was unallocated at snapshot time), save the original first.
    for snap_id, snap_vol_id in _snapshot_of.items():
        if snap_vol_id == volume_id:
            if logical_block not in _snapshots[snap_id]:
                # Save original block into snapshot before overwriting
                original = _volumes.get(volume_id, {}).get(logical_block, ZERO_BLOCK)
                _snapshots[snap_id][logical_block] = original
    # Write new data to volume
    if volume_id not in _volumes:
        _volumes[volume_id] = {}
    _volumes[volume_id][logical_block] = data
Design Considerations Summary
- Thin vs thick provisioning: thin for flexibility and storage efficiency; thick for guaranteed performance and no allocation latency.
- CoW snapshot overhead: first write to any snapshotted block incurs extra read + write; deep snapshot chains compound this effect.
- iSCSI vs NVMe-oF: iSCSI on commodity Ethernet for cost-effective SAN; NVMe-oF with RDMA for near-local NVMe latency in high-performance clusters.
- Synchronous vs async replication: synchronous for zero RPO in same DC; async for geo-distributed DR with acceptable RPO.
- RAID: RAID-5/6 for storage efficiency with parity; RAID-1 for simplicity and fast degraded reads; avoid RAID-5 with large HDDs, where long rebuilds raise the chance of a second disk failure or an unrecoverable read error before redundancy is restored.