Why Hot-Cold Tiering?
Not all data is accessed equally. A small fraction of items (the “hot” set) drives the majority of reads, while the bulk of the dataset (the “cold” set) is rarely or never accessed. Storing all data on fast, expensive SSD storage wastes money. Hot-cold tiering automatically moves data between storage tiers based on access frequency, delivering both performance and cost efficiency.
Typical cost differential: SSD NVMe at $0.10/GB/month vs S3 object storage at $0.023/GB/month — over 4x savings for cold data, and real-world workloads often have 80-90% of data in the cold tier.
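To make the differential concrete, the blended cost can be sketched with the per-GB rates quoted above (a simplification that ignores request, retrieval, and egress charges):

```python
def monthly_cost_usd(total_gb: float, cold_fraction: float,
                     hot_rate: float = 0.10, cold_rate: float = 0.023) -> float:
    """Blended monthly storage cost at the per-GB rates quoted above."""
    hot_gb = total_gb * (1.0 - cold_fraction)
    return hot_gb * hot_rate + (total_gb - hot_gb) * cold_rate

# 100 TB all on SSD vs. tiered with 85% in the cold tier:
print(round(monthly_cost_usd(100_000, 0.0)))   # 10000
print(round(monthly_cost_usd(100_000, 0.85)))  # 3455
```

At 85% cold, the tiered layout costs roughly a third of the all-SSD layout, which is where the headline savings figure comes from.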
Access Frequency Tracking
Each item maintains an access counter with time-decay. Two common models:
- LFU with exponential decay: access count is halved every N hours. Recent accesses count more than old ones. An item accessed 100 times two weeks ago may score lower than one accessed 10 times today.
- Sliding window count: count accesses in the last 30 days only. Simple; no decay math required. Old accesses fall off naturally as the window slides.
On every read, access_count_30d is incremented and last_accessed_at is updated. A background job re-computes counts by querying the access event log, keeping the counter accurate over rolling windows.
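The exponential-decay variant is easiest to keep as a pure scoring function. A minimal sketch, assuming a 12-hour half-life (the constant is tunable):

```python
def decayed_count(raw_count: float, hours_since_decay: float,
                  half_life_hours: float = 12.0) -> float:
    """LFU score after exponential decay: halve once per elapsed half-life."""
    return raw_count * 0.5 ** (hours_since_decay / half_life_hours)

def record_access(raw_count: float, hours_since_decay: float,
                  half_life_hours: float = 12.0) -> float:
    """Decay the stale score first, then count the new access at full weight."""
    return decayed_count(raw_count, hours_since_decay, half_life_hours) + 1.0

# 100 accesses two weeks ago score far below 10 accesses today:
print(decayed_count(100, 14 * 24) < decayed_count(10, 0))  # True
```

Storing the score together with its last-decay timestamp means decay only needs to be applied lazily, at read or evaluation time, rather than by touching every item on a schedule.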
Hot Promotion Threshold
An item is promoted to the hot tier if its access count exceeds a threshold within the tracking window. For example: promote if access_count_30d > 100. The threshold is tunable per data class — user profile photos may have a lower threshold than product images.
Promotion is executed by the background tiering job, not on the read path, to avoid write amplification on every access.
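The promotion decision itself is a pure policy, which keeps it testable independently of the job scheduler. A sketch, assuming rows shaped like the DataItem table (the helper name is illustrative):

```python
def promotion_candidates(rows, hot_threshold: int = 100):
    """IDs of cold-tier rows whose 30-day count crossed the promotion threshold.

    `rows` is any iterable of dicts with 'id', 'tier', and 'access_count_30d',
    e.g. the result of a metadata-store query. The background job would copy
    each candidate back to hot storage, then flip its tier flag.
    """
    return [r["id"] for r in rows
            if r["tier"] == "cold" and r["access_count_30d"] > hot_threshold]
```

Keeping the threshold a parameter makes the per-data-class tuning described above a configuration concern rather than a code change.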
Cold Migration
Items not accessed in N days (configurable per tier) are migrated to cold storage:
- A background batch job queries DataItem WHERE last_accessed_at < now() - interval 'N days' AND tier = 'hot'.
- For each candidate: compress the item, upload it to S3 (or an HDD path), and record the cold_storage_key in the metadata store.
- Mark the item as tier = 'cold' in the hot-tier metadata store.
- Optionally delete the hot-tier copy (or retain it for a short grace period).
Migration is async and non-blocking. Reads during migration are served from the source (hot) tier until the migration flag is cleared.
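The candidate-selection step can likewise be expressed as a pure filter over metadata rows; a sketch, with the same per-tier N-days knob the section describes:

```python
from datetime import datetime, timedelta, timezone

def demotion_candidates(rows, cold_after_days: int = 30, now=None):
    """IDs of hot-tier rows that have been idle longer than the cutoff.

    `rows` is an iterable of dicts with 'id', 'tier', and 'last_accessed_at'
    (timezone-aware datetimes, matching the TIMESTAMPTZ column).
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=cold_after_days)
    return [r["id"] for r in rows
            if r["tier"] == "hot" and r["last_accessed_at"] < cutoff]
```

In production the same predicate runs as SQL inside the batch job; the Python form is useful for unit-testing the policy with synthetic clocks.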
Retrieval on Cold Miss
When a read request arrives for a cold item:
- Look up cold_storage_key in the metadata store.
- Fetch the compressed object from S3/HDD.
- Decompress and serve the item.
- Evaluate whether to re-promote: if the item's access pattern crosses the hot threshold again, schedule a promotion job.
Cold retrieval adds latency (10ms-1s depending on storage backend). Surface this to callers via a cache-hit header or metric so SLAs can be set appropriately.
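Surfacing the tier can be a thin wrapper around the cold fetch. A sketch, where metrics is a hypothetical stats client with a timing(name, ms, tags) method (not part of the implementation below):

```python
import time

def timed_cold_read(fetch, item_id, metrics):
    """Time a cold fetch and tag the latency metric with the serving tier.

    `fetch` is any callable like TieringManager.retrieve_cold. An HTTP layer
    would also echo the tier back to the caller, e.g. an X-Storage-Tier header.
    """
    start = time.monotonic()
    data = fetch(item_id)
    elapsed_ms = (time.monotonic() - start) * 1000.0
    metrics.timing("read.latency_ms", elapsed_ms, tags={"tier": "cold"})
    return data, elapsed_ms
```

Tagging latency by tier lets dashboards separate the hot-path p99 from the cold-path p99 instead of blending them into one misleading number.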
Warm Tier (Optional Middle Layer)
A warm tier (e.g., HDD local disk) can sit between hot SSD and cold object storage. Items moving from hot to cold pass through warm first, allowing re-promotion without a full S3 round-trip. Warm tier adds operational complexity but reduces re-promotion latency.
SQL Schema
CREATE TABLE DataItem (
id BIGSERIAL PRIMARY KEY,
item_key TEXT UNIQUE NOT NULL,
tier TEXT NOT NULL DEFAULT 'hot' CHECK (tier IN ('hot','warm','cold')),
access_count_30d INT NOT NULL DEFAULT 0,
last_accessed_at TIMESTAMPTZ NOT NULL DEFAULT now(),
cold_storage_key TEXT,
size_bytes BIGINT,
checksum TEXT
);
CREATE INDEX idx_item_tier_access ON DataItem (tier, last_accessed_at);
CREATE TABLE TierMigrationJob (
id BIGSERIAL PRIMARY KEY,
item_id BIGINT REFERENCES DataItem(id),
from_tier TEXT NOT NULL,
to_tier TEXT NOT NULL,
started_at TIMESTAMPTZ NOT NULL DEFAULT now(),
completed_at TIMESTAMPTZ,
bytes_migrated BIGINT,
status TEXT NOT NULL DEFAULT 'running' CHECK (status IN ('running','complete','failed'))
);
CREATE TABLE AccessEvent (
id BIGSERIAL PRIMARY KEY,
item_id BIGINT REFERENCES DataItem(id),
accessed_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE INDEX idx_access_item_time ON AccessEvent (item_id, accessed_at);
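The background recount mentioned earlier maps directly onto this schema. A sketch of the periodic maintenance statements (PostgreSQL syntax assumed; run on a schedule, not on the read path):

```sql
-- Recompute the rolling 30-day counters from the event log.
UPDATE DataItem d
SET access_count_30d = (
    SELECT count(*)
    FROM AccessEvent e
    WHERE e.item_id = d.id
      AND e.accessed_at > now() - interval '30 days'
);

-- Prune events outside the window so the log stays bounded.
DELETE FROM AccessEvent
WHERE accessed_at < now() - interval '30 days';
```

The per-read increment in track_access keeps the counter approximately fresh between runs; this recount is the authoritative correction that lets old accesses fall out of the window.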
Python Implementation Sketch
import boto3, gzip, json
from datetime import datetime, timezone

class TieringManager:
    def __init__(self, db, s3_bucket: str):
        self.db = db
        self.s3 = boto3.client('s3')
        self.s3_bucket = s3_bucket
        self.hot_threshold = 100   # accesses in 30 days
        self.cold_after_days = 30  # migrate if not accessed in N days

    def track_access(self, item_id: int):
        self.db.execute(
            "INSERT INTO AccessEvent (item_id) VALUES (%s)", (item_id,)
        )
        self.db.execute(
            "UPDATE DataItem SET access_count_30d = access_count_30d + 1, "
            "last_accessed_at = now() WHERE id = %s",
            (item_id,)
        )

    def evaluate_tier(self, item_id: int) -> str:
        row = self.db.fetchone(
            "SELECT tier, access_count_30d, last_accessed_at FROM DataItem WHERE id = %s",
            (item_id,)
        )
        if not row:
            return 'unknown'
        if row['tier'] == 'cold' and row['access_count_30d'] >= self.hot_threshold:
            return 'promote'
        # last_accessed_at is TIMESTAMPTZ, so compare against an aware datetime.
        idle_days = (datetime.now(timezone.utc) - row['last_accessed_at']).days
        if row['tier'] == 'hot' and idle_days >= self.cold_after_days:
            return 'demote'
        return row['tier']

    def migrate_to_cold(self, item_id: int):
        row = self.db.fetchone("SELECT * FROM DataItem WHERE id = %s", (item_id,))
        if not row or row['tier'] != 'hot':
            return  # unknown item, or already migrated by a concurrent run
        job_id = self.db.fetchone(
            "INSERT INTO TierMigrationJob (item_id, from_tier, to_tier) "
            "VALUES (%s, %s, %s) RETURNING id",
            (item_id, 'hot', 'cold')
        )['id']
        try:
            data = self._load_from_hot(row['item_key'])
            compressed = gzip.compress(json.dumps(data).encode())
            cold_key = f"cold/{item_id}/{row['item_key']}"
            self.s3.put_object(Bucket=self.s3_bucket, Key=cold_key, Body=compressed)
            self.db.execute(
                "UPDATE DataItem SET tier = 'cold', cold_storage_key = %s WHERE id = %s",
                (cold_key, item_id)
            )
            self.db.execute(
                "UPDATE TierMigrationJob SET completed_at = now(), "
                "bytes_migrated = %s, status = 'complete' WHERE id = %s",
                (len(compressed), job_id)
            )
        except Exception:
            self.db.execute(
                "UPDATE TierMigrationJob SET status = 'failed' WHERE id = %s",
                (job_id,)
            )
            raise

    def retrieve_cold(self, item_id: int) -> dict:
        row = self.db.fetchone(
            "SELECT cold_storage_key FROM DataItem WHERE id = %s", (item_id,)
        )
        obj = self.s3.get_object(Bucket=self.s3_bucket, Key=row['cold_storage_key'])
        compressed = obj['Body'].read()
        data = json.loads(gzip.decompress(compressed).decode())
        self.track_access(item_id)
        if self.evaluate_tier(item_id) == 'promote':
            # Schedule async promotion back to hot; the read path never blocks on it.
            pass
        return data

    def _load_from_hot(self, item_key: str) -> dict:
        return self.db.fetchone(
            "SELECT * FROM hot_storage WHERE item_key = %s", (item_key,)
        )