Data Archival Low-Level Design: Tiered Storage, Partition Detach, and S3 Cold Storage

A data archival system moves aged, infrequently accessed data from hot storage (fast, expensive) to warm or cold tiers (slower, cheap) without losing query access. This keeps the primary database lean and fast while retaining compliance data affordably. Core challenges: deciding what to archive and when, making archive data queryable for audits, handling references between hot and archived data, and running the archival process without impacting production.

Storage Tiers

-- Tier 1: Hot (Postgres primary) — data accessed in last 90 days
-- Tier 2: Warm (Postgres read replica or separate Postgres instance) — 90 days to 2 years
-- Tier 3: Cold (S3 + Parquet/Athena or Glacier) — data older than 2 years

-- Archival job tracking
CREATE TABLE ArchivalJob (
    job_id           UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    table_name       TEXT NOT NULL,
    archive_before   TIMESTAMPTZ NOT NULL,   -- archive rows older than this
    status           TEXT NOT NULL DEFAULT 'pending',
    rows_archived    BIGINT NOT NULL DEFAULT 0,
    last_archived_id BIGINT,                 -- checkpoint for resume
    started_at       TIMESTAMPTZ,
    completed_at     TIMESTAMPTZ,
    error            TEXT
);

-- Example: Order table archival marker
-- Rather than physically moving rows, add an is_archived flag
-- Hot table: WHERE is_archived = FALSE (partial index)
-- Archive table: archived_orders (separate schema or DB)

Partition-Based Archival (Preferred)

"""
For tables already partitioned by date (e.g., ORDER BY placed_at):
Archive = DETACH PARTITION (O(1)) + ATTACH to archive schema.
No row-by-row copying needed.

Example: archive OrderItem partitions older than 2 years.
"""

def archive_partition(conn, table_name: str, partition_name: str, archive_schema: str = "archive"):
    """
    Detach a partition from the hot table and re-attach to the archive schema.
    O(1) operation — no row copying.
    Requires the archive schema to have an identical partitioned parent table.
    """
    with conn.cursor() as cur:
        # 1. Detach from hot table
        cur.execute(f"ALTER TABLE {table_name} DETACH PARTITION {partition_name}")

        # 2. Attach to archive table (same structure, different schema)
        archive_parent = f"{archive_schema}.{table_name}"
        cur.execute(f"ALTER TABLE {archive_parent} ATTACH PARTITION {partition_name} "
                    f"FOR VALUES FROM (SELECT min_val FROM ...) TO (SELECT max_val FROM ...)")

        # 3. Log the archival
        cur.execute("""
            INSERT INTO ArchivalJob (table_name, archive_before, status, rows_archived)
            SELECT %s, NOW(), 'done', COUNT(*) FROM {partition_name}
        """.format(partition_name=partition_name), (table_name,))
    conn.commit()

Row-by-Row Archival (For Non-Partitioned Tables)

import psycopg2, json
import boto3

s3 = boto3.client('s3')
ARCHIVE_BUCKET = 'techinterview-archive'
BATCH_SIZE = 1000

def archive_old_orders(conn, archive_before: str, job_id: str) -> int:
    """
    Batch-archive Order rows older than archive_before.
    Write to S3 as newline-delimited JSON (NDJSON).
    Delete from hot table only after S3 write confirmed.
    Uses checkpoint (last_archived_id) to resume after failure.
    """
    # Load checkpoint
    with conn.cursor() as cur:
        cur.execute("SELECT last_archived_id FROM ArchivalJob WHERE job_id = %s", (job_id,))
        row = cur.fetchone()
    last_id = row[0] if row and row[0] else 0

    total_archived = 0
    while True:
        # Fetch next batch using keyset pagination
        with conn.cursor() as cur:
            cur.execute("""
                SELECT order_id, user_id, status, total_cents, placed_at
                FROM Order
                WHERE placed_at  %s
                ORDER BY order_id
                LIMIT %s
            """, (archive_before, last_id, BATCH_SIZE))
            rows = cur.fetchall()

        if not rows:
            break

        # Serialize to NDJSON
        batch_date = rows[0][4].strftime("%Y/%m/%d")
        s3_key = f"orders/{batch_date}/batch_{last_id}_{rows[-1][0]}.ndjson"
        ndjson = "n".join(
            json.dumps({"order_id": str(r[0]), "user_id": str(r[1]),
                        "status": r[2], "total_cents": r[3],
                        "placed_at": r[4].isoformat()})
            for r in rows
        )

        # Write to S3 first — only delete from DB after confirmed
        s3.put_object(Bucket=ARCHIVE_BUCKET, Key=s3_key, Body=ndjson.encode())

        # Delete from hot table (bounded batch to avoid long lock)
        ids = [r[0] for r in rows]
        with conn.cursor() as cur:
            cur.execute("DELETE FROM Order WHERE order_id = ANY(%s)", (ids,))

        # Checkpoint
        last_id = rows[-1][0]
        with conn.cursor() as cur:
            cur.execute(
                "UPDATE ArchivalJob SET last_archived_id=%s, rows_archived=rows_archived+%s WHERE job_id=%s",
                (last_id, len(rows), job_id)
            )
        conn.commit()
        total_archived += len(rows)

    return total_archived

Querying Archived Data

"""
Archive query strategies:

1. Postgres Foreign Data Wrapper (FDW):
   Expose S3 Parquet files via parquet_s3_fdw.
   SELECT * FROM archive.orders WHERE placed_at BETWEEN '2022-01-01' AND '2022-12-31';
   Query runs transparently across hot + archive.

2. AWS Athena:
   Point Athena at the S3 NDJSON/Parquet path.
   For compliance exports: run Athena query, download CSV, done.
   Cost: $5/TB scanned.

3. Federated view (application layer):
   GET /orders/{id} — check hot DB first; if not found, query archive API.
   Archive API queries S3 via Athena or a dedicated warm Postgres instance.
"""

def get_order(conn, order_id: str) -> dict | None:
    """Check hot DB first, fall back to archive."""
    with conn.cursor() as cur:
        cur.execute("SELECT * FROM Order WHERE order_id = %s", (order_id,))
        row = cur.fetchone()
    if row:
        return dict(zip([d[0] for d in conn.cursor().description], row))

    # Not in hot DB — check archive
    return get_order_from_archive(order_id)

def get_order_from_archive(order_id: str) -> dict | None:
    """Query Athena for archived order."""
    import boto3
    athena = boto3.client('athena')
    result = athena.start_query_execution(
        QueryString=f"SELECT * FROM archive.orders WHERE order_id = '{order_id}' LIMIT 1",
        ResultConfiguration={"OutputLocation": f"s3://{ARCHIVE_BUCKET}/athena-results/"}
    )
    # Wait for query and fetch results (simplified)
    return None  # actual implementation polls for query completion

Key Interview Points

  • Partition detach is O(1): For partitioned tables, ALTER TABLE … DETACH PARTITION is a catalog operation that takes milliseconds regardless of partition size. This is the gold standard for archival — no row scanning, no locking of the hot table, no data duplication. Always design high-volume tables (orders, events, logs) as partitioned from day one specifically to enable this archival pattern.
  • Write to archive before deleting from hot: The delete happens only after S3 put_object confirms. If S3 write fails, the row stays in the hot table — no data loss. If S3 succeeds but delete fails, the row exists in both places — idempotent on re-run (S3 PUT is idempotent, DELETE is idempotent). Never delete first — that’s data loss on S3 failure.
  • Bounded batch deletions: DELETE FROM Order WHERE order_id = ANY(1000 IDs) holds a lock for the duration of the transaction. Large deletes block concurrent reads/writes. Batch size of 100–1,000 rows keeps lock duration under 50ms. Run archival during off-peak hours. Monitor: track autovacuum dead-tuple count — archival generates dead tuples that autovacuum must clean up.
  • Referential integrity across tiers: An OrderItem row in the hot table may reference an Order row in the archive. Before deleting Order rows, ensure no hot OrderItem references them. Strategy: archive OrderItem before Order (child before parent), or use soft-deletes on hot (is_archived flag) before physical deletion. The is_archived flag keeps the PK in the hot table for FK satisfaction while allowing queries to filter with a partial index (WHERE is_archived = FALSE).
  • Cost calculation: Postgres RDS charges ~$0.12/GB-month for gp2 storage. S3 charges ~$0.023/GB-month. For a 10TB table with 90% cold data: hot Postgres cost = 1TB × $0.12 = $120/month; archive S3 cost = 9TB × $0.023 = $207/month. Total = $327/month vs. $1,200/month to keep it all hot. More important: 10TB Postgres runs significantly slower than 1TB — query planner, vacuum, and backup performance all degrade with table size.

Data archival and tiered storage system design is discussed in Databricks system design interview questions.

Data archival and cold storage system design is covered in Amazon system design interview preparation.

Data archival and large-scale data lifecycle design is discussed in Google system design interview guide.

Scroll to Top