A data archival system moves aged, infrequently accessed data from hot storage (fast, expensive) to warm or cold tiers (slower, cheap) without losing query access. This keeps the primary database lean and fast while retaining compliance data affordably. Core challenges: deciding what to archive and when, making archive data queryable for audits, handling references between hot and archived data, and running the archival process without impacting production.
Storage Tiers
-- Tier 1: Hot (Postgres primary) — data accessed in last 90 days
-- Tier 2: Warm (Postgres read replica or separate Postgres instance) — 90 days to 2 years
-- Tier 3: Cold (S3 + Parquet/Athena or Glacier) — data older than 2 years
-- Archival job tracking
CREATE TABLE ArchivalJob (
    job_id           UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    table_name       TEXT NOT NULL,
    archive_before   TIMESTAMPTZ NOT NULL,  -- archive rows older than this
    status           TEXT NOT NULL DEFAULT 'pending',
    rows_archived    BIGINT NOT NULL DEFAULT 0,
    last_archived_id BIGINT,                -- checkpoint for resume
    started_at       TIMESTAMPTZ,
    completed_at     TIMESTAMPTZ,
    error            TEXT
);
-- Example: Order table archival marker
-- Rather than physically moving rows, add an is_archived flag
-- Hot table: WHERE is_archived = FALSE (partial index)
-- Archive table: archived_orders (separate schema or DB)
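The flag-based marker above can be made concrete. A minimal sketch, assuming an is_archived BOOLEAN column on the hot Order table (the column, index name, and helper are illustrative):

```python
# Hypothetical DDL for the is_archived flag approach. "Order" is quoted
# because ORDER is a reserved word in SQL.
ADD_FLAG_SQL = 'ALTER TABLE "Order" ADD COLUMN is_archived BOOLEAN NOT NULL DEFAULT FALSE'

# Partial index: hot-path queries only touch non-archived rows, so the
# index stays small no matter how many rows have been flag-archived.
PARTIAL_INDEX_SQL = (
    'CREATE INDEX idx_order_active ON "Order" (placed_at) '
    'WHERE is_archived = FALSE'
)

def soft_archive_sql(cutoff: str) -> tuple:
    """Return a parameterized UPDATE that flag-archives rows older than cutoff."""
    return ('UPDATE "Order" SET is_archived = TRUE '
            'WHERE placed_at < %s AND is_archived = FALSE', (cutoff,))
```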
Partition-Based Archival (Preferred)
"""
For tables already partitioned by date (e.g., ORDER BY placed_at):
Archive = DETACH PARTITION (O(1)) + ATTACH to archive schema.
No row-by-row copying needed.
Example: archive OrderItem partitions older than 2 years.
"""
def archive_partition(conn, table_name: str, partition_name: str, archive_schema: str = "archive"):
    """
    Detach a partition from the hot table and re-attach it under the archive schema.
    O(1) catalog operation — no row copying.
    Requires the archive schema to have an identical partitioned parent table.
    """
    with conn.cursor() as cur:
        # 0. Capture the partition's declared bounds before detaching —
        #    ATTACH PARTITION requires a literal FOR VALUES clause, not a subquery
        cur.execute(
            "SELECT pg_get_expr(relpartbound, oid) FROM pg_class WHERE relname = %s",
            (partition_name,))
        bounds = cur.fetchone()[0]  # e.g. "FOR VALUES FROM ('2022-01-01') TO ('2022-02-01')"
        # 1. Detach from hot table (catalog-only change)
        cur.execute(f"ALTER TABLE {table_name} DETACH PARTITION {partition_name}")
        # 2. Attach to archive parent (same structure, different schema)
        archive_parent = f"{archive_schema}.{table_name}"
        cur.execute(f"ALTER TABLE {archive_parent} ATTACH PARTITION {partition_name} {bounds}")
        # 3. Log the archival
        cur.execute(
            f"INSERT INTO ArchivalJob (table_name, archive_before, status, rows_archived) "
            f"SELECT %s, NOW(), 'done', COUNT(*) FROM {partition_name}",
            (table_name,))
    conn.commit()
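Detaching every partition older than the retention cutoff requires enumerating partition names first. A sketch assuming the common <table>_YYYY_MM monthly naming convention (an assumption about naming, not something Postgres mandates):

```python
from datetime import date

def partitions_to_archive(newest_keep: date, oldest: date, table: str = "orderitem") -> list:
    """
    List monthly partition names strictly older than newest_keep, assuming
    partitions are named <table>_YYYY_MM (illustrative convention).
    """
    names = []
    y, m = oldest.year, oldest.month
    while (y, m) < (newest_keep.year, newest_keep.month):
        names.append(f"{table}_{y:04d}_{m:02d}")
        m += 1
        if m > 12:
            y, m = y + 1, 1
    return names
```

Each returned name can then be fed to archive_partition in its own transaction, so a failure mid-run leaves earlier partitions cleanly archived.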
Row-by-Row Archival (For Non-Partitioned Tables)
import psycopg2, json
import boto3
s3 = boto3.client('s3')
ARCHIVE_BUCKET = 'techinterview-archive'
BATCH_SIZE = 1000
def archive_old_orders(conn, archive_before: str, job_id: str) -> int:
    """
    Batch-archive Order rows older than archive_before.
    Write to S3 as newline-delimited JSON (NDJSON).
    Delete from the hot table only after the S3 write is confirmed.
    Uses a checkpoint (last_archived_id) to resume after failure.
    """
    # Load checkpoint
    with conn.cursor() as cur:
        cur.execute("SELECT last_archived_id FROM ArchivalJob WHERE job_id = %s", (job_id,))
        row = cur.fetchone()
        last_id = row[0] if row and row[0] else 0
    total_archived = 0
    while True:
        # Fetch next batch using keyset pagination.
        # "Order" is quoted because ORDER is a reserved word in SQL.
        with conn.cursor() as cur:
            cur.execute("""
                SELECT order_id, user_id, status, total_cents, placed_at
                FROM "Order"
                WHERE placed_at < %s AND order_id > %s
                ORDER BY order_id
                LIMIT %s
            """, (archive_before, last_id, BATCH_SIZE))
            rows = cur.fetchall()
        if not rows:
            break
        # Serialize to NDJSON
        batch_date = rows[0][4].strftime("%Y/%m/%d")
        s3_key = f"orders/{batch_date}/batch_{last_id}_{rows[-1][0]}.ndjson"
        ndjson = "\n".join(
            json.dumps({"order_id": str(r[0]), "user_id": str(r[1]),
                        "status": r[2], "total_cents": r[3],
                        "placed_at": r[4].isoformat()})
            for r in rows
        )
        # Write to S3 first — only delete from DB after confirmed
        s3.put_object(Bucket=ARCHIVE_BUCKET, Key=s3_key, Body=ndjson.encode())
        # Delete from hot table (bounded batch to avoid long lock)
        ids = [r[0] for r in rows]
        with conn.cursor() as cur:
            cur.execute('DELETE FROM "Order" WHERE order_id = ANY(%s)', (ids,))
        # Checkpoint
        last_id = rows[-1][0]
        with conn.cursor() as cur:
            cur.execute(
                "UPDATE ArchivalJob SET last_archived_id=%s, rows_archived=rows_archived+%s WHERE job_id=%s",
                (last_id, len(rows), job_id)
            )
        conn.commit()
        total_archived += len(rows)
    return total_archived
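Since the delete only fires after put_object returns, an additional safety net is to re-check the serialized batch against the rows about to be removed. A hypothetical helper (not part of the job above):

```python
import json

def verify_ndjson_batch(body: bytes, expected_ids: list) -> bool:
    """Confirm the serialized batch contains exactly the rows we are about to delete."""
    uploaded_ids = [json.loads(line)["order_id"]
                    for line in body.decode().splitlines() if line]
    return uploaded_ids == [str(i) for i in expected_ids]
```

Calling this on the exact bytes passed to put_object (or on a GET of the uploaded object) before issuing the DELETE turns a silent serialization bug into a loud failure.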
Querying Archived Data
"""
Archive query strategies:
1. Postgres Foreign Data Wrapper (FDW):
Expose S3 Parquet files via parquet_s3_fdw.
SELECT * FROM archive.orders WHERE placed_at BETWEEN '2022-01-01' AND '2022-12-31';
Query runs transparently across hot + archive.
2. AWS Athena:
Point Athena at the S3 NDJSON/Parquet path.
For compliance exports: run Athena query, download CSV, done.
Cost: $5/TB scanned.
3. Federated view (application layer):
GET /orders/{id} — check hot DB first; if not found, query archive API.
Archive API queries S3 via Athena or a dedicated warm Postgres instance.
"""
def get_order(conn, order_id: str) -> dict | None:
    """Check the hot DB first, fall back to the archive."""
    with conn.cursor() as cur:
        cur.execute('SELECT * FROM "Order" WHERE order_id = %s', (order_id,))
        row = cur.fetchone()
        if row:
            # Use this cursor's description while it is still open —
            # a fresh cursor has no description until it executes a query
            return dict(zip([d[0] for d in cur.description], row))
    # Not in hot DB — check archive
    return get_order_from_archive(order_id)
def get_order_from_archive(order_id: str) -> dict | None:
    """Query Athena for an archived order."""
    athena = boto3.client('athena')
    # Pass order_id as an execution parameter rather than splicing it
    # into the SQL string — avoids injection
    result = athena.start_query_execution(
        QueryString="SELECT * FROM archive.orders WHERE order_id = ? LIMIT 1",
        ExecutionParameters=[order_id],
        ResultConfiguration={"OutputLocation": f"s3://{ARCHIVE_BUCKET}/athena-results/"}
    )
    query_id = result["QueryExecutionId"]
    # Simplified: a real implementation polls get_query_execution(QueryExecutionId=query_id)
    # until the state is SUCCEEDED, then reads rows via get_query_results
    return None
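The omitted polling step might look like the sketch below. get_query_execution and its response shape are the real boto3 Athena API; the helper itself (name, timeout defaults) is illustrative:

```python
import time

def wait_for_athena(athena, query_id: str, timeout_s: float = 60.0, poll_s: float = 1.0) -> str:
    """Poll until the Athena query reaches a terminal state; return that state."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        resp = athena.get_query_execution(QueryExecutionId=query_id)
        state = resp["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return state
        time.sleep(poll_s)
    raise TimeoutError(f"Athena query {query_id} did not finish in {timeout_s}s")
```

Taking the client as a parameter keeps the helper testable with a stub and reusable across jobs.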
Key Interview Points
- Partition detach is O(1): For partitioned tables, ALTER TABLE … DETACH PARTITION is a catalog operation that takes milliseconds regardless of partition size. This is the gold standard for archival — no row scanning, no locking of the hot table, no data duplication. Always design high-volume tables (orders, events, logs) as partitioned from day one specifically to enable this archival pattern.
- Write to archive before deleting from hot: The delete happens only after S3 put_object confirms. If S3 write fails, the row stays in the hot table — no data loss. If S3 succeeds but delete fails, the row exists in both places — idempotent on re-run (S3 PUT is idempotent, DELETE is idempotent). Never delete first — that’s data loss on S3 failure.
- Bounded batch deletions: DELETE FROM Order WHERE order_id = ANY(1000 IDs) holds a lock for the duration of the transaction. Large deletes block concurrent reads/writes. Batch size of 100–1,000 rows keeps lock duration under 50ms. Run archival during off-peak hours. Monitor: track autovacuum dead-tuple count — archival generates dead tuples that autovacuum must clean up.
- Referential integrity across tiers: An OrderItem row in the hot table may reference an Order row in the archive. Before deleting Order rows, ensure no hot OrderItem references them. Strategy: archive OrderItem before Order (child before parent), or use soft-deletes on hot (is_archived flag) before physical deletion. The is_archived flag keeps the PK in the hot table for FK satisfaction while allowing queries to filter with a partial index (WHERE is_archived = FALSE).
- Cost calculation: Postgres RDS charges ~$0.12/GB-month for gp2 storage. S3 charges ~$0.023/GB-month. For a 10TB table with 90% cold data: hot Postgres cost = 1TB × $0.12 = $120/month; archive S3 cost = 9TB × $0.023 = $207/month. Total = $327/month vs. $1,200/month to keep it all hot. More important: 10TB Postgres runs significantly slower than 1TB — query planner, vacuum, and backup performance all degrade with table size.
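The cost arithmetic in the last bullet checks out in a few lines (rates and the 1 TB = 1,000 GB convention follow the bullet above):

```python
def monthly_storage_cost(hot_gb: float, cold_gb: float,
                         hot_rate: float = 0.12, cold_rate: float = 0.023) -> float:
    """Monthly storage cost in dollars for a hot/cold split, using per-GB-month rates."""
    return hot_gb * hot_rate + cold_gb * cold_rate

# 10 TB total, 90% cold:
tiered = monthly_storage_cost(1_000, 9_000)   # 120 + 207 = 327
all_hot = monthly_storage_cost(10_000, 0)     # 1,200
```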
FAQ

Q: Why is ALTER TABLE DETACH PARTITION the preferred archival method?
A: DETACH PARTITION is a catalog operation — it only modifies the system tables that track partition membership, not the data itself. No rows are scanned, no data is copied, no locks are held on the table data. A partition with 1 billion rows detaches in milliseconds. By contrast, DELETE WHERE created_at < cutoff on a 1B-row table takes hours, generates massive WAL (filling up disk), and holds row locks that block concurrent reads and writes. After detaching, the partition still exists as a standalone table — you can query it, attach it to an archive schema, COPY it to S3 as Parquet, or DROP it. Always design high-volume time-series tables as partitioned from day one specifically to enable this O(1) archival pattern.

Q: How do you ensure data is safely in S3 before deleting from the hot database?
A: The deletion must only happen after the S3 write is confirmed. Pattern: (1) SELECT the batch of rows to archive; (2) serialize to NDJSON or Parquet; (3) PUT to S3 — wait for the 200 OK response (do not fire-and-forget); (4) verify: GET the S3 object metadata or compute the ETag to confirm the write landed; (5) DELETE the rows from the hot table. If the process crashes between steps 3 and 5, the rows remain in both S3 and the hot table — duplicated but not lost. The archival job is idempotent: running it again re-uploads the same data to S3 (PUT is idempotent for the same key) and re-deletes the rows (already deleted — 0 rows affected). Never delete first: data loss on S3 failure is unrecoverable.

Q: How do you query archived data without pulling it back into the hot database?
A: AWS Athena allows SQL queries directly over S3-stored files (JSON, Parquet, CSV) using a serverless query engine. Setup: create an Athena table pointing at the S3 prefix: LOCATION 's3://archive-bucket/orders/'. Query: SELECT * FROM archive.orders WHERE placed_at BETWEEN '2022-01-01' AND '2022-12-31' AND user_id = 'abc'. Athena scans only relevant Parquet files (partition pruning by year/month directory structure). Cost: $5 per TB scanned — a 500GB query costs $2.50. For compliance exports: run the query, download the result CSV, done. For application-level access: federated query (check hot DB first; if not found, query Athena via boto3). Athena query latency is 2–30 seconds — unsuitable for real-time API responses but fine for admin tools and compliance.

Q: How do you handle foreign key constraints when archiving parent rows?
A: An Order row may be referenced by OrderItem, OrderEvent, and Shipment rows. You cannot DELETE the Order row while child rows reference it (foreign key violation). Correct archive order: archive children before parents. Sequence: (1) archive OrderItem → S3, delete from hot; (2) archive OrderEvent → S3, delete from hot; (3) archive Shipment → S3, delete from hot; (4) archive Order → S3, delete from hot. All steps in one job, transactionally consistent per batch. Alternative: use a soft delete (is_archived flag on the Order row) — the PK remains in the hot table satisfying FK constraints, but the row is excluded from normal queries via a partial index (WHERE is_archived = FALSE). Physical deletion happens in a later cleanup pass after all child rows are archived.

Q: How do you monitor archival job health and detect data loss?
A: Archival jobs that run silently can have bugs that silently lose data — rows deleted from hot before they were written to S3. Monitoring: (1) row-count check: after each batch, verify row_count(S3 batch) == row_count(deleted from DB); (2) reconciliation query: every week, sample 1% of archived row IDs and verify they exist in S3; (3) ArchivalJob table tracking rows_archived, started_at, completed_at — alert if a job doesn't complete within its expected window; (4) compare monthly row counts between the hot DB (after archival) and S3 (new data) — the total should match expectations based on write rate; (5) set up S3 event notifications to detect unexpected deletes from the archive bucket (should never happen — alert if it does). Never run archival without automated verification.