Data Archival Strategy: Low-Level Design

Data archival is the systematic movement of infrequently accessed data from primary storage to cheaper, slower storage while maintaining the ability to retrieve it. As databases grow, old data degrades query performance, increases backup costs, and inflates storage bills. A well-designed archival system removes this burden from the primary database while preserving access for compliance, analytics, and support use cases.

Archival vs. Deletion

Deletion removes data permanently. Archival moves data to a secondary store where it remains accessible but no longer affects primary database performance. Choose archival when: regulatory requirements mandate retention (financial records 7 years, medical records 10+ years), business intelligence needs historical data, or customer support may need to retrieve old records. Choose deletion when: data has no retention requirement, GDPR erasure is requested, or the data truly has zero future value.

Archive Storage Tiers

Storage cost versus retrieval speed trade-off drives tier selection:

Hot tier (primary database): Full query capability, millisecond access, highest cost. Keep data here while it is actively queried — typically the most recent 90-180 days depending on traffic patterns.

Warm tier (archive tables or data warehouse): Queryable via SQL, seconds-to-minutes access, 10x cheaper than primary. Suitable for compliance reporting, historical analytics, and customer support lookups. Options: separate PostgreSQL instance, BigQuery, Snowflake, Redshift.

Cold tier (object storage): Parquet files in S3 or GCS, query via Athena or BigQuery external tables. Cheapest — $0.023/GB/month vs $0.10/GB/month for managed databases. Retrieval takes seconds for Athena queries or hours for Glacier. Best for data older than 2 years that is accessed less than once per month.

Archival Pipeline Design

Identifying Archive Candidates

Define an archival criterion per table: orders older than 2 years, events older than 90 days, audit logs older than 1 year. Use a timestamp column (created_at, completed_at) as the archival key. Avoid archiving based on row count alone — time-based archival is predictable and aligns with retention policies.

Batch Archival Process

Run archival as a scheduled job (nightly or weekly): (1) SELECT rows meeting the archival criterion in batches of 1000-10000 rows; (2) INSERT into the archive store (archive table, data warehouse, or S3 as Parquet); (3) verify the insert succeeded by checking row counts; (4) DELETE the archived rows from the primary table. Batch in small chunks to avoid long-running transactions that lock the table and degrade live query performance. Use cursor-based pagination to walk through candidates.
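The batch loop above can be sketched in Python. This is a minimal illustration using SQLite and hypothetical `orders` / `orders_archive` tables (the table names, batch size, and cutoff are assumptions, not a prescribed schema); in production the archive store would be a separate database or warehouse rather than a second table.

```python
import sqlite3

# Hypothetical schema: an `orders` table archived into `orders_archive`.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, created_at TEXT)")
db.execute("CREATE TABLE orders_archive (id INTEGER PRIMARY KEY, created_at TEXT)")
db.executemany("INSERT INTO orders VALUES (?, ?)",
               [(i, f"2021-01-{i:02d}") for i in range(1, 26)])

BATCH = 10            # small batches keep transactions short
CUTOFF = "2022-01-01" # the archival criterion: rows created before this date

def archive_batch(db, last_id):
    """Archive one batch of candidates with id > last_id (cursor pagination).

    Returns the highest id processed, or None when no candidates remain."""
    rows = db.execute(
        "SELECT id, created_at FROM orders "
        "WHERE created_at < ? AND id > ? ORDER BY id LIMIT ?",
        (CUTOFF, last_id, BATCH)).fetchall()
    if not rows:
        return None
    with db:  # one short transaction per batch: copy, verify, delete
        db.executemany(
            "INSERT OR IGNORE INTO orders_archive VALUES (?, ?)", rows)
        # Verify the copy landed before deleting from the primary table.
        copied = db.execute(
            "SELECT COUNT(*) FROM orders_archive WHERE id BETWEEN ? AND ?",
            (rows[0][0], rows[-1][0])).fetchone()[0]
        assert copied == len(rows), "archive copy incomplete; aborting delete"
        db.executemany("DELETE FROM orders WHERE id = ?",
                       [(r[0],) for r in rows])
    return rows[-1][0]

last_id = 0
while (nxt := archive_batch(db, last_id)) is not None:
    last_id = nxt
```

Each iteration is its own transaction, so a crash mid-run leaves at most one batch copied-but-not-deleted — which the idempotency measures below make safe to re-run.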

Idempotency

Archival jobs must be idempotent: if the job fails mid-run, re-running it should not duplicate archived rows or delete unarchived rows. Implement with a two-phase approach: mark rows as “archive_pending” first, then move them after confirmation. Or use the primary key as a deduplication key in the archive store — INSERT OR IGNORE in SQLite, INSERT … ON CONFLICT DO NOTHING in PostgreSQL.
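The deduplication-key approach can be demonstrated in a few lines. This sketch assumes a hypothetical `events` table; the primary key doubles as the dedup key, so re-running the copy step after a mid-job crash is harmless.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
db.execute("CREATE TABLE events_archive (id INTEGER PRIMARY KEY, payload TEXT)")
db.executemany("INSERT INTO events VALUES (?, ?)",
               [(1, "a"), (2, "b"), (3, "c")])

def copy_to_archive(db):
    # The primary key acts as the deduplication key: rows already present
    # in the archive are silently skipped (SQLite's INSERT OR IGNORE;
    # PostgreSQL would use INSERT ... ON CONFLICT DO NOTHING).
    db.execute("INSERT OR IGNORE INTO events_archive SELECT * FROM events")
    db.commit()

copy_to_archive(db)
copy_to_archive(db)  # simulated re-run after a failure: no duplicates
```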

Foreign Key Constraints

Archiving parent records while child records remain in the primary database breaks foreign key constraints. Options: archive parent and all children together atomically; use SET NULL on child foreign keys before archiving the parent; or archive in reverse dependency order (children first, then parents). For order/line-item relationships, archive the entire order with all line items in a single archival unit.
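The order/line-item case can be sketched as a single archival unit: copy the whole object graph, then delete children before the parent so the foreign key is never violated. The schema below is a hypothetical minimal example, not a recommended design.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("PRAGMA foreign_keys = ON")
db.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY);
CREATE TABLE line_items (id INTEGER PRIMARY KEY,
                         order_id INTEGER REFERENCES orders(id));
CREATE TABLE orders_archive (id INTEGER PRIMARY KEY);
CREATE TABLE line_items_archive (id INTEGER PRIMARY KEY, order_id INTEGER);
INSERT INTO orders VALUES (100);
INSERT INTO line_items VALUES (1, 100), (2, 100);
""")

def archive_order(db, order_id):
    """Move one order plus all its line items as one atomic archival unit."""
    with db:  # single transaction: all-or-nothing
        db.execute("INSERT INTO orders_archive "
                   "SELECT * FROM orders WHERE id = ?", (order_id,))
        db.execute("INSERT INTO line_items_archive "
                   "SELECT * FROM line_items WHERE order_id = ?", (order_id,))
        # Delete in reverse dependency order: children first, then parent.
        db.execute("DELETE FROM line_items WHERE order_id = ?", (order_id,))
        db.execute("DELETE FROM orders WHERE id = ?", (order_id,))

archive_order(db, 100)
```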

Querying Archived Data

Design a unified query layer that abstracts whether data is in the primary database or the archive: check primary first, fall back to archive if not found. For customer support tools, this means a support agent can look up a 3-year-old order without knowing it has been archived. Implement as a repository pattern: OrderRepository.findById() checks primary DB, then archive table, then S3 cold store — transparent to the caller.
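A minimal repository sketch, with the two tiers modeled as two in-memory SQLite databases (the class and table names are illustrative assumptions — a real implementation would wrap a primary connection, a warehouse client, and an Athena query behind the same interface):

```python
import sqlite3

# Hot tier and warm tier, modeled as two separate databases.
primary = sqlite3.connect(":memory:")
archive = sqlite3.connect(":memory:")
for conn in (primary, archive):
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
primary.execute("INSERT INTO orders VALUES (1, 'recent')")
archive.execute("INSERT INTO orders VALUES (999, 'archived')")

class OrderRepository:
    """Checks each tier in order; callers never learn where the row lives."""
    def __init__(self, tiers):
        self.tiers = tiers  # hottest first, coldest last

    def find_by_id(self, order_id):
        for conn in self.tiers:
            row = conn.execute(
                "SELECT id, status FROM orders WHERE id = ?",
                (order_id,)).fetchone()
            if row is not None:
                return row
        return None

repo = OrderRepository([primary, archive])
```

Lookups for fresh and archived orders go through the same call, so support tooling needs no awareness of the archival boundary.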

GDPR and Archival

Archival does not satisfy GDPR right-to-erasure — archived data must also be deleted on erasure request. Design archival with erasure in mind: store PII in erasable columns rather than embedded in JSON blobs, maintain a record of which archive files or rows contain a given user’s data, and process erasure requests across primary and archive tiers. Anonymization (replacing PII with a placeholder) in the archive store may satisfy erasure requirements while preserving aggregate analytics.
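The anonymization option can be sketched as an in-place update that scrubs the erasable PII columns while leaving non-personal aggregates intact. The schema and placeholder value are assumptions for illustration; whether anonymization legally satisfies a given erasure request is a question for counsel, not code.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders_archive "
           "(id INTEGER PRIMARY KEY, user_id INTEGER, email TEXT, total REAL)")
db.executemany("INSERT INTO orders_archive VALUES (?, ?, ?, ?)",
               [(1, 42, "alice@example.com", 19.99),
                (2, 42, "alice@example.com", 5.00),
                (3, 7, "bob@example.com", 12.50)])

def erase_user(db, user_id):
    # Anonymize rather than delete: PII columns (email, user_id) are
    # scrubbed, while order totals survive for aggregate analytics.
    db.execute("UPDATE orders_archive "
               "SET email = '[DELETED]', user_id = NULL WHERE user_id = ?",
               (user_id,))
    db.commit()

erase_user(db, 42)
```

Because PII lives in dedicated columns rather than inside a JSON blob, the erasure is a single targeted UPDATE instead of a rewrite of every archived record.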
