Data archival is the systematic movement of infrequently accessed data from primary storage to cheaper, slower storage while maintaining the ability to retrieve it. As databases grow, old data degrades query performance, increases backup costs, and inflates storage bills. A well-designed archival system removes this burden from the primary database while preserving access for compliance, analytics, and support use cases.
Archival vs. Deletion
Deletion removes data permanently. Archival moves data to a secondary store where it remains accessible but no longer affects primary database performance. Choose archival when: regulatory requirements mandate retention (financial records 7 years, medical records 10+ years), business intelligence needs historical data, or customer support may need to retrieve old records. Choose deletion when: data has no retention requirement, GDPR erasure is requested, or the data truly has zero future value.
Archive Storage Tiers
Storage cost versus retrieval speed trade-off drives tier selection:
Hot tier (primary database): Full query capability, millisecond access, highest cost. Keep data here while it is actively queried — typically the most recent 90-180 days depending on traffic patterns.
Warm tier (archive tables or data warehouse): Queryable via SQL, seconds-to-minutes access, 10x cheaper than primary. Suitable for compliance reporting, historical analytics, and customer support lookups. Options: separate PostgreSQL instance, BigQuery, Snowflake, Redshift.
Cold tier (object storage): Parquet files in S3 or GCS, query via Athena or BigQuery external tables. Cheapest — $0.023/GB/month vs $0.10/GB for managed databases. Retrieval takes seconds for Athena queries or hours for Glacier. Best for data older than 2 years that is accessed less than once per month.
Archival Pipeline Design
Identifying Archive Candidates
Define an archival criterion per table: orders older than 2 years, events older than 90 days, audit logs older than 1 year. Use a timestamp column (created_at, completed_at) as the archival key. Avoid archiving based on row count alone — time-based archival is predictable and aligns with retention policies.
Batch Archival Process
Run archival as a scheduled job (nightly or weekly): (1) SELECT rows meeting the archival criterion in batches of 1000-10000 rows; (2) INSERT into the archive store (archive table, data warehouse, or S3 as Parquet); (3) verify the insert succeeded by checking row counts; (4) DELETE the archived rows from the primary table. Batch in small chunks to avoid long-running transactions that lock the table and degrade live query performance. Use cursor-based pagination to walk through candidates.
Idempotency
Archival jobs must be idempotent: if the job fails mid-run, re-running it should not duplicate archived rows or delete unarchived rows. Implement with a two-phase approach: mark rows as “archive_pending” first, then move them after confirmation. Or use the primary key as a deduplication key in the archive store — INSERT OR IGNORE in SQLite, INSERT … ON CONFLICT DO NOTHING in PostgreSQL.
Foreign Key Constraints
Archiving parent records while child records remain in the primary database breaks foreign key constraints. Options: archive parent and all children together atomically; use SET NULL on child foreign keys before archiving the parent; or archive in reverse dependency order (children first, then parents). For order/line-item relationships, archive the entire order with all line items in a single archival unit.
Querying Archived Data
Design a unified query layer that abstracts whether data is in the primary database or the archive: check primary first, fall back to archive if not found. For customer support tools, this means a support agent can look up a 3-year-old order without knowing it has been archived. Implement as a repository pattern: OrderRepository.findById() checks primary DB, then archive table, then S3 cold store — transparent to the caller.
GDPR and Archival
Archival does not satisfy GDPR right-to-erasure — archived data must also be deleted on erasure request. Design archival with erasure in mind: store PII in erasable columns rather than embedded in JSON blobs, maintain a record of which archive files or rows contain a given user’s data, and process erasure requests across primary and archive tiers. Anonymization (replacing PII with a placeholder) in the archive store may satisfy erasure requirements while preserving aggregate analytics.
See also: Meta Interview Guide 2026: Facebook, Instagram, WhatsApp Engineering
See also: Uber Interview Guide 2026: Dispatch Systems, Geospatial Algorithms, and Marketplace Engineering
See also: Scale AI Interview Guide 2026: Data Infrastructure, RLHF Pipelines, and ML Engineering
See also: Airbnb Interview Guide 2026: Search Systems, Trust and Safety, and Full-Stack Engineering
See also: Databricks Interview Guide 2026: Spark Internals, Delta Lake, and Lakehouse Architecture
See also: Anthropic Interview Guide 2026: Process, Questions, and AI Safety
See also: Atlassian Interview Guide
See also: Coinbase Interview Guide
See also: Shopify Interview Guide
See also: Snap Interview Guide
See also: Lyft Interview Guide 2026: Rideshare Engineering, Real-Time Dispatch, and Safety Systems
See also: Stripe Interview Guide 2026: Process, Bug Bash Round, and Payment Systems