Question 1

What are the storage tiers in a data archival system?

Accepted Answer

Hot tier: the primary database (millisecond access, highest cost). Keep data here while it is actively queried, typically the most recent 90-180 days. Warm tier: an archive database or data warehouse such as BigQuery or Redshift (seconds to minutes, roughly 10x cheaper than the hot tier), suitable for compliance reporting and customer support lookups. Cold tier: object storage holding Parquet files in S3 or GCS (cheapest, about $0.023/GB/month at the S3 Standard list price; queried via Athena or BigQuery external tables; retrieval takes seconds to hours), best for rarely accessed data older than 2 years. Data moves down the tiers as it ages, and retrieval SLAs loosen as the tier gets colder.
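The age-based routing described above can be sketched as a small function. This is a minimal illustration, not a prescribed policy: the thresholds (180 days for hot, 730 days for cold) are assumptions taken from the ranges in the answer, and the name storage_tier is hypothetical.

```python
from datetime import date, timedelta

# Assumed thresholds, taken from the ranges in the answer above.
HOT_MAX_AGE = timedelta(days=180)    # upper end of the 90-180 day hot window
COLD_MIN_AGE = timedelta(days=730)   # roughly 2 years


def storage_tier(created_on: date, today: date) -> str:
    """Return 'hot', 'warm', or 'cold' for a record based on its age."""
    age = today - created_on
    if age <= HOT_MAX_AGE:
        return "hot"
    if age < COLD_MIN_AGE:
        return "warm"
    return "cold"
```

In practice the thresholds would come from access patterns and retention policy rather than fixed constants, so treat the numbers as placeholders.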

Question 2

How do you design an idempotent archival pipeline?

Accepted Answer

An idempotent archival pipeline can be safely re-run after failing mid-execution without duplicating archived rows or deleting unarchived ones. Implement it in two phases: (1) mark rows as archive_pending (using a status column or a staging table of IDs); (2) copy the marked rows to the archive store, verify that the row counts match, then delete them from the primary. If the job crashes after copying but before deleting, the rerun finds the archive_pending markers and repeats the copy without creating duplicates, provided the write is itself idempotent (use INSERT ... ON CONFLICT DO NOTHING or a WHERE NOT EXISTS check against the archive store), then proceeds to verification and deletion. Use keyset (cursor-based) pagination to process batches of 1,000-10,000 rows rather than a single large transaction.
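The mark, copy, verify, delete sequence above might look like the following sketch. It uses Python's built-in sqlite3 module with both tables in one database purely so the example runs on its own; in a real system the archive would be a separate store. The table names (events, events_archive), the status column values, and the function archive_batch are all assumptions for illustration. INSERT OR IGNORE is SQLite's spelling of the idempotent write; PostgreSQL would use INSERT ... ON CONFLICT DO NOTHING.

```python
import sqlite3

# Hypothetical schema: a primary table with a status column and an archive table.
SCHEMA = """
CREATE TABLE events (
    id INTEGER PRIMARY KEY,
    created_at TEXT NOT NULL,          -- ISO date, so string comparison orders correctly
    payload TEXT,
    status TEXT NOT NULL DEFAULT 'active'
);
CREATE TABLE events_archive (
    id INTEGER PRIMARY KEY,
    created_at TEXT NOT NULL,
    payload TEXT
);
"""


def archive_batch(conn: sqlite3.Connection, cutoff: str, batch_size: int = 1000) -> int:
    """Archive rows older than cutoff; returns how many rows left the primary table."""
    # Phase 1: mark up to batch_size new candidates. Rows left in
    # 'archive_pending' by a crashed run keep their marker and are picked up too.
    conn.execute(
        "UPDATE events SET status = 'archive_pending' WHERE id IN ("
        "  SELECT id FROM events WHERE created_at < ? AND status = 'active'"
        "  ORDER BY id LIMIT ?)",
        (cutoff, batch_size),
    )
    conn.commit()

    # Phase 2a: copy. OR IGNORE skips rows a previous run already copied.
    conn.execute(
        "INSERT OR IGNORE INTO events_archive (id, created_at, payload) "
        "SELECT id, created_at, payload FROM events WHERE status = 'archive_pending'"
    )
    conn.commit()

    # Phase 2b: verify that every pending row now exists in the archive.
    pending = conn.execute(
        "SELECT COUNT(*) FROM events WHERE status = 'archive_pending'"
    ).fetchone()[0]
    copied = conn.execute(
        "SELECT COUNT(*) FROM events_archive WHERE id IN ("
        "  SELECT id FROM events WHERE status = 'archive_pending')"
    ).fetchone()[0]
    if pending != copied:
        raise RuntimeError(f"verification failed: {pending} pending, {copied} copied")

    # Phase 2c: delete only after verification succeeded.
    conn.execute("DELETE FROM events WHERE status = 'archive_pending'")
    conn.commit()
    return pending
```

A driver would call archive_batch in a loop until it returns 0. Because every step is either a no-op or completes the earlier work when repeated, a crash at any commit boundary leaves the job resumable.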

Question 3

How do you handle foreign key constraints when archiving parent records?

Accepted Answer

Archiving a parent record while its children remain in the primary database violates referential integrity. Options: (1) archive the parent and all its children together as an atomic unit (copy the complete object graph, then delete in reverse dependency order: children first, then parents); (2) nullify the child foreign keys before archiving the parent (appropriate only if the child can exist without the parent); (3) archive table by table in reverse dependency order (children before parents) within a single transaction, so no child is ever left pointing at a missing parent. For order/line-item relationships, archive the entire order plus its line items together. Design archival unit boundaries early, because retrofitting them after the schema is built is painful.
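Option (1) for the order/line-item case can be sketched as below, again using sqlite3 so the example is self-contained. The schema, table names, and the function archive_order are hypothetical. The sketch assumes the archive tables live in the same database, which lets a single transaction cover both the copy and the delete; with a separate archive store the copy would need its own verification step before the delete.

```python
import sqlite3

# Hypothetical schema: line_items references orders through a foreign key.
SCHEMA = """
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT NOT NULL);
CREATE TABLE line_items (
    id INTEGER PRIMARY KEY,
    order_id INTEGER NOT NULL REFERENCES orders(id),
    sku TEXT NOT NULL
);
CREATE TABLE orders_archive (id INTEGER PRIMARY KEY, customer TEXT NOT NULL);
CREATE TABLE line_items_archive (
    id INTEGER PRIMARY KEY, order_id INTEGER NOT NULL, sku TEXT NOT NULL
);
"""


def archive_order(conn: sqlite3.Connection, order_id: int) -> None:
    """Copy one order and its line items to the archive, then delete them."""
    # 'with conn' commits on success and rolls back if any statement raises,
    # so the order and its children move together or not at all.
    with conn:
        conn.execute(
            "INSERT OR IGNORE INTO orders_archive (id, customer) "
            "SELECT id, customer FROM orders WHERE id = ?",
            (order_id,),
        )
        conn.execute(
            "INSERT OR IGNORE INTO line_items_archive (id, order_id, sku) "
            "SELECT id, order_id, sku FROM line_items WHERE order_id = ?",
            (order_id,),
        )
        # Delete in reverse dependency order: children first, then the parent.
        conn.execute("DELETE FROM line_items WHERE order_id = ?", (order_id,))
        conn.execute("DELETE FROM orders WHERE id = ?", (order_id,))
```

Note that SQLite only enforces foreign keys after PRAGMA foreign_keys = ON has been issued on the connection; with enforcement on, deleting the parent before its children raises sqlite3.IntegrityError, which is why the delete order matters.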

Question 4

Does data archival satisfy GDPR right-to-erasure requirements?

Accepted Answer

No. Archival alone does not satisfy GDPR erasure. Archived data remains stored and is still personal data subject to Article 17. True erasure requires deleting the data from the archive store (warm tier tables, cold tier Parquet files), removing it from backups (either by letting the backups expire within their retention period or by anonymizing them), and processing the erasure across all derived stores (analytics, search indexes, caches). Track which archive files or partitions contain a given user ID to enable targeted erasure. Consider anonymization as an alternative to deletion: replacing PII with a placeholder like [DELETED] can satisfy erasure, provided the remaining data can no longer be linked back to the individual, while preserving aggregate analytics and referential integrity in historical records.
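The tracking and anonymization ideas above can be illustrated with a small in-memory sketch. Everything here is an assumption for illustration: the class ErasureIndex, the PII_FIELDS tuple, and the record layout are hypothetical, and a real system would persist the index and rewrite the affected Parquet files. Whether a given scrubbing scheme counts as anonymization under GDPR is a legal judgment this code does not settle.

```python
from collections import defaultdict

# Hypothetical set of columns treated as PII in this sketch.
PII_FIELDS = ("name", "email")


class ErasureIndex:
    """Tracks which archive partitions contain rows for each user ID."""

    def __init__(self) -> None:
        self._partitions_by_user: dict[int, set[str]] = defaultdict(set)

    def record(self, user_id: int, partition: str) -> None:
        """Call while writing an archive partition that contains this user."""
        self._partitions_by_user[user_id].add(partition)

    def partitions_for(self, user_id: int) -> list[str]:
        """Partitions that must be rewritten to erase this user."""
        return sorted(self._partitions_by_user.get(user_id, set()))


def anonymize(row: dict) -> dict:
    """Return a copy of the row with PII fields replaced by a placeholder."""
    return {k: ("[DELETED]" if k in PII_FIELDS else v) for k, v in row.items()}


def erase_user(rows: list[dict], user_id: int) -> list[dict]:
    """Anonymize the target user's rows; leave all other rows untouched."""
    return [anonymize(r) if r["user_id"] == user_id else r for r in rows]
```

An erasure request would look up partitions_for(user_id), load each listed partition, apply erase_user, and write the partition back, so only the files that actually contain the user are touched.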

Data Archival Strategy: Low-Level Design

Archival vs. Deletion

Archive Storage Tiers

Archival Pipeline Design

Identifying Archive Candidates

Batch Archival Process

Idempotency

Foreign Key Constraints

Querying Archived Data