Content Archival Service: Storage Tiers
The archival service manages long-term content retention across four S3 storage tiers with different cost and retrieval characteristics.
Tier Definitions
Hot     → S3 Standard                  → millisecond retrieval  → high cost   → active content
Warm    → S3 Standard-IA               → millisecond retrieval  → medium cost → infrequent access
Cold    → S3 Glacier Instant Retrieval → millisecond retrieval  → low cost    → rare access
Archive → S3 Glacier Deep Archive      → up to 12h retrieval    → lowest cost → compliance storage
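The tier table above can be captured as a configuration map the application consults when recording or reporting tier state. A minimal sketch (names and relative-cost figures are illustrative, not billed rates; the S3 storage-class identifiers are the real API constants):

```python
# Tier configuration keyed by our tier names; s3_class values are the
# actual S3 StorageClass API constants. relative_cost is illustrative only.
STORAGE_TIERS = {
    "hot":     {"s3_class": "STANDARD",     "retrieval": "milliseconds",   "relative_cost": 4},
    "warm":    {"s3_class": "STANDARD_IA",  "retrieval": "milliseconds",   "relative_cost": 3},
    "cold":    {"s3_class": "GLACIER_IR",   "retrieval": "milliseconds",   "relative_cost": 2},
    "archive": {"s3_class": "DEEP_ARCHIVE", "retrieval": "up to 12 hours", "relative_cost": 1},
}

def is_synchronous(tier: str) -> bool:
    """Only deep archive requires an async restore; all other tiers serve reads directly."""
    return tier != "archive"
```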
Core Schema
ArchiveRecord Table
CREATE TABLE archive_records (
    id              BIGSERIAL PRIMARY KEY,
    content_id      BIGINT NOT NULL,
    content_type    TEXT NOT NULL,              -- 'financial', 'medical', 'chat', etc.
    content_hash    CHAR(64) NOT NULL,          -- SHA-256 hex digest
    storage_tier    TEXT NOT NULL,              -- hot/warm/cold/archive
    s3_key          TEXT NOT NULL,
    size_bytes      BIGINT,
    archived_at     TIMESTAMPTZ DEFAULT now(),
    retention_until TIMESTAMPTZ,
    legal_hold      BOOLEAN DEFAULT false,
    deleted_at      TIMESTAMPTZ                 -- soft delete marker
);
RetrievalRequest Table
CREATE TABLE retrieval_requests (
    id                BIGSERIAL PRIMARY KEY,
    archive_record_id BIGINT NOT NULL REFERENCES archive_records(id),
    requester         TEXT NOT NULL,
    status            TEXT DEFAULT 'pending',   -- pending/available/expired
    requested_at      TIMESTAMPTZ DEFAULT now(),
    available_at      TIMESTAMPTZ,              -- when restore completes
    expires_at        TIMESTAMPTZ               -- availability window end
);
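The status column only ever moves pending → available → expired. A small guard like the following sketch can reject illegal transitions before they reach the database (enforcement could equally live in a CHECK constraint or trigger):

```python
# Legal transitions for retrieval_requests.status.
_ALLOWED_TRANSITIONS = {
    "pending":   {"available"},
    "available": {"expired"},
    "expired":   set(),  # terminal state
}

def can_transition(current: str, new: str) -> bool:
    """True if moving retrieval_requests.status from `current` to `new` is legal."""
    return new in _ALLOWED_TRANSITIONS.get(current, set())
```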
Retention Policies
Retention duration is configurable per content type. The retention_until timestamp is computed at archive time.
Retention rules (examples):
financial → 7 years
medical → 10 years
chat → 3 years
legal → 10 years (often with legal_hold=true)
-- At archive time:
retention_until = archived_at + INTERVAL per content_type rule
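The computation above can be sketched in a few lines, using the example rules listed earlier. The 365.25-day year is an assumption to absorb leap days; the exact interval convention is a compliance decision not specified in this document:

```python
from datetime import datetime, timedelta, timezone

# Example retention rules from the section above, in years.
RETENTION_YEARS = {"financial": 7, "medical": 10, "chat": 3, "legal": 10}

def retention_until(content_type: str, archived_at: datetime) -> datetime:
    """Compute the retention_until timestamp at archive time.

    Raises KeyError for an unknown content_type: failing loudly beats
    silently archiving with no retention rule.
    """
    years = RETENTION_YEARS[content_type]
    return archived_at + timedelta(days=round(365.25 * years))
```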
Lifecycle Tier Transitions
S3 Lifecycle policies automate tier transitions based on object age, eliminating manual migration jobs for the majority of content.
S3 Lifecycle rules on archive bucket:
Day 0 → S3 Standard (hot)
Day 30 → S3 Standard-IA (warm)
Day 90 → S3 Glacier Instant Retrieval (cold)
Day 365 → S3 Glacier Deep Archive (archive)
The application updates storage_tier in ArchiveRecord via S3 Lifecycle
event notifications (EventBridge → Lambda → UPDATE archive_records).
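The Lambda side of that pipeline mostly reduces to mapping an S3 storage class back to a tier name. A sketch of the pure part, with the database write stubbed as a comment; the event field names follow S3's EventBridge storage-class-change detail shape but should be treated as an assumption to verify against a live event:

```python
# Map S3 storage classes back to our tier names.
CLASS_TO_TIER = {
    "STANDARD": "hot",
    "STANDARD_IA": "warm",
    "GLACIER_IR": "cold",
    "DEEP_ARCHIVE": "archive",
}

def tier_update_from_event(detail: dict) -> tuple[str, str]:
    """Extract (s3_key, new_tier) from an S3 EventBridge lifecycle event detail.

    Field names are an assumption based on S3's event shape; verify in staging.
    """
    key = detail["object"]["key"]
    tier = CLASS_TO_TIER[detail["destination-storage-class"]]
    return key, tier

# The handler would then run:
#   UPDATE archive_records SET storage_tier = %s WHERE s3_key = %s
```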
Integrity Verification
A nightly job re-computes the SHA-256 hash of archived objects and compares each to the stored content_hash (hot and warm tiers in full; cold and archive tiers sampled for cost control). Mismatches trigger a critical alert and open an incident for investigation.
Nightly integrity job (pseudo-code):
FOR EACH record IN archive_records WHERE storage_tier IN ('hot','warm'):
    stream = s3.get_object(record.s3_key)
    actual_hash = sha256(stream)
    IF actual_hash != record.content_hash:
        alert(severity='critical', record_id=record.id,
              expected=record.content_hash, actual=actual_hash)
-- Cold/archive tier: sample 1% nightly; full scan monthly (cost control)
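The hashing step should stream the object body in chunks so large objects never sit fully in memory. A sketch with the S3 read stubbed behind any file-like object:

```python
import hashlib

def sha256_of_stream(stream, chunk_size: int = 1 << 20) -> str:
    """Compute SHA-256 over a file-like object in 1 MiB chunks."""
    h = hashlib.sha256()
    while chunk := stream.read(chunk_size):
        h.update(chunk)
    return h.hexdigest()

def verify(record_hash: str, stream) -> bool:
    """A False result here is what triggers the critical alert in the nightly job."""
    return sha256_of_stream(stream) == record_hash
```

In practice `stream` would be the `Body` of an `s3.get_object` response, which exposes the same chunked `read` interface.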
Legal Hold
When legal_hold=true, the record is excluded from all deletion processes regardless of retention_until. Legal hold is set and cleared only through an authorized legal operations workflow with audit logging.
-- Deletion eligibility check:
SELECT * FROM archive_records
WHERE retention_until < now()
AND legal_hold = false
AND deleted_at IS NULL;
-- Legal hold prevents deletion even after retention expires.
-- Clearing hold requires explicit authorized action:
UPDATE archive_records SET legal_hold=false WHERE id=$1;
-- Audit log entry required for every legal_hold state change.
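Because the audit entry is mandatory, the hold change and the audit row should commit in the same transaction, so neither can exist without the other. A DB-API-style sketch (SQLite-flavored `?` placeholders; the audit_log columns are illustrative assumptions):

```python
def set_legal_hold(conn, record_id: int, hold: bool, actor: str, reason: str) -> None:
    """Flip legal_hold and write the audit entry atomically."""
    with conn:  # commits both statements on success, rolls back both on exception
        cur = conn.cursor()
        cur.execute(
            "UPDATE archive_records SET legal_hold = ? WHERE id = ?",
            (hold, record_id),
        )
        cur.execute(
            "INSERT INTO audit_log (record_id, action, actor, reason) "
            "VALUES (?, ?, ?, ?)",
            (record_id, f"legal_hold={hold}", actor, reason),
        )
```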
Retrieval SLA by Tier
Hot (S3 Standard)                → immediate, synchronous response
Warm (S3 Standard-IA)            → immediate, synchronous response
Cold (Glacier Instant Retrieval) → milliseconds, synchronous response
Archive (Glacier Deep Archive)   → up to 12 hours, async restore + notify
Deep Archive Retrieval Flow
1. Client requests retrieval → INSERT INTO retrieval_requests (status='pending')
2. Trigger an S3 Glacier restore: s3.restore_object(Bucket=..., Key=..., RestoreRequest={'Days': 2})
3. S3 emits an Object Restore Completed event (EventBridge) after up to ~12h
4. Lambda handles event:
UPDATE retrieval_requests SET status='available', available_at=now(),
expires_at=now()+INTERVAL '48 hours' WHERE archive_record_id=...
5. Notify requester (email/webhook)
6. Client downloads restored object within 48h window
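Step 2's restore call takes a structured RestoreRequest payload in boto3. A sketch that builds it without touching AWS (for Deep Archive, only the Standard tier, up to ~12h, and the Bulk tier, up to ~48h, are available; Expedited is not supported):

```python
def build_restore_request(days: int = 2, tier: str = "Standard") -> dict:
    """Build the RestoreRequest payload for boto3 s3.restore_object.

    `days` is how long the restored copy stays readable (days=2 matches
    the 48h download window above); `tier` selects restore speed.
    """
    if tier not in ("Standard", "Bulk"):
        raise ValueError("S3 Glacier Deep Archive supports only Standard and Bulk tiers")
    return {"Days": days, "GlacierJobParameters": {"Tier": tier}}

# The actual call (not executed here) would be:
#   s3.restore_object(Bucket=bucket, Key=record.s3_key,
#                     RestoreRequest=build_restore_request())
```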
Search Before Retrieval
Requesters search archive metadata before initiating a retrieval to avoid unnecessary restore costs on deep archive objects.
-- Metadata search (no S3 cost):
SELECT id, content_type, content_hash, storage_tier, size_bytes,
archived_at, retention_until, legal_hold
FROM archive_records
WHERE content_type=$1
AND archived_at BETWEEN $2 AND $3
AND deleted_at IS NULL;
-- Full-text search runs over indexed metadata fields only; no object is
-- retrieved (and no restore cost incurred) until the requester confirms.
Deletion After Retention
A scheduled job identifies eligible records, writes a tombstone entry to an audit log, deletes the S3 object, and then marks the database record with deleted_at. The row itself is retained as a soft-delete marker so the audit trail stays queryable.
Deletion job (runs nightly):
1. SELECT records WHERE retention_until < now()
AND legal_hold=false AND deleted_at IS NULL
2. FOR EACH record:
a. Write tombstone to audit_log (id, content_hash, deleted_at, reason)
b. s3.delete_object(record.s3_key)
c. UPDATE archive_records SET deleted_at=now() WHERE id=record.id
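The ordering of steps a-c matters: if the tombstone write fails, nothing is deleted; if a later step fails, a retry is safe because the S3 delete is idempotent and the soft-delete marker has not been set. A SQLite-flavored sketch of the loop (Postgres would use %s placeholders and `legal_hold = false`; the S3 client and connection are injected):

```python
def run_deletion_job(conn, s3, bucket: str) -> int:
    """Delete retention-expired records: tombstone, then S3 object, then soft-delete marker."""
    cur = conn.cursor()
    cur.execute(
        "SELECT id, s3_key, content_hash FROM archive_records "
        "WHERE retention_until < CURRENT_TIMESTAMP "
        "AND legal_hold = 0 AND deleted_at IS NULL"
    )
    deleted = 0
    for rec_id, s3_key, content_hash in cur.fetchall():
        with conn:  # tombstone and soft-delete marker commit together
            conn.execute(
                "INSERT INTO audit_log (record_id, content_hash, action, reason) "
                "VALUES (?, ?, 'tombstone', 'retention expired')",
                (rec_id, content_hash),
            )
            s3.delete_object(Bucket=bucket, Key=s3_key)
            conn.execute(
                "UPDATE archive_records SET deleted_at = CURRENT_TIMESTAMP WHERE id = ?",
                (rec_id,),
            )
        deleted += 1
    return deleted
```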