Low Level Design: Content Archival Service

Content Archival Service: Storage Tiers

The archival service manages long-term content retention across four S3 storage tiers with different cost and retrieval characteristics.

Tier Definitions

Hot     S3 Standard              ms retrieval     high cost    active content
Warm    S3 Standard-IA           ms retrieval     medium cost  infrequent access
Cold    S3 Glacier Instant       ms retrieval     low cost     rare access
Archive S3 Glacier Deep Archive  12h retrieval    lowest cost  compliance storage

Core Schema

ArchiveRecord Table

id             BIGSERIAL PRIMARY KEY
content_id     BIGINT NOT NULL
content_type   TEXT NOT NULL          -- 'financial', 'medical', 'chat', etc.
content_hash   CHAR(64) NOT NULL      -- SHA256 hex
storage_tier   TEXT NOT NULL          -- hot/warm/cold/archive
s3_key         TEXT NOT NULL
size_bytes     BIGINT
archived_at    TIMESTAMPTZ DEFAULT now()
retention_until TIMESTAMPTZ
legal_hold     BOOL DEFAULT false
deleted_at     TIMESTAMPTZ            -- soft delete marker

RetrievalRequest Table

id                BIGSERIAL PRIMARY KEY
archive_record_id BIGINT NOT NULL REFERENCES archive_records(id)
requester         TEXT NOT NULL
status            TEXT DEFAULT 'pending'  -- pending/available/expired
requested_at      TIMESTAMPTZ DEFAULT now()
available_at      TIMESTAMPTZ             -- when restore completes
expires_at        TIMESTAMPTZ             -- availability window end

Retention Policies

Retention duration is configurable per content type. The retention_until timestamp is computed at archive time.

Retention rules (examples):
  financial  → 7 years
  medical    → 10 years
  chat       → 3 years
  legal      → 10 years (often with legal_hold=true)

-- At archive time:
retention_until = archived_at + INTERVAL per content_type rule
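The retention computation can be sketched in Python; the rule values mirror the examples above, while the calendar-year helper is illustrative (in Postgres this would typically be `archived_at + make_interval(years => …)`):

```python
from datetime import datetime

# Retention years per content type (mirrors the example rules above).
RETENTION_YEARS = {"financial": 7, "medical": 10, "chat": 3, "legal": 10}

def retention_until(archived_at: datetime, content_type: str) -> datetime:
    """Compute retention_until at archive time via calendar-year addition."""
    years = RETENTION_YEARS[content_type]  # KeyError on unknown types: fail loudly
    try:
        return archived_at.replace(year=archived_at.year + years)
    except ValueError:  # Feb 29 archived_at landing in a non-leap target year
        return archived_at.replace(year=archived_at.year + years, day=28)
```

Computing the timestamp once at archive time keeps the deletion job a pure `retention_until < now()` comparison, with no per-type logic at delete time.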

Lifecycle Tier Transitions

S3 Lifecycle policies automate tier transitions based on object age, eliminating manual migration jobs for the majority of content.

S3 Lifecycle rules on archive bucket:
  Day 0        → S3 Standard (hot)
  Day 30       → S3 Standard-IA (warm)
  Day 90       → S3 Glacier Instant Retrieval (cold)
  Day 365      → S3 Glacier Deep Archive (archive)

Application updates storage_tier in ArchiveRecord via S3 lifecycle
event notifications (EventBridge → Lambda → UPDATE archive_records).
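A minimal sketch of that tier-sync Lambda, assuming the EventBridge "Object Storage Class Changed" event carries a `destination-storage-class` field (the event field names are an assumption to verify against the actual payload; `db` is an illustrative helper):

```python
# Map S3 storage classes to the tier labels used in archive_records.
CLASS_TO_TIER = {
    "STANDARD": "hot",
    "STANDARD_IA": "warm",
    "GLACIER_IR": "cold",
    "DEEP_ARCHIVE": "archive",
}

def tier_for(storage_class: str) -> str:
    return CLASS_TO_TIER[storage_class]  # unknown classes: fail loudly

def handler(event, db):
    """Sync storage_tier after a lifecycle transition event.
    Assumes the 'Object Storage Class Changed' detail layout (unverified)."""
    detail = event["detail"]
    key = detail["object"]["key"]
    tier = tier_for(detail["destination-storage-class"])
    db.execute("UPDATE archive_records SET storage_tier = %s WHERE s3_key = %s",
               (tier, key))
```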

Integrity Verification

A nightly job re-computes the SHA256 hash of each archived object and compares it to the stored content_hash. Mismatches trigger an alert and open an incident for investigation.

Nightly integrity job (Python/boto3 sketch; db and alert are illustrative helpers):
for record in db.query("SELECT id, s3_key, content_hash FROM archive_records "
                       "WHERE storage_tier IN ('hot', 'warm') AND deleted_at IS NULL"):
    body = s3.get_object(Bucket=ARCHIVE_BUCKET, Key=record.s3_key)["Body"]
    hasher = hashlib.sha256()
    for chunk in iter(lambda: body.read(1024 * 1024), b""):  # stream; never buffer whole object
        hasher.update(chunk)
    if hasher.hexdigest() != record.content_hash:
        alert(severity='critical', record_id=record.id,
              expected=record.content_hash, actual=hasher.hexdigest())

-- Cold/archive tier: sample 1% nightly; full scan monthly (cost control)
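The 1% sample can be made deterministic by keying on content_hash, so nightly rotation eventually touches every cold/archive object rather than relying on random draws; the bucket-selection scheme below is illustrative (the monthly full scan still runs as a backstop):

```python
def in_nightly_sample(content_hash: str, day_number: int) -> bool:
    """Deterministic ~1% nightly sample: each object lands in one of 100
    buckets derived from its SHA256 hex prefix; the job walks one bucket
    per night, covering the whole tier roughly every 100 days."""
    return int(content_hash[:8], 16) % 100 == day_number % 100
```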

Legal Hold

When legal_hold=true, the record is excluded from all deletion processes regardless of retention_until. Legal hold is set and cleared only through an authorized legal operations workflow with audit logging.

-- Deletion eligibility check:
SELECT * FROM archive_records
WHERE retention_until < now()
  AND legal_hold = false
  AND deleted_at IS NULL;

-- Legal hold prevents deletion even after retention expires.
-- Clearing hold requires explicit authorized action:
UPDATE archive_records SET legal_hold=false WHERE id=$1;
-- Audit log entry required for every legal_hold state change.

Retrieval SLA by Tier

Hot   (S3 Standard)         → immediate, synchronous response
Warm  (S3-IA)               → immediate, synchronous response
Cold  (Glacier Instant)     → milliseconds, synchronous response (higher per-GB retrieval cost)
Archive (Glacier Deep)      → 12 hours, async restore + notify

Deep Archive Retrieval Flow

1. Client requests retrieval → INSERT INTO retrieval_requests (status='pending')
2. Trigger S3 Glacier restore: s3.restore_object(key, Days=2)
3. S3 fires RestoreCompleted event (EventBridge) after ~12h
4. Lambda handles event:
   UPDATE retrieval_requests SET status='available', available_at=now(),
       expires_at=now()+INTERVAL '48 hours' WHERE archive_record_id=...
5. Notify requester (email/webhook)
6. Client downloads restored object within 48h window
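Steps 1 and 2 of the flow above can be sketched with boto3. The `RestoreRequest` shape is the documented `restore_object` parameter (Standard tier restores from Deep Archive in up to ~12 hours); the bucket name and `db` helper are illustrative:

```python
ARCHIVE_BUCKET = "content-archive"  # illustrative bucket name

def build_restore_request(days: int = 2, tier: str = "Standard") -> dict:
    """RestoreRequest payload for s3.restore_object: keep the restored
    copy for `days` days, using the given Glacier retrieval tier."""
    return {"Days": days, "GlacierJobParameters": {"Tier": tier}}

def request_retrieval(db, s3, record, requester: str):
    """Steps 1-2: record the pending request, then start the restore."""
    db.execute(
        "INSERT INTO retrieval_requests (archive_record_id, requester, status) "
        "VALUES (%s, %s, 'pending')",
        (record.id, requester),
    )
    s3.restore_object(Bucket=ARCHIVE_BUCKET, Key=record.s3_key,
                      RestoreRequest=build_restore_request())
```

The two-day restore window leaves headroom beyond the 48-hour availability window tracked in `expires_at`.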

Search Before Retrieval

Requesters search archive metadata before initiating a retrieval to avoid unnecessary restore costs on deep archive objects.

-- Metadata search (no S3 cost):
SELECT id, content_type, content_hash, storage_tier, size_bytes,
       archived_at, retention_until, legal_hold
FROM archive_records
WHERE content_type=$1
  AND archived_at BETWEEN $2 AND $3
  AND deleted_at IS NULL;
-- Runs against indexed metadata columns only; no object retrieval until the requester confirms.

Deletion After Retention

A scheduled job identifies eligible records, hard-deletes the S3 object, and soft-deletes the database record via the deleted_at marker, with a tombstone entry written to an audit log before deletion.

Deletion job (runs nightly):
1. SELECT records WHERE retention_until < now()
       AND legal_hold=false AND deleted_at IS NULL
2. FOR EACH record:
   a. Write tombstone to audit_log (id, content_hash, deleted_at, reason)
   b. s3.delete_object(record.s3_key)
   c. UPDATE archive_records SET deleted_at=now() WHERE id=record.id
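The eligibility check in step 1 reduces to a pure predicate, sketched here so the legal-hold and soft-delete exclusions are explicit (`Record` is an illustrative container for the relevant columns):

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class Record:  # illustrative subset of an archive_records row
    retention_until: Optional[datetime]
    legal_hold: bool
    deleted_at: Optional[datetime]

def deletion_eligible(r: Record, now: datetime) -> bool:
    """Mirror of the SQL eligibility query: expired, no hold, not yet deleted."""
    return (r.retention_until is not None
            and r.retention_until < now
            and not r.legal_hold
            and r.deleted_at is None)
```

Writing the tombstone before the S3 delete (step 2a) means a crash between steps 2a and 2b leaves an auditable trail rather than a silently missing object.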


