Requirements and Constraints
A compliance reporting system automates the collection, aggregation, and delivery of evidence required by regulatory frameworks such as SOC 2, PCI-DSS, HIPAA, and GDPR. Functional requirements include scheduled and ad-hoc data collection from internal systems (access logs, configuration snapshots, vulnerability scans), transformation into structured report formats, generation of PDF and machine-readable outputs, and storage of evidence artifacts with chain-of-custody metadata.
Non-functional constraints are demanding. Reports must be reproducible: re-running a report for the same period must produce bit-identical output. Evidence artifacts must be immutable once stored. Collection jobs must be idempotent — retries cannot create duplicate records. The system must support audit trails of who ran which report and when, and must enforce role-based access so auditors can read but never modify evidence.
Scale Assumptions
- 50–200 distinct control checks, each collecting data from 5–20 source systems
- Scheduled collection: hourly for high-frequency controls, daily for most, quarterly for manual evidence
- Report generation: up to 500-page PDFs with embedded evidence tables
- Evidence artifact storage: terabytes per year, 7-year retention
Core Data Model
Four primary entities drive the schema:
-- Defines what to collect and from where
CREATE TABLE controls (
id UUID PRIMARY KEY,
framework VARCHAR(64) NOT NULL, -- SOC2, PCI, HIPAA
control_id VARCHAR(64) NOT NULL, -- e.g. CC6.1
name TEXT NOT NULL,
description TEXT,
collection_fn VARCHAR(256) NOT NULL, -- handler identifier
schedule VARCHAR(64) NOT NULL, -- cron expression
UNIQUE (framework, control_id)
);
-- One row per collection run per control
CREATE TABLE evidence_records (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
control_id UUID NOT NULL REFERENCES controls(id),
collected_at TIMESTAMPTZ NOT NULL DEFAULT now(),
period_start TIMESTAMPTZ NOT NULL,
period_end TIMESTAMPTZ NOT NULL,
collector_ver VARCHAR(32) NOT NULL,
raw_payload JSONB NOT NULL,
artifact_s3_key TEXT,
status VARCHAR(32) NOT NULL, -- pending, collected, failed
content_hash CHAR(64) NOT NULL,
collected_by VARCHAR(128) NOT NULL -- service account identity
);
-- Report definitions and generated instances
CREATE TABLE report_runs (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
report_type VARCHAR(64) NOT NULL,
period_start TIMESTAMPTZ NOT NULL,
period_end TIMESTAMPTZ NOT NULL,
requested_by VARCHAR(128) NOT NULL,
requested_at TIMESTAMPTZ NOT NULL DEFAULT now(),
status VARCHAR(32) NOT NULL,
output_s3_key TEXT,
evidence_snapshot JSONB -- IDs and hashes at time of report generation
);
Data Collection Architecture
Collection is implemented as a plugin system. Each control references a collection_fn identifier that maps to a registered handler class. Handlers implement a standard interface: collect(period_start, period_end) -> EvidencePayload. Common handler types include:
- API collectors: Query internal REST APIs (access management, SIEM, vulnerability scanner)
- Database collectors: Run read-only SQL against source systems via dedicated service accounts
- Config snapshot collectors: Pull infrastructure state from Terraform state files or cloud provider APIs
- Log aggregators: Query Elasticsearch or Splunk for summarized log statistics
A scheduler (backed by PostgreSQL advisory locks or a distributed lock like Redis) triggers handlers on their cron schedule. Each run creates an evidence_records row with status pending, executes the handler, stores the raw payload as JSONB, uploads a canonical JSON rendering to S3 with server-side encryption, computes SHA256(canonical_json) as the content hash, and transitions status to collected. On failure the record transitions to failed with error details, and the scheduler retries with exponential backoff.
Report Generation Pipeline
When a report is requested, the pipeline executes in stages:
- Evidence snapshot: Collect all
evidence_recordsrelevant to the report period. Record their IDs and content hashes inreport_runs.evidence_snapshot. This snapshot is immutable — the report always refers to exactly this evidence even if newer collections arrive. - Transformation: Map raw evidence payloads through control-specific transformers that produce structured finding objects (pass/fail/partial, supporting data, remediation notes).
- Template rendering: A report template engine (e.g., WeasyPrint for PDF, Jinja2 for HTML/XLSX) merges findings with narrative templates. Templates are versioned in git and referenced by SHA in the report run.
- Artifact upload: The rendered PDF and machine-readable JSON are uploaded to S3 with object lock. The S3 key encodes the report type, period, and run ID for deterministic lookup.
- Verification: A post-generation step downloads the artifact, recomputes its hash, and stores it in
report_runs. Any future download can be verified against this hash.
API Design
POST /v1/reports -- request report generation (async)
GET /v1/reports/{id} -- poll status, get download URL when ready
GET /v1/controls -- list controls with last collection status
POST /v1/controls/{id}/collect -- trigger ad-hoc collection
GET /v1/evidence/{id} -- retrieve evidence record with artifact URL
GET /v1/evidence?control_id=&from=&to= -- list evidence for a control and period
Scalability and Reliability Considerations
- Idempotency: Collection runs are keyed on
(control_id, period_start, period_end)with a unique constraint. Duplicate triggers produce a no-op after checking for an existingcollectedrecord. - Parallelism: A worker pool runs up to N collection handlers concurrently, with per-source-system rate limiting to avoid overwhelming upstream APIs.
- Report reproducibility: Pinning evidence by snapshot and template by git SHA guarantees identical output. Random seeds in any sampling must be seeded from the report run ID.
- Access control: Auditor roles are read-only via IAM policies on S3 and RBAC on the API. Evidence artifact URLs are pre-signed with short TTLs (15 minutes).
- Retention: S3 object lock (compliance mode) prevents deletion. A lifecycle rule transitions artifacts to Glacier after 1 year. Legal-hold workflow requires multi-approver sign-off before any lock is removed.
See also: Scale AI Interview Guide 2026: Data Infrastructure, RLHF Pipelines, and ML Engineering
See also: Stripe Interview Guide 2026: Process, Bug Bash Round, and Payment Systems