Compliance Reporting System Low-Level Design: Data Collection, Report Generation, and Evidence Storage

Requirements and Constraints

A compliance reporting system automates the collection, aggregation, and delivery of evidence required by regulatory frameworks such as SOC 2, PCI-DSS, HIPAA, and GDPR. Functional requirements include scheduled and ad-hoc data collection from internal systems (access logs, configuration snapshots, vulnerability scans), transformation into structured report formats, generation of PDF and machine-readable outputs, and storage of evidence artifacts with chain-of-custody metadata.

Non-functional constraints are demanding. Reports must be reproducible: re-running a report for the same period must produce bit-identical output. Evidence artifacts must be immutable once stored. Collection jobs must be idempotent — retries cannot create duplicate records. The system must support audit trails of who ran which report and when, and must enforce role-based access so auditors can read but never modify evidence.

Scale Assumptions

  • 50–200 distinct control checks, each collecting data from 5–20 source systems
  • Scheduled collection: hourly for high-frequency controls, daily for most, quarterly for manual evidence
  • Report generation: up to 500-page PDFs with embedded evidence tables
  • Evidence artifact storage: terabytes per year, 7-year retention

Core Data Model

Four primary entities drive the schema:

-- Defines what to collect and from where
CREATE TABLE controls (
  id            UUID PRIMARY KEY,
  framework     VARCHAR(64) NOT NULL,   -- SOC2, PCI, HIPAA
  control_id    VARCHAR(64) NOT NULL,   -- e.g. CC6.1
  name          TEXT        NOT NULL,
  description   TEXT,
  collection_fn VARCHAR(256) NOT NULL,  -- handler identifier
  schedule      VARCHAR(64) NOT NULL,   -- cron expression
  UNIQUE (framework, control_id)
);

-- One row per collection run per control
CREATE TABLE evidence_records (
  id             UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  control_id     UUID NOT NULL REFERENCES controls(id),
  collected_at   TIMESTAMPTZ NOT NULL DEFAULT now(),
  period_start   TIMESTAMPTZ NOT NULL,
  period_end     TIMESTAMPTZ NOT NULL,
  collector_ver  VARCHAR(32) NOT NULL,
  raw_payload    JSONB       NOT NULL,
  artifact_s3_key TEXT,
  status         VARCHAR(32) NOT NULL,  -- pending, collected, failed
  content_hash   CHAR(64)    NOT NULL,
  collected_by   VARCHAR(128) NOT NULL  -- service account identity
);

-- Report definitions and generated instances
CREATE TABLE report_runs (
  id             UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  report_type    VARCHAR(64)  NOT NULL,
  period_start   TIMESTAMPTZ  NOT NULL,
  period_end     TIMESTAMPTZ  NOT NULL,
  requested_by   VARCHAR(128) NOT NULL,
  requested_at   TIMESTAMPTZ  NOT NULL DEFAULT now(),
  status         VARCHAR(32)  NOT NULL,
  output_s3_key  TEXT,
  evidence_snapshot JSONB     -- IDs and hashes at time of report generation
);

Data Collection Architecture

Collection is implemented as a plugin system. Each control references a collection_fn identifier that maps to a registered handler class. Handlers implement a standard interface: collect(period_start, period_end) -> EvidencePayload. Common handler types include:

  • API collectors: Query internal REST APIs (access management, SIEM, vulnerability scanner)
  • Database collectors: Run read-only SQL against source systems via dedicated service accounts
  • Config snapshot collectors: Pull infrastructure state from Terraform state files or cloud provider APIs
  • Log aggregators: Query Elasticsearch or Splunk for summarized log statistics

A scheduler (backed by PostgreSQL advisory locks or a distributed lock like Redis) triggers handlers on their cron schedule. Each run creates an evidence_records row with status pending, executes the handler, stores the raw payload as JSONB, uploads a canonical JSON rendering to S3 with server-side encryption, computes SHA256(canonical_json) as the content hash, and transitions status to collected. On failure the record transitions to failed with error details, and the scheduler retries with exponential backoff.

Report Generation Pipeline

When a report is requested, the pipeline executes in stages:

  1. Evidence snapshot: Collect all evidence_records relevant to the report period. Record their IDs and content hashes in report_runs.evidence_snapshot. This snapshot is immutable — the report always refers to exactly this evidence even if newer collections arrive.
  2. Transformation: Map raw evidence payloads through control-specific transformers that produce structured finding objects (pass/fail/partial, supporting data, remediation notes).
  3. Template rendering: A report template engine (e.g., WeasyPrint for PDF, Jinja2 for HTML/XLSX) merges findings with narrative templates. Templates are versioned in git and referenced by SHA in the report run.
  4. Artifact upload: The rendered PDF and machine-readable JSON are uploaded to S3 with object lock. The S3 key encodes the report type, period, and run ID for deterministic lookup.
  5. Verification: A post-generation step downloads the artifact, recomputes its hash, and stores it in report_runs. Any future download can be verified against this hash.

API Design

POST /v1/reports             -- request report generation (async)
GET  /v1/reports/{id}       -- poll status, get download URL when ready
GET  /v1/controls           -- list controls with last collection status
POST /v1/controls/{id}/collect -- trigger ad-hoc collection
GET  /v1/evidence/{id}      -- retrieve evidence record with artifact URL
GET  /v1/evidence?control_id=&from=&to= -- list evidence for a control and period

Scalability and Reliability Considerations

  • Idempotency: Collection runs are keyed on (control_id, period_start, period_end) with a unique constraint. Duplicate triggers produce a no-op after checking for an existing collected record.
  • Parallelism: A worker pool runs up to N collection handlers concurrently, with per-source-system rate limiting to avoid overwhelming upstream APIs.
  • Report reproducibility: Pinning evidence by snapshot and template by git SHA guarantees identical output. Random seeds in any sampling must be seeded from the report run ID.
  • Access control: Auditor roles are read-only via IAM policies on S3 and RBAC on the API. Evidence artifact URLs are pre-signed with short TTLs (15 minutes).
  • Retention: S3 object lock (compliance mode) prevents deletion. A lifecycle rule transitions artifacts to Glacier after 1 year. Legal-hold workflow requires multi-approver sign-off before any lock is removed.

See also: Netflix Interview Guide 2026: Streaming Architecture, Recommendation Systems, and Engineering Excellence

See also: Scale AI Interview Guide 2026: Data Infrastructure, RLHF Pipelines, and ML Engineering

See also: Stripe Interview Guide 2026: Process, Bug Bash Round, and Payment Systems

Scroll to Top