How do you design an automated data collection pipeline for compliance reporting?

An automated data collection pipeline polls or subscribes to source systems—databases, audit logs, and third-party APIs—on a defined schedule, normalizes records into a canonical schema, and writes them to an append-only staging store. Each ingestion run records its source, timestamp, and row counts so gaps or duplicates can be detected. Data quality checks gate promotion to the reporting layer, ensuring only validated records influence compliance outputs.

How do you ensure reproducible report generation in a compliance system?

Reproducibility requires pinning every variable that influences output: the report logic is tagged with a git SHA at generation time, input datasets are referenced by an immutable snapshot ID, and dependency versions are locked in a manifest. Re-running the pipeline with the same SHA and snapshot ID must produce byte-identical output. The git SHA and snapshot ID are stored alongside the report artifact so any future audit can reconstruct the exact execution environment.

How should evidence artifacts be stored immutably in a compliance reporting system?

Evidence artifacts—raw exports, signed PDFs, and supporting logs—are written to object storage with Object Lock or WORM (Write Once Read Many) policies that prevent modification or deletion for a defined retention period. Each artifact is identified by a content-addressed hash stored in a metadata database alongside upload timestamp, uploader identity, and report run ID. This chain of custody satisfies auditor requirements and prevents tampering even by administrators.

How do you implement scheduled delivery with an audit trail in a compliance reporting system?

A scheduler triggers report generation jobs at configured intervals and records each execution in an audit log table with job ID, target recipients, scheduled time, actual dispatch time, and delivery status. Delivery receipts or API acknowledgments are appended to the same log row. If a delivery fails, the system retries with exponential backoff and logs each attempt separately. The audit trail is append-only and queryable so compliance officers can prove timely, complete distribution to regulators.

Compliance Reporting System Low-Level Design: Data Collection, Report Generation, and Evidence Storage

⏱ 6 min read

Requirements and Constraints

A compliance reporting system automates the collection, aggregation, and delivery of evidence required by regulatory frameworks such as SOC 2, PCI-DSS, HIPAA, and GDPR. Functional requirements include scheduled and ad-hoc data collection from internal systems (access logs, configuration snapshots, vulnerability scans), transformation into structured report formats, generation of PDF and machine-readable outputs, and storage of evidence artifacts with chain-of-custody metadata.

Non-functional constraints are demanding. Reports must be reproducible: re-running a report for the same period must produce bit-identical output. Evidence artifacts must be immutable once stored. Collection jobs must be idempotent — retries cannot create duplicate records. The system must support audit trails of who ran which report and when, and must enforce role-based access so auditors can read but never modify evidence.

Scale Assumptions

50–200 distinct control checks, each collecting data from 5–20 source systems
Scheduled collection: hourly for high-frequency controls, daily for most, quarterly for manual evidence
Report generation: up to 500-page PDFs with embedded evidence tables
Evidence artifact storage: terabytes per year, 7-year retention

Core Data Model

Four primary entities drive the schema:

-- Defines what to collect and from where
CREATE TABLE controls (
  id            UUID PRIMARY KEY,
  framework     VARCHAR(64) NOT NULL,   -- SOC2, PCI, HIPAA
  control_id    VARCHAR(64) NOT NULL,   -- e.g. CC6.1
  name          TEXT        NOT NULL,
  description   TEXT,
  collection_fn VARCHAR(256) NOT NULL,  -- handler identifier
  schedule      VARCHAR(64) NOT NULL,   -- cron expression
  UNIQUE (framework, control_id)
);

-- One row per collection run per control
CREATE TABLE evidence_records (
  id             UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  control_id     UUID NOT NULL REFERENCES controls(id),
  collected_at   TIMESTAMPTZ NOT NULL DEFAULT now(),
  period_start   TIMESTAMPTZ NOT NULL,
  period_end     TIMESTAMPTZ NOT NULL,
  collector_ver  VARCHAR(32) NOT NULL,
  raw_payload    JSONB       NOT NULL,
  artifact_s3_key TEXT,
  status         VARCHAR(32) NOT NULL,  -- pending, collected, failed
  content_hash   CHAR(64)    NOT NULL,
  collected_by   VARCHAR(128) NOT NULL  -- service account identity
);

-- Report definitions and generated instances
CREATE TABLE report_runs (
  id             UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  report_type    VARCHAR(64)  NOT NULL,
  period_start   TIMESTAMPTZ  NOT NULL,
  period_end     TIMESTAMPTZ  NOT NULL,
  requested_by   VARCHAR(128) NOT NULL,
  requested_at   TIMESTAMPTZ  NOT NULL DEFAULT now(),
  status         VARCHAR(32)  NOT NULL,
  output_s3_key  TEXT,
  evidence_snapshot JSONB     -- IDs and hashes at time of report generation
);

Data Collection Architecture

Collection is implemented as a plugin system. Each control references a collection_fn identifier that maps to a registered handler class. Handlers implement a standard interface: collect(period_start, period_end) -> EvidencePayload. Common handler types include:

API collectors: Query internal REST APIs (access management, SIEM, vulnerability scanner)
Database collectors: Run read-only SQL against source systems via dedicated service accounts
Config snapshot collectors: Pull infrastructure state from Terraform state files or cloud provider APIs
Log aggregators: Query Elasticsearch or Splunk for summarized log statistics

A scheduler (backed by PostgreSQL advisory locks or a distributed lock like Redis) triggers handlers on their cron schedule. Each run creates an evidence_records row with status pending, executes the handler, stores the raw payload as JSONB, uploads a canonical JSON rendering to S3 with server-side encryption, computes SHA256(canonical_json) as the content hash, and transitions status to collected. On failure the record transitions to failed with error details, and the scheduler retries with exponential backoff.

Report Generation Pipeline

When a report is requested, the pipeline executes in stages:

Evidence snapshot: Collect all evidence_records relevant to the report period. Record their IDs and content hashes in report_runs.evidence_snapshot. This snapshot is immutable — the report always refers to exactly this evidence even if newer collections arrive.
Transformation: Map raw evidence payloads through control-specific transformers that produce structured finding objects (pass/fail/partial, supporting data, remediation notes).
Template rendering: A report template engine (e.g., WeasyPrint for PDF, Jinja2 for HTML/XLSX) merges findings with narrative templates. Templates are versioned in git and referenced by SHA in the report run.
Artifact upload: The rendered PDF and machine-readable JSON are uploaded to S3 with object lock. The S3 key encodes the report type, period, and run ID for deterministic lookup.
Verification: A post-generation step downloads the artifact, recomputes its hash, and stores it in report_runs. Any future download can be verified against this hash.

API Design

POST /v1/reports             -- request report generation (async)
GET  /v1/reports/{id}       -- poll status, get download URL when ready
GET  /v1/controls           -- list controls with last collection status
POST /v1/controls/{id}/collect -- trigger ad-hoc collection
GET  /v1/evidence/{id}      -- retrieve evidence record with artifact URL
GET  /v1/evidence?control_id=&from=&to= -- list evidence for a control and period

Scalability and Reliability Considerations

Idempotency: Collection runs are keyed on (control_id, period_start, period_end) with a unique constraint. Duplicate triggers produce a no-op after checking for an existing collected record.
Parallelism: A worker pool runs up to N collection handlers concurrently, with per-source-system rate limiting to avoid overwhelming upstream APIs.
Report reproducibility: Pinning evidence by snapshot and template by git SHA guarantees identical output. Random seeds in any sampling must be seeded from the report run ID.
Access control: Auditor roles are read-only via IAM policies on S3 and RBAC on the API. Evidence artifact URLs are pre-signed with short TTLs (15 minutes).
Retention: S3 object lock (compliance mode) prevents deletion. A lifecycle rule transitions artifacts to Glacier after 1 year. Legal-hold workflow requires multi-approver sign-off before any lock is removed.