Low Level Design: Access Review Service

Problem Statement

Design an access review service that enables organizations to periodically certify user entitlements. The service must snapshot current entitlements, assign reviewers, manage the decision workflow (approve/revoke), automatically revoke access when reviewers do not respond by a deadline, and produce compliance audit reports for SOC 2, ISO 27001, and similar frameworks.

Requirements

Functional Requirements

Initiate an access review campaign targeting a set of users and/or permissions.
Snapshot user entitlements at campaign start time (immutable baseline for the review).
Assign each entitlement line item to a reviewer (manager, resource owner, or explicit assignment).
Accept reviewer decisions: approve, revoke, or escalate.
Auto-revoke entitlements that have not received a decision by the campaign deadline.
Propagate revoke decisions to the authoritative identity/access systems.
Generate per-campaign audit reports exportable as PDF or CSV.

Non-Functional Requirements

Support campaigns with up to 1 million entitlement line items.
Reviewer notification delivery within 2 minutes of campaign start.
Decision write latency < 100 ms P99.
Audit reports generated within 5 minutes of campaign close.
All entitlement snapshots and decisions retained for 7 years for compliance.

High-Level Architecture

Campaign Manager (API + scheduler)
       |
  Entitlement Snapshot Job
       |-- reads Identity Provider / RBAC systems
       |-- writes Snapshot Store (Postgres + S3)
       |
  Reviewer Assignment Engine
       |-- reads Manager Graph (HR system / LDAP)
       |-- writes ReviewItem table
       |
  Decision API (reviewer-facing)
       |
  Deadline Enforcer (cron / timer service)
       |-- auto-revokes on expiry
       |
  Provisioning Dispatcher
       |-- calls downstream IAM / RBAC APIs
       |
  Audit Report Generator
       |-- reads Snapshot + Decision tables
       |-- writes PDF/CSV to S3

Data Model

campaign

CREATE TABLE campaign (
  id              BIGSERIAL PRIMARY KEY,
  name            TEXT NOT NULL,
  status          TEXT NOT NULL    -- 'draft' | 'active' | 'closed' | 'cancelled'
                  CHECK (status IN ('draft','active','closed','cancelled')),
  scope           JSONB NOT NULL,  -- {users: [...], resources: [...], roles: [...]}
  deadline        TIMESTAMPTZ NOT NULL,
  auto_revoke     BOOLEAN NOT NULL DEFAULT true,
  created_by      BIGINT NOT NULL,
  created_at      TIMESTAMPTZ DEFAULT now(),
  closed_at       TIMESTAMPTZ
);

entitlement_snapshot

CREATE TABLE entitlement_snapshot (
  id              BIGSERIAL PRIMARY KEY,
  campaign_id     BIGINT REFERENCES campaign(id),
  user_id         TEXT NOT NULL,
  resource_id     TEXT NOT NULL,
  permission      TEXT NOT NULL,
  granted_at      TIMESTAMPTZ,
  granted_by      TEXT,
  snapshotted_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
  UNIQUE (campaign_id, user_id, resource_id, permission)
);

The snapshot is append-only and immutable after creation. It represents the state of the world at campaign start time, independent of any subsequent access changes. This is critical: a reviewer is certifying what existed at snapshot time, not the live state.

review_item

CREATE TABLE review_item (
  id              BIGSERIAL PRIMARY KEY,
  campaign_id     BIGINT REFERENCES campaign(id),
  snapshot_id     BIGINT REFERENCES entitlement_snapshot(id),
  reviewer_id     TEXT NOT NULL,
  status          TEXT NOT NULL    -- 'pending' | 'approved' | 'revoked' | 'escalated' | 'auto_revoked'
                  CHECK (status IN ('pending','approved','revoked','escalated','auto_revoked')),
  decided_at      TIMESTAMPTZ,
  decided_by      TEXT,
  comment         TEXT,
  escalated_to    TEXT,
  reminder_count  INT NOT NULL DEFAULT 0
);
CREATE INDEX ON review_item (campaign_id, reviewer_id, status);
CREATE INDEX ON review_item (campaign_id, status);

provisioning_action

CREATE TABLE provisioning_action (
  id              BIGSERIAL PRIMARY KEY,
  review_item_id  BIGINT REFERENCES review_item(id),
  action          TEXT NOT NULL    -- 'revoke' | 'approve_no_op'
                  CHECK (action IN ('revoke','approve_no_op')),
  target_system   TEXT NOT NULL,
  status          TEXT NOT NULL    -- 'pending' | 'completed' | 'failed' | 'retrying'
                  CHECK (status IN ('pending','completed','failed','retrying')),
  attempts        INT NOT NULL DEFAULT 0,
  last_error      TEXT,
  completed_at    TIMESTAMPTZ
);

Core Components

1. Entitlement Snapshot Job

When an administrator launches a draft campaign, the snapshot job runs synchronously for small scopes (fewer than 10,000 items) or as an asynchronous batch job for larger scopes. The campaign becomes active only after the job completes successfully. The job:

Reads the campaign scope definition (target users, resources, roles).
Queries each registered entitlement source (LDAP groups, RBAC policy engine, OAuth scopes, AWS IAM role assignments) via registered adapters.
Deduplicates and writes rows to entitlement_snapshot in bulk with COPY or batched inserts.
Archives the full snapshot as a compressed Parquet file to S3 for long-term retention.
Transitions campaign status to active and triggers reviewer assignment.

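Idempotency: the job takes a distributed lock (Redis SET NX) keyed on campaign_id so that two runs for the same campaign cannot overlap. The lock alone does not make a retry safe after a partial write; the UNIQUE (campaign_id, user_id, resource_id, permission) constraint on entitlement_snapshot ensures a retried job cannot store the same entitlement twice.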
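The deduplication step can be sketched in a few lines of Python. This is a minimal illustration, not the service's actual code: the Entitlement type, its source field, and the sample rows are hypothetical, and the dedup key simply mirrors the (user_id, resource_id, permission) part of the table's UNIQUE constraint.

```python
from typing import Iterable, NamedTuple


class Entitlement(NamedTuple):
    user_id: str
    resource_id: str
    permission: str
    source: str  # hypothetical: name of the adapter that reported the row


def deduplicate(rows: Iterable[Entitlement]) -> list[Entitlement]:
    """Keep the first row seen for each (user, resource, permission) triple."""
    seen: set[tuple[str, str, str]] = set()
    unique: list[Entitlement] = []
    for row in rows:
        key = (row.user_id, row.resource_id, row.permission)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Two adapters report the same entitlement; only one row should be stored.
rows = [
    Entitlement("alice", "repo:payments", "write", "github"),
    Entitlement("alice", "repo:payments", "write", "rbac"),
    Entitlement("bob", "bucket:logs", "read", "aws_iam"),
]
snapshot_rows = deduplicate(rows)
```

Deduplicating before the bulk write keeps the batch from tripping the UNIQUE constraint when two sources describe the same grant.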

2. Reviewer Assignment Engine

Assignment follows a configurable priority order:

Explicit override — a campaign-level assignment rule mapping resource/role to a specific reviewer.
Resource owner — the owner registered in the resource catalog.
Manager — the direct manager of the user holding the entitlement, looked up from the HR/LDAP manager graph.
Fallback — a default reviewer queue (e.g., the security team) for orphaned resources.
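The priority order above can be sketched as a simple first-match lookup. The AssignmentContext type and its dictionaries are hypothetical in-memory stand-ins for the campaign rules, the resource catalog, and the manager graph; a real engine would query those systems instead.

```python
from dataclasses import dataclass, field


@dataclass
class AssignmentContext:
    # Hypothetical lookups; keys and values are illustrative only.
    overrides: dict[str, str] = field(default_factory=dict)        # resource_id -> reviewer
    resource_owners: dict[str, str] = field(default_factory=dict)  # resource_id -> owner
    managers: dict[str, str] = field(default_factory=dict)         # user_id -> manager
    fallback_reviewer: str = "security-team"


def resolve_reviewer(ctx: AssignmentContext, user_id: str, resource_id: str) -> tuple[str, str]:
    """Return (reviewer_id, rule_name) for the first rule that yields a reviewer."""
    candidates = [
        ("override", ctx.overrides.get(resource_id)),
        ("resource_owner", ctx.resource_owners.get(resource_id)),
        ("manager", ctx.managers.get(user_id)),
    ]
    for rule, reviewer in candidates:
        if reviewer:
            return reviewer, rule
    # Orphaned resource and no known manager: route to the default queue.
    return ctx.fallback_reviewer, "fallback"


ctx = AssignmentContext(
    overrides={"db:prod": "dba-lead"},
    resource_owners={"db:prod": "owner-1", "repo:x": "owner-2"},
    managers={"alice": "carol"},
)
```

Returning the rule name alongside the reviewer is a design choice for this sketch: it lets the audit report explain why a given person was asked to certify an item.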

For campaigns at the upper end of the scale target (around 1 million items), assignment runs as a Spark job that joins the snapshot table with the manager graph and resource catalog, then bulk-inserts into review_item.

After assignment, a Kafka event review.assigned is emitted per reviewer. The notification service consumes this stream and sends batched email/Slack digests — a reviewer with 200 items gets one digest, not 200 individual emails.
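A minimal sketch of the digest batching follows, assuming the notification service receives (reviewer_id, review_item_id) pairs. The build_digests function and the digest dictionary shape are hypothetical.

```python
from collections import defaultdict


def build_digests(assignments: list[tuple[str, int]]) -> dict[str, dict]:
    """Group (reviewer_id, review_item_id) pairs into one digest per reviewer."""
    by_reviewer: dict[str, list[int]] = defaultdict(list)
    for reviewer_id, item_id in assignments:
        by_reviewer[reviewer_id].append(item_id)
    return {
        reviewer_id: {"reviewer_id": reviewer_id, "item_count": len(ids), "item_ids": ids}
        for reviewer_id, ids in by_reviewer.items()
    }


# A reviewer with 200 items gets a single digest entry.
assignments = [("carol", i) for i in range(200)] + [("dan", 900)]
digests = build_digests(assignments)
```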

3. Decision API

The decision API is reviewer-facing and optimized for high throughput (bulk decisions are common).

POST /campaigns/{campaign_id}/decisions
Body: [
  { "review_item_id": 123, "decision": "revoke", "comment": "no longer needed" },
  { "review_item_id": 124, "decision": "approve" },
  ...
]

Per-request validations:

Caller must be the assigned reviewer or have the campaign:admin role.
Campaign must be in active status.
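Item must be in pending or escalated status. A repeated decision that matches the value already recorded is accepted as a no-op, so client retries are idempotent; a conflicting decision on an already-decided item is rejected.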

Each accepted decision updates review_item.status and enqueues a provisioning action for revoke decisions. Approved decisions result in a no-op provisioning action (recorded for the audit trail).
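The validation rules can be sketched as a single function. This is an illustrative in-memory version under stated assumptions: the ReviewItem class, the apply_decision name, and the "applied"/"noop" return values are hypothetical, and the request verbs are taken to be approve, revoke, and escalate.

```python
from dataclasses import dataclass

# Maps the decision verb in the request to the resulting review_item.status.
DECISION_TO_STATUS = {"approve": "approved", "revoke": "revoked", "escalate": "escalated"}
OPEN_STATUSES = {"pending", "escalated"}


@dataclass
class ReviewItem:
    id: int
    reviewer_id: str
    status: str = "pending"


def apply_decision(item: ReviewItem, campaign_status: str, caller_id: str,
                   caller_roles: set[str], decision: str) -> str:
    """Validate and apply one decision; return 'applied' or 'noop'."""
    if decision not in DECISION_TO_STATUS:
        raise ValueError(f"unknown decision: {decision}")
    if caller_id != item.reviewer_id and "campaign:admin" not in caller_roles:
        raise PermissionError("caller is not the assigned reviewer")
    if campaign_status != "active":
        raise ValueError("campaign is not active")
    new_status = DECISION_TO_STATUS[decision]
    if item.status == new_status:
        return "noop"  # same decision repeated: safe to acknowledge again
    if item.status not in OPEN_STATUSES:
        raise ValueError("item already has a different decision")
    item.status = new_status
    return "applied"


item = ReviewItem(id=123, reviewer_id="carol")
first = apply_decision(item, "active", "carol", set(), "revoke")
second = apply_decision(item, "active", "carol", set(), "revoke")
```

In a real service the status check and update would run in one database transaction (for example, an UPDATE with the expected status in the WHERE clause) so that concurrent requests cannot both apply.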

4. Deadline Enforcer

A cron job runs every 5 minutes and queries:

SELECT id FROM campaign
WHERE status = 'active'
  AND deadline < now()
  AND auto_revoke = true;

For each expired campaign, a bulk update sets all pending review items to auto_revoked and enqueues revocation provisioning actions. The campaign is then transitioned to closed.

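The query above selects only campaigns with auto_revoke = true. For expired campaigns with auto_revoke = false, the enforcer runs the same query with the flag inverted, escalates pending items to the fallback reviewer queue, and sends alerts to campaign owners.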
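The two expiry behaviours can be sketched together. This is an in-memory illustration only: campaigns and items are plain dictionaries, and enforce_deadline and FALLBACK_QUEUE are hypothetical names. A production version would be a bulk UPDATE rather than a Python loop.

```python
from datetime import datetime, timezone

FALLBACK_QUEUE = "security-team"  # assumption: the default reviewer queue


def enforce_deadline(campaign: dict, items: list[dict], now: datetime) -> list[int]:
    """Apply the expiry action; return ids that need a revoke provisioning action."""
    if campaign["status"] != "active" or campaign["deadline"] >= now:
        return []
    to_revoke: list[int] = []
    for item in items:
        if item["status"] != "pending":
            continue  # decided items are never overwritten
        if campaign["auto_revoke"]:
            item["status"] = "auto_revoked"
            to_revoke.append(item["id"])
        else:
            item["status"] = "escalated"
            item["escalated_to"] = FALLBACK_QUEUE
    if campaign["auto_revoke"]:
        campaign["status"] = "closed"
    return to_revoke


now = datetime(2024, 2, 1, tzinfo=timezone.utc)
campaign = {"status": "active", "auto_revoke": True,
            "deadline": datetime(2024, 1, 31, tzinfo=timezone.utc)}
items = [{"id": 1, "status": "pending"}, {"id": 2, "status": "approved"}]
revoke_ids = enforce_deadline(campaign, items, now)
```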

Reminder schedule: at 75% and 90% of the campaign window, the enforcer increments reminder_count and emits a review.reminder event to the notification service for all reviewers with pending items.
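As a worked example of the reminder schedule, the sketch below computes the two reminder instants. It assumes the campaign window runs from activation time to the deadline; the reminder_times function is hypothetical.

```python
from datetime import datetime, timezone

REMINDER_FRACTIONS = (0.75, 0.90)


def reminder_times(start: datetime, deadline: datetime) -> list[datetime]:
    """Return the instants at 75% and 90% of the campaign window."""
    window = deadline - start
    return [start + window * fraction for fraction in REMINDER_FRACTIONS]


# A 20-day campaign: reminders fall on day 15 and day 18.
start = datetime(2024, 1, 1, tzinfo=timezone.utc)
deadline = datetime(2024, 1, 21, tzinfo=timezone.utc)
first_reminder, second_reminder = reminder_times(start, deadline)
```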

5. Provisioning Dispatcher

The provisioning dispatcher consumes the provisioning_action queue and executes revocations against downstream systems. Design priorities:

Idempotency — all provisioning calls use idempotency keys. If a revocation has already been applied (e.g., the user was already offboarded), the call succeeds silently.
Retry with backoff — transient failures (network timeouts, rate limits) are retried with exponential backoff up to 24 hours before escalating to an on-call alert.
Per-system adapters — each IAM system (Okta, AWS IAM, GitHub Org, in-house RBAC) has a registered adapter implementing a revoke(user_id, resource_id, permission) contract.
Partial failure handling — if a campaign has 10,000 revocations and 20 fail permanently, those 20 are flagged in the audit report as requiring manual remediation. The remaining 9,980 are marked completed.
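The retry policy can be sketched as a delay generator. Only the 24-hour budget comes from the design above; the 30-second base delay, the doubling factor, and the one-hour cap are illustrative assumptions, and jitter is omitted to keep the example deterministic.

```python
from typing import Iterator


def backoff_delays(base_seconds: float = 30.0, factor: float = 2.0,
                   max_delay_seconds: float = 3600.0,
                   budget_seconds: float = 24 * 3600.0) -> Iterator[float]:
    """Yield retry delays until the cumulative wait would exceed the budget."""
    elapsed = 0.0
    delay = base_seconds
    while elapsed + delay <= budget_seconds:
        yield delay
        elapsed += delay
        # Grow exponentially, but never wait longer than the cap.
        delay = min(delay * factor, max_delay_seconds)


delays = list(backoff_delays())
# Once the generator is exhausted, the dispatcher would mark the action
# failed and raise the on-call alert.
```

Capping the per-attempt delay keeps the dispatcher probing roughly hourly through a long outage instead of backing off to a single attempt near the end of the budget.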

6. Compliance Audit Report

The audit report generator produces per-campaign reports including:

Campaign metadata (scope, dates, reviewer assignments, deadline).
Summary statistics: total items, approved count, revoked count, auto-revoked count, completion rate by reviewer.
Per-item detail: user, resource, permission, reviewer, decision, timestamp, comment.
Provisioning action status for all revocations.
Exceptions: items that failed provisioning, items with escalation history.

Reports are generated as PDF (via a headless rendering service) and CSV, stored in S3 with a 7-year retention policy, and linked from the campaign record. Access to reports is restricted to users with the compliance:read role and logged in the access audit trail.
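The summary statistics and CSV export can be sketched with the standard library. The field list and function names are hypothetical, and the sketch assumes that only approved and revoked count toward a reviewer's completion rate, since auto-revoked items were never decided by the reviewer.

```python
import csv
import io
from collections import Counter, defaultdict

FIELDS = ["user_id", "resource_id", "permission", "reviewer_id",
          "status", "decided_at", "comment"]
REVIEWER_DECIDED = {"approved", "revoked"}  # assumption, see lead-in


def summarize(items: list[dict]) -> dict:
    """Compute campaign totals and completion rate per reviewer."""
    counts = Counter(item["status"] for item in items)
    per_reviewer = defaultdict(lambda: [0, 0])  # reviewer -> [decided, assigned]
    for item in items:
        stats = per_reviewer[item["reviewer_id"]]
        stats[1] += 1
        if item["status"] in REVIEWER_DECIDED:
            stats[0] += 1
    return {
        "total": len(items),
        "approved": counts["approved"],
        "revoked": counts["revoked"],
        "auto_revoked": counts["auto_revoked"],
        "completion_rate_by_reviewer": {r: d / a for r, (d, a) in per_reviewer.items()},
    }


def to_csv(items: list[dict]) -> str:
    """Render per-item detail rows as CSV text; missing fields become empty cells."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(items)
    return buffer.getvalue()


items = [
    {"user_id": "alice", "resource_id": "db:prod", "permission": "read",
     "reviewer_id": "carol", "status": "approved"},
    {"user_id": "bob", "resource_id": "db:prod", "permission": "write",
     "reviewer_id": "carol", "status": "revoked", "comment": "left team"},
    {"user_id": "erin", "resource_id": "repo:x", "permission": "admin",
     "reviewer_id": "dan", "status": "auto_revoked"},
    {"user_id": "frank", "resource_id": "repo:x", "permission": "read",
     "reviewer_id": "dan", "status": "approved"},
]
summary = summarize(items)
report_csv = to_csv(items)
```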

Scaling Considerations

Large Campaign Performance

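For campaigns with very large numbers of entitlement items, the reviewer-facing UI must paginate efficiently. The review_item index on (campaign_id, reviewer_id, status) narrows each query to one reviewer's items in a given status, and the UI loads 50 items at a time with keyset pagination: each request passes the last id it received as a cursor instead of an OFFSET. Appending id to that index would let Postgres return each page in id order without a separate sort step. Bulk-decision APIs allow reviewers to approve or revoke entire pages in one request.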
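Keyset pagination can be demonstrated end to end with an in-memory database. The sketch uses SQLite from the Python standard library as a stand-in for Postgres, with a trimmed-down review_item table; the fetch_page function and its cursor convention (None means no more pages) are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE review_item ("
    " id INTEGER PRIMARY KEY, campaign_id INTEGER,"
    " reviewer_id TEXT, status TEXT)"
)
conn.executemany(
    "INSERT INTO review_item (campaign_id, reviewer_id, status) VALUES (?, ?, ?)",
    [(1, "carol", "pending")] * 120,
)


def fetch_page(conn, campaign_id, reviewer_id, after_id=0, page_size=50):
    """Return (ids, next_cursor); pass next_cursor back as after_id for the next page."""
    rows = conn.execute(
        "SELECT id FROM review_item"
        " WHERE campaign_id = ? AND reviewer_id = ? AND status = 'pending'"
        "   AND id > ?"
        " ORDER BY id LIMIT ?",
        (campaign_id, reviewer_id, after_id, page_size),
    ).fetchall()
    ids = [row[0] for row in rows]
    # A short page means the reviewer's queue is exhausted.
    next_cursor = ids[-1] if len(ids) == page_size else None
    return ids, next_cursor


page1, cursor1 = fetch_page(conn, 1, "carol")
page2, cursor2 = fetch_page(conn, 1, "carol", after_id=cursor1)
page3, cursor3 = fetch_page(conn, 1, "carol", after_id=cursor2)
```

Because the cursor is a value rather than a row offset, pages stay stable even when items leave the pending set after a bulk decision.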

Snapshot Immutability and Storage

Snapshots are never updated after creation. The Postgres table handles campaigns up to ~10 million rows efficiently; for larger campaigns, snapshots are stored only in S3 as Parquet files and queried via Athena for reporting. The Postgres table holds only a reference (S3 URI) for large snapshots.

Preventing Double-Revocation

If a user loses access via another path (offboarding, manual revocation) between snapshot time and provisioning execution, the adapter detects the no-op condition (permission already absent) and marks the action completed without error. This prevents false provisioning failures in audit reports.
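The no-op detection can be sketched with a toy adapter. InMemoryRbacAdapter, execute_action, and the outcome strings are hypothetical; the adapter follows the revoke(user_id, resource_id, permission) contract described for the provisioning dispatcher.

```python
class InMemoryRbacAdapter:
    """Toy adapter: grants is a set of (user_id, resource_id, permission) tuples."""

    def __init__(self, grants):
        self.grants = set(grants)

    def revoke(self, user_id: str, resource_id: str, permission: str) -> str:
        key = (user_id, resource_id, permission)
        if key not in self.grants:
            return "already_absent"  # access was removed through another path
        self.grants.remove(key)
        return "revoked"


def execute_action(adapter, action: dict) -> dict:
    """Run one revoke action; both outcomes count as completed."""
    outcome = adapter.revoke(action["user_id"], action["resource_id"], action["permission"])
    action["status"] = "completed"
    action["outcome"] = outcome  # kept so the audit report can distinguish the two cases
    return action


adapter = InMemoryRbacAdapter({("alice", "db:prod", "read")})
action = {"user_id": "alice", "resource_id": "db:prod", "permission": "read"}
first = dict(execute_action(adapter, dict(action)))
second = dict(execute_action(adapter, dict(action)))
```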

Key Interview Discussion Points

Snapshot immutability: Reviewing against a snapshot rather than live state is essential. If access changes during the review window, the reviewer is still certifying the state at campaign start. This prevents gaming (revoking and re-granting to avoid review) and provides a clean audit baseline.
Auto-revoke vs. auto-escalate: Auto-revoke is the secure default for sensitive resources. For operational resources where accidental revocation is high-risk, auto-escalate to a human fallback is safer. The trade-off is compliance posture vs. availability risk.
Reviewer assignment quality: Poor assignment (reviewer doesn’t know the user) leads to rubber-stamp approvals. Strategies to improve quality: show last-login date and usage metrics alongside each entitlement, flag stale access (not used in 90 days), and require a comment for approve decisions on high-privilege entitlements.
Integration with IGA platforms: Production implementations often integrate with Identity Governance and Administration tools (SailPoint, Saviynt). The service described here is the custom core that can feed into or replace such tools, with the adapter pattern enabling incremental migration.
