KYC Service Low-Level Design: Identity Verification, Document Processing, and Compliance Workflow

What KYC Is and Why It's Hard

Know Your Customer (KYC) is the process by which a business verifies the identity of its customers. Financial regulation typically requires it before a customer can access financial services: anti-money-laundering (AML) laws, enforced by regulators such as FinCEN in the US and the FCA in the UK. The engineering challenge is that KYC combines unstructured inputs (photos of documents, selfies), third-party provider dependencies, regulatory requirements with strict audit trails, and a manual review workflow for edge cases that automated systems cannot resolve.

KYC Workflow States

Model KYC as a state machine to make transitions explicit and auditable:

UNVERIFIED: User registered but has not submitted documents
DOCUMENT_SUBMITTED: Documents uploaded, awaiting processing
PROCESSING: OCR extraction and third-party verification in progress
VERIFIED: Identity confirmed, user approved for full service
REJECTED: Identity could not be confirmed (document expired, face mismatch, sanctions hit)
MANUAL_REVIEW: Automated checks inconclusive, sent to compliance analyst queue

Every state transition is recorded in an immutable audit log with the actor (system, third-party provider, or human reviewer), timestamp, and reason code. The log is append-only: no recorded transition can be edited or deleted retroactively.
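A minimal sketch of the state machine and its audit log might look like the following. The transition table, the KycCase and AuditEntry names, and the reason codes are assumptions for illustration; the article lists the states but not which transitions are legal.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum


class KycState(Enum):
    UNVERIFIED = "UNVERIFIED"
    DOCUMENT_SUBMITTED = "DOCUMENT_SUBMITTED"
    PROCESSING = "PROCESSING"
    VERIFIED = "VERIFIED"
    REJECTED = "REJECTED"
    MANUAL_REVIEW = "MANUAL_REVIEW"


# Assumed transition table. Re-verification would add edges out of
# VERIFIED; they are left out to keep the sketch short.
ALLOWED = {
    KycState.UNVERIFIED: {KycState.DOCUMENT_SUBMITTED},
    KycState.DOCUMENT_SUBMITTED: {KycState.PROCESSING},
    KycState.PROCESSING: {
        KycState.VERIFIED, KycState.REJECTED, KycState.MANUAL_REVIEW,
    },
    KycState.MANUAL_REVIEW: {
        KycState.VERIFIED, KycState.REJECTED, KycState.DOCUMENT_SUBMITTED,
    },
    KycState.VERIFIED: set(),
    KycState.REJECTED: set(),
}


@dataclass(frozen=True)  # frozen: an entry cannot be mutated once written
class AuditEntry:
    user_id: str
    from_state: KycState
    to_state: KycState
    actor: str        # e.g. "system", "provider:<name>", "analyst:<id>"
    reason_code: str
    at: datetime


class KycCase:
    def __init__(self, user_id: str) -> None:
        self.user_id = user_id
        self.state = KycState.UNVERIFIED
        self._log: list = []

    def transition(self, to_state: KycState, actor: str, reason_code: str) -> AuditEntry:
        """Apply a transition if it is legal and append it to the audit log."""
        if to_state not in ALLOWED[self.state]:
            raise ValueError(f"illegal transition {self.state.name} -> {to_state.name}")
        entry = AuditEntry(self.user_id, self.state, to_state, actor,
                           reason_code, datetime.now(timezone.utc))
        self._log.append(entry)
        self.state = to_state
        return entry

    @property
    def audit_log(self) -> tuple:
        # Expose a read-only copy; callers can only add entries via transition().
        return tuple(self._log)
```

In a real service the log would live in an append-only table or a write-once store rather than in memory; the in-memory list only illustrates the shape of an entry.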

Document Upload

Accepted document types: passport, national ID card, driver's license. Users upload images via a presigned S3 URL: the client uploads directly to S3, bypassing your application servers. This keeps large files out of your API tier and reduces latency. Server-side encryption (AES-256) is mandatory on the S3 bucket. Access to document images is restricted by IAM role to the KYC processing service.

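Validate file format and size (JPEG/PNG, max 10MB). Because the upload bypasses your servers, enforce the size limit in the presigned request itself (for example, a content-length-range condition on a presigned POST) and verify the format once the object lands in S3. Reject PDF and HEIC files, which OCR tools handle poorly. Generate a unique document_id and store the S3 key, document type, and upload timestamp in the database.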
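The validation and record-creation step could be sketched as below. It checks the file's leading bytes rather than trusting the file extension. The DocumentRecord shape, the register_document helper, and the document type identifiers are hypothetical; 10MB is interpreted here as 10 * 1024 * 1024 bytes, which is also an assumption.

```python
import uuid
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

MAX_BYTES = 10 * 1024 * 1024  # assumed reading of "10MB"
ACCEPTED_DOC_TYPES = {"passport", "national_id", "drivers_license"}

# Leading "magic" bytes of the two accepted image formats.
MAGIC_PREFIXES = {
    b"\xff\xd8\xff": "image/jpeg",
    b"\x89PNG\r\n\x1a\n": "image/png",
}


def sniff_image_type(head: bytes) -> Optional[str]:
    """Return the MIME type implied by the first bytes, or None."""
    for prefix, mime in MAGIC_PREFIXES.items():
        if head.startswith(prefix):
            return mime
    return None


@dataclass(frozen=True)
class DocumentRecord:
    document_id: str
    s3_key: str
    document_type: str
    content_type: str
    uploaded_at: datetime


def register_document(document_type: str, s3_key: str,
                      head: bytes, size_bytes: int) -> DocumentRecord:
    """Validate an uploaded object and build the row to store."""
    if document_type not in ACCEPTED_DOC_TYPES:
        raise ValueError(f"unsupported document type: {document_type}")
    if not 0 < size_bytes <= MAX_BYTES:
        raise ValueError("file is empty or larger than the size limit")
    content_type = sniff_image_type(head)
    if content_type is None:
        # PDF, HEIC, and anything else that is not JPEG/PNG ends up here.
        raise ValueError("only JPEG and PNG images are accepted")
    return DocumentRecord(
        document_id=str(uuid.uuid4()),
        s3_key=s3_key,
        document_type=document_type,
        content_type=content_type,
        uploaded_at=datetime.now(timezone.utc),
    )
```

Here head would be the first few bytes of the object, fetched with a ranged read, and size_bytes would come from the object's metadata.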

OCR Extraction

Once uploaded, the document image is passed to an OCR service: AWS Textract, Google Document AI, or a third-party KYC provider. The OCR step extracts structured fields: full name, date of birth, document number, expiry date, issuing country. For passports and ID cards with a machine-readable zone (MRZ), parse the MRZ directly for higher accuracy; its built-in check digits let you detect misread characters. Store extracted fields alongside the original document record. Flag low-confidence extractions (OCR confidence score below threshold) for manual review rather than rejecting outright.
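As an illustration of MRZ validation, the check digit defined in ICAO Doc 9303 weights each character by the repeating sequence 7, 3, 1 and takes the sum modulo 10. The sketch below implements that rule plus a simple low-confidence filter. The 0.90 threshold and the field names are assumptions, and the sample values come from the ICAO specimen passport.

```python
def mrz_char_value(ch: str) -> int:
    """Map an MRZ character to its numeric value (ICAO Doc 9303)."""
    if "0" <= ch <= "9":
        return ord(ch) - ord("0")
    if "A" <= ch <= "Z":
        return ord(ch) - ord("A") + 10
    if ch == "<":  # filler character
        return 0
    raise ValueError(f"invalid MRZ character: {ch!r}")


def mrz_check_digit(field: str) -> int:
    """Weighted sum with repeating weights 7, 3, 1, modulo 10."""
    weights = (7, 3, 1)
    total = sum(mrz_char_value(ch) * weights[i % 3] for i, ch in enumerate(field))
    return total % 10


def mrz_field_is_valid(field: str, check_char: str) -> bool:
    """True when the printed check character matches the computed digit."""
    return check_char.isascii() and check_char.isdigit() \
        and int(check_char) == mrz_check_digit(field)


def low_confidence_fields(confidences: dict, threshold: float = 0.90) -> list:
    """Names of extracted fields whose OCR confidence is below the threshold."""
    return sorted(name for name, score in confidences.items() if score < threshold)


# ICAO specimen: document number L898902C3 carries check digit 6.
print(mrz_field_is_valid("L898902C3", "6"))
print(low_confidence_fields({"full_name": 0.99, "expiry_date": 0.72}))
```

A failed check digit or a non-empty low-confidence list would route the case to manual review rather than reject it.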

Liveness Check and Face Comparison

Document verification alone is insufficient: a fraudster could submit someone else's document. A liveness check requires the user to take a selfie, sometimes with head movement or blinking to prove it is not a static photo. The selfie is then compared against the document photo using a face comparison API (AWS Rekognition, Azure Face API, or the one built into KYC providers like Jumio). A similarity score above a threshold (e.g., 95%) passes the face match; a score at or below it triggers manual review.
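The decision rule can be sketched in a few lines. The function name and the handling of a failed liveness check (routed to manual review) are assumptions; the text only specifies what happens on either side of the similarity threshold.

```python
from enum import Enum


class FaceOutcome(Enum):
    PASS = "PASS"
    MANUAL_REVIEW = "MANUAL_REVIEW"


def face_match_outcome(liveness_passed: bool, similarity_pct: float,
                       threshold_pct: float = 95.0) -> FaceOutcome:
    """Pass only when liveness succeeded and similarity exceeds the threshold."""
    if not 0.0 <= similarity_pct <= 100.0:
        raise ValueError("similarity must be a percentage between 0 and 100")
    if liveness_passed and similarity_pct > threshold_pct:
        return FaceOutcome.PASS
    # Assumption: a failed liveness check is escalated, not auto-rejected.
    return FaceOutcome.MANUAL_REVIEW
```

Keeping the threshold as a parameter makes it easy to tune against observed false-accept and false-reject rates.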

Third-Party Verification Providers

Providers like Jumio, Onfido, and Persona offer end-to-end identity verification as a service. You send them the document images and selfie via API; they return a verification decision and confidence score within seconds. Abstract the provider behind an internal interface so you can switch providers or run multiple providers in parallel for redundancy. Store the raw provider response and decision alongside your internal record — regulators may ask for this during audits.
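One way to express the internal interface is a structural type that each provider adapter implements. The sketch below uses sequential failover rather than parallel calls, to stay short. ProviderResult, IdentityProvider, StubProvider, and verify_with_fallback are hypothetical names, and no real provider SDK is called.

```python
from dataclasses import dataclass
from typing import Protocol, Sequence


@dataclass(frozen=True)
class ProviderResult:
    provider: str
    decision: str       # assumed vocabulary: "approved" | "declined" | "inconclusive"
    confidence: float
    raw_response: dict  # stored verbatim for audits


class IdentityProvider(Protocol):
    name: str

    def verify(self, document_image: bytes, selfie: bytes) -> ProviderResult:
        ...


class StubProvider:
    """Stand-in adapter; a real one would wrap a vendor's HTTP API."""

    def __init__(self, name: str, decision: str, confidence: float,
                 available: bool = True) -> None:
        self.name = name
        self._decision = decision
        self._confidence = confidence
        self._available = available

    def verify(self, document_image: bytes, selfie: bytes) -> ProviderResult:
        if not self._available:
            raise ConnectionError(f"{self.name} is unreachable")
        raw = {"decision": self._decision, "score": self._confidence}
        return ProviderResult(self.name, self._decision, self._confidence, raw)


def verify_with_fallback(providers: Sequence[IdentityProvider],
                         document_image: bytes, selfie: bytes) -> ProviderResult:
    """Try providers in order and return the first successful response."""
    errors = []
    for provider in providers:
        try:
            return provider.verify(document_image, selfie)
        except Exception as exc:  # an outage at one vendor should not fail the case
            errors.append(f"{provider.name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

Because callers depend only on IdentityProvider, swapping vendors means writing a new adapter, not touching the workflow code.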

Risk Scoring: PEP and Sanctions Screening

After identity is established, screen the person against risk databases. PEP (Politically Exposed Person) lists identify government officials and their associates who require enhanced due diligence. Sanctions lists (OFAC SDN, EU consolidated list, UN sanctions) identify individuals and entities you are legally prohibited from serving. Use a sanctions screening provider (Dow Jones, Refinitiv, ComplyAdvantage) that maintains up-to-date lists and provides fuzzy name matching to handle transliterations and aliases. Because fuzzy matching produces false positives, have an analyst confirm a potential match before acting on it. A confirmed sanctions hit results in immediate rejection and often requires filing a Suspicious Activity Report (SAR) with the relevant financial intelligence unit.
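To show what fuzzy name matching involves, here is a deliberately simple sketch using only the Python standard library: it strips diacritics and punctuation, ignores token order, and scores similarity with difflib. This is an illustration, not a substitute for a screening provider; the 0.85 threshold and the function names are assumptions, and the watchlist entries are made up.

```python
import unicodedata
from difflib import SequenceMatcher


def normalize_name(name: str) -> str:
    """Lowercase, strip accents and punctuation, and sort the name tokens."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    cleaned = "".join(c if c.isalnum() or c.isspace() else " " for c in no_accents)
    return " ".join(sorted(cleaned.casefold().split()))


def name_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two normalized names."""
    return SequenceMatcher(None, normalize_name(a), normalize_name(b)).ratio()


def screen(name: str, watchlist: list, threshold: float = 0.85) -> list:
    """Return (entry, score) pairs at or above the threshold, best first."""
    scored = [(entry, name_similarity(name, entry)) for entry in watchlist]
    hits = [pair for pair in scored if pair[1] >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Fictional watchlist entries for demonstration only.
watchlist = ["GARCIA, José", "Vladimir Example"]
print(screen("Jose Garcia", watchlist))
```

Real screening also has to handle aliases, transliteration between scripts, and secondary identifiers such as date of birth, which is why the text recommends a dedicated provider.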

Manual Review Queue

Cases that automated checks cannot resolve go to a compliance analyst dashboard. The dashboard shows: document images, OCR-extracted fields, provider decisions, risk scores, and the reason the case was escalated. Analysts have three actions: approve (transition to VERIFIED), reject (transition to REJECTED with reason code), or request additional documents (transition back to DOCUMENT_SUBMITTED with instructions to the user). Track analyst decisions with their user ID and timestamp for accountability and training data.
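The three analyst actions map directly onto target states, which can be captured in a small lookup plus a decision record. The names below (record_review, ReviewDecision, the action identifiers) are hypothetical, and requiring instructions on a document request is an assumption drawn from the description above.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

# Analyst action -> state the case moves to.
ACTION_TARGET_STATE = {
    "approve": "VERIFIED",
    "reject": "REJECTED",
    "request_documents": "DOCUMENT_SUBMITTED",
}


@dataclass(frozen=True)
class ReviewDecision:
    case_id: str
    analyst_id: str
    action: str
    new_state: str
    reason_code: Optional[str]
    instructions: Optional[str]
    decided_at: datetime


def record_review(case_id: str, current_state: str, analyst_id: str, action: str,
                  reason_code: Optional[str] = None,
                  instructions: Optional[str] = None) -> ReviewDecision:
    """Validate an analyst action and build the record to persist."""
    if current_state != "MANUAL_REVIEW":
        raise ValueError("only cases in MANUAL_REVIEW can be reviewed")
    if action not in ACTION_TARGET_STATE:
        raise ValueError(f"unknown action: {action}")
    if action == "reject" and not reason_code:
        raise ValueError("a rejection needs a reason code")
    if action == "request_documents" and not instructions:
        raise ValueError("a document request needs instructions for the user")
    return ReviewDecision(case_id, analyst_id, action, ACTION_TARGET_STATE[action],
                          reason_code, instructions, datetime.now(timezone.utc))
```

Persisting every ReviewDecision gives both the accountability trail and a labeled dataset of analyst outcomes.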

Data Retention and GDPR Compliance

Document images contain sensitive biometric and identity data. Under GDPR, data should be retained only as long as necessary. After a user is verified, delete the raw document images from S3 and retain only the extracted metadata (name, DOB, document type, expiry), unless the AML rules you operate under require keeping copies of identity documents for verified customers as well; check this per jurisdiction. Set a retention policy and a scheduled deletion job. For rejected or manually reviewed cases, retain images for the required regulatory period (typically 5–7 years for AML compliance), then delete on schedule. Build a data deletion workflow for right-to-erasure requests that respects regulatory retention minimums.
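The retention rules can be reduced to a function that computes the earliest date an image may be deleted, which both the scheduled job and the erasure workflow can share. The 5-year figure is one assumed value from the range above, and the function names and return strings are hypothetical.

```python
from datetime import date

RETENTION_YEARS = 5  # assumption: lower bound of the 5-7 year range


def add_years(d: date, years: int) -> date:
    """Add calendar years, mapping Feb 29 to Feb 28 in non-leap years."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, day=28)


def images_deletable_from(final_state: str, manually_reviewed: bool,
                          decided_on: date) -> date:
    """Earliest date on which the raw document images may be deleted."""
    if final_state == "VERIFIED" and not manually_reviewed:
        return decided_on  # automatically verified: no extra holding period
    return add_years(decided_on, RETENTION_YEARS)


def handle_erasure_request(final_state: str, manually_reviewed: bool,
                           decided_on: date, today: date) -> str:
    """Honor a right-to-erasure request unless a retention minimum still applies."""
    deletable = images_deletable_from(final_state, manually_reviewed, decided_on)
    if today >= deletable:
        return "DELETE_NOW"
    return f"DEFER_UNTIL_{deletable.isoformat()}"
```

A deferred request should still be recorded so the deletion job can act on it as soon as the holding period ends.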

Re-verification Triggers

KYC is not a one-time event. Trigger re-verification when: a user's identity document expires, your risk threshold changes and existing users no longer meet the new standard, sanctions screening produces a new hit on a previously verified user, or a user changes their name or other identity attributes. Build re-verification as a first-class workflow — do not treat it as an edge case. Notify users in advance of required re-verification and provide a grace period before restricting access.
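The four triggers can be evaluated together in a periodic job. In this sketch the VerifiedUser fields, the reason codes, the 30-day notice window, and the 14-day grace period are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class VerifiedUser:
    user_id: str
    document_expiry: date
    risk_score: float
    has_new_sanctions_hit: bool = False
    identity_attributes_changed: bool = False


def reverification_reasons(user: VerifiedUser, today: date,
                           max_risk_score: float, notice_days: int = 30) -> list:
    """Return every trigger that currently applies to the user."""
    reasons = []
    # Fire ahead of expiry so the user can be notified in advance.
    if user.document_expiry <= today + timedelta(days=notice_days):
        reasons.append("DOCUMENT_EXPIRY")
    if user.risk_score > max_risk_score:
        reasons.append("RISK_THRESHOLD_CHANGED")
    if user.has_new_sanctions_hit:
        reasons.append("NEW_SANCTIONS_HIT")
    if user.identity_attributes_changed:
        reasons.append("IDENTITY_ATTRIBUTES_CHANGED")
    return reasons


def restrict_access_on(notified_on: date, grace_days: int = 14) -> date:
    """Date on which access is restricted if re-verification is not completed."""
    return notified_on + timedelta(days=grace_days)
```

A new sanctions hit would normally skip the grace period and restrict access immediately; that policy decision is left out of the sketch.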

{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "What are the stages in a KYC verification workflow?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "A typical KYC pipeline progresses through document capture, document authenticity checks (MRZ parsing, hologram detection, expiry validation), identity data extraction (OCR of name/DOB/document number), biometric liveness check, face-match between the document photo and the live selfie, and finally sanctions/PEP/adverse-media screening against watchlists. Each stage emits a structured result that either advances the applicant to the next stage or routes them to manual review."
      }
    },
    {
      "@type": "Question",
      "name": "How does liveness check prevent photo spoofing?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Active liveness detection instructs the user to perform randomized head movements (blink, turn left, smile) and uses a depth-estimation model or 3D facial landmark tracker to confirm that the response is coming from a live face rather than a printed photo or a pre-recorded video replayed in front of the camera. Passive liveness models analyze texture and reflection artifacts to distinguish real skin from printed or digital reproductions without requiring user interaction."
      }
    },
    {
      "@type": "Question",
      "name": "How is PII handled in KYC document processing for GDPR compliance?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Document images and extracted PII fields are encrypted at rest with a per-user data key managed by a KMS, and access is logged to an immutable audit trail; retention policies automatically purge raw document images after the regulatory minimum holding period (typically 5 years for AML), while derived hashed identifiers used for de-duplication remain. Data minimization requires that downstream services receive only the verification outcome and risk score, never the raw document data."
      }
    },
    {
      "@type": "Question",
      "name": "What triggers a manual review in an automated KYC pipeline?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Manual review is triggered when the automated confidence score for any stage falls below a configurable threshold (for example, face-match similarity below 0.85, OCR confidence below 0.90, or a liveness score in an ambiguous range), or when the applicant appears on a sanctions or PEP watchlist that requires human judgment about risk tolerance. High-risk jurisdictions and politically exposed persons are also routed to manual review regardless of automated score."
      }
    }
  ]
}