Account Integrity Service Low-Level Design: Takeover Detection, Recovery Flow, and Step-Up Auth

Account Integrity Service Overview

The Account Integrity Service protects user accounts from takeover attacks, credential stuffing, and unauthorized access. It combines real-time behavioral signals, device fingerprinting, and credential breach data to detect anomalous access events, triggers step-up authentication challenges when risk is elevated, and provides a secure, audited recovery flow when accounts are compromised.

Requirements

Functional Requirements

Evaluate account takeover (ATO) risk for every login, sensitive action, and session token refresh event.
Trigger step-up authentication (SMS OTP, TOTP, passkey assertion) when risk exceeds a configurable threshold.
Lock accounts and initiate a recovery flow when high-confidence ATO signals are detected.
Provide a secure account recovery path: identity verification, credential reset, active session revocation, and notification.
Emit a tamper-evident audit log of all integrity events for security and compliance teams.

Non-Functional Requirements

ATO risk evaluation within 50 ms at p99 on the critical login path.
No single point of failure; the service degrades gracefully to a conservative allow-with-logging policy if the risk model is unavailable.
Audit log immutability enforced via append-only storage with cryptographic chaining.

Data Model

The AccessEvent record captures: event_id UUID, account_id, event_type ENUM (login, password-change, mfa-add, session-refresh), ip_address, device_fingerprint, user_agent, geo_country, timestamp, and auth_method ENUM.

The AccountRiskState record (stored in Redis, TTL 30 days) holds: account_id, risk_score FLOAT, risk_factors LIST, last_known_device_ids SET, last_known_ips SET, lockout_state ENUM (none, soft-locked, hard-locked), and step_up_challenge_pending BOOL.

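The AuditEntry table is append-only: entry_id UUID, account_id, event_type, actor, payload JSON, timestamp, prev_hash CHAR(64), entry_hash CHAR(64). Each entry stores its predecessor's hash in prev_hash and computes entry_hash over its own fields together with prev_hash, so altering any earlier entry invalidates every hash after it and the chain is tamper-evident. The 64-character width fits a hex-encoded 256-bit digest such as SHA-256.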
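The chaining can be illustrated with a minimal in-memory sketch. It assumes SHA-256 (consistent with the CHAR(64) columns, but not stated in the design), a canonical JSON encoding of the fields, and an all-zero genesis hash for the first entry; the helper names append_entry and verify_chain are hypothetical.

```python
import hashlib
import json
from dataclasses import dataclass, replace

GENESIS_HASH = "0" * 64  # assumed prev_hash for the very first entry


@dataclass(frozen=True)
class AuditEntry:
    entry_id: str
    account_id: str
    event_type: str
    actor: str
    payload: dict
    timestamp: str
    prev_hash: str
    entry_hash: str


def compute_entry_hash(entry_id, account_id, event_type, actor,
                       payload, timestamp, prev_hash):
    # Canonical JSON (sorted keys, fixed separators) so equal fields
    # always serialize, and therefore hash, identically.
    body = json.dumps(
        [entry_id, account_id, event_type, actor, payload, timestamp, prev_hash],
        sort_keys=True, separators=(",", ":"),
    )
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append_entry(chain, entry_id, account_id, event_type, actor,
                 payload, timestamp):
    # Link the new entry to the current tail of the chain.
    prev_hash = chain[-1].entry_hash if chain else GENESIS_HASH
    entry_hash = compute_entry_hash(entry_id, account_id, event_type, actor,
                                    payload, timestamp, prev_hash)
    entry = AuditEntry(entry_id, account_id, event_type, actor,
                       payload, timestamp, prev_hash, entry_hash)
    chain.append(entry)
    return entry


def verify_chain(chain):
    """Return the index of the first inconsistent entry, or -1 if intact."""
    prev = GENESIS_HASH
    for i, e in enumerate(chain):
        expected = compute_entry_hash(e.entry_id, e.account_id, e.event_type,
                                      e.actor, e.payload, e.timestamp,
                                      e.prev_hash)
        if e.prev_hash != prev or e.entry_hash != expected:
            return i
        prev = e.entry_hash
    return -1
```

Verification walks the chain from the first entry, so an edit to a historical row is reported at the position where the recomputed hash stops matching.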

Core Algorithms

Account Takeover Detection

The risk model combines: (1) a device and IP novelty score — whether the fingerprint and IP have been seen for this account before; (2) a velocity score — login attempts per hour from new locations; (3) a credential breach signal — whether the submitted credential pair appears in a known breach database (checked via k-anonymity prefix query against a breach API); and (4) a behavioral biometrics anomaly score for flows where keystroke or mouse dynamics data is available. Signals are combined via a gradient-boosted classifier (XGBoost) producing a score in [0, 1]. The model is evaluated in-process from a preloaded artifact to meet the 50 ms budget.
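The k-anonymity prefix query for the breach signal might look like the sketch below. Several details are assumptions, not part of the design: the breach corpus is keyed by SHA-1 of "username:password", the prefix is five hex characters, and the breach API is modelled as a hypothetical callable that returns the hash suffixes stored under a prefix. An in-memory stand-in replaces the real API.

```python
import hashlib
from typing import Callable, Iterable

PREFIX_LEN = 5  # assumed prefix length, in hex characters


def credential_digest(username: str, password: str) -> str:
    # Assumption: the corpus hashes the lower-cased username and the
    # password joined by a colon.
    material = f"{username.lower()}:{password}".encode("utf-8")
    return hashlib.sha1(material).hexdigest().upper()


def is_breached(username: str, password: str,
                fetch_suffixes: Callable[[str], Iterable[str]]) -> bool:
    digest = credential_digest(username, password)
    prefix, suffix = digest[:PREFIX_LEN], digest[PREFIX_LEN:]
    # Only the short prefix is sent to the breach API; the full digest
    # never leaves this process and the match is decided locally.
    return suffix in set(fetch_suffixes(prefix))


def make_fake_breach_api(breached_pairs):
    """In-memory stand-in for the breach API, for illustration only."""
    buckets = {}
    for user, pw in breached_pairs:
        d = credential_digest(user, pw)
        buckets.setdefault(d[:PREFIX_LEN], set()).add(d[PREFIX_LEN:])

    def fetch(prefix: str):
        return buckets.get(prefix, set())

    return fetch
```

Because many digests share each prefix, the API operator cannot tell which credential was being checked; the boolean result would then feed the classifier as one feature.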

Step-Up Authentication Trigger

A threshold policy maps risk score ranges to challenge types: scores above 0.4 and up to 0.7 trigger an SMS OTP; scores above 0.7 require a TOTP or passkey assertion. The challenge is issued by returning a 401 response with a WWW-Authenticate: StepUp challenge_token=... header. The challenge token is a JWT signed with HMAC-SHA256 (the HS256 algorithm) that encodes the required factor and a short TTL (5 minutes). Successful challenge completion reduces the account risk score and marks the device as trusted.
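A minimal sketch of the threshold policy and the challenge token follows, using only the standard library. The claim names (sub, factors, iat, exp), the function names, and the handling of a score exactly on a threshold are assumptions for illustration; a production service would more likely rely on a vetted JWT library.

```python
import base64
import hashlib
import hmac
import json
import time

SMS_THRESHOLD = 0.4
STRONG_THRESHOLD = 0.7
CHALLENGE_TTL_SECONDS = 300  # 5 minutes


def required_factors(risk_score: float) -> list:
    # Strictly-greater comparisons: a score exactly on a threshold
    # falls into the lower band (an assumption).
    if risk_score > STRONG_THRESHOLD:
        return ["totp", "passkey"]
    if risk_score > SMS_THRESHOLD:
        return ["sms_otp"]
    return []


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(signing_input: str, key: bytes) -> bytes:
    return hmac.new(key, signing_input.encode("ascii"), hashlib.sha256).digest()


def issue_challenge_token(account_id, factors, key: bytes, now=None) -> str:
    now = int(time.time()) if now is None else now
    header = {"alg": "HS256", "typ": "JWT"}
    claims = {"sub": account_id, "factors": factors,
              "iat": now, "exp": now + CHALLENGE_TTL_SECONDS}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    return signing_input + "." + _b64url(_sign(signing_input, key))


def verify_challenge_token(token: str, key: bytes, now=None):
    """Return the claims if the signature and expiry check out, else None."""
    now = int(time.time()) if now is None else now
    try:
        header_b64, claims_b64, sig_b64 = token.split(".")
        expected = _sign(f"{header_b64}.{claims_b64}", key)
        if not hmac.compare_digest(expected, _b64url_decode(sig_b64)):
            return None
        claims = json.loads(_b64url_decode(claims_b64))
    except ValueError:  # malformed token, bad base64, or bad JSON
        return None
    return claims if now < claims["exp"] else None
```

The signature is checked with a constant-time comparison before the claims are parsed, so an unsigned or re-signed payload is rejected without being interpreted.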

Recovery Flow

Account recovery follows a multi-step protocol: (1) identity verification via a recovery code, backup email OTP, or identity document upload; (2) all active sessions are revoked by incrementing a session generation counter in the auth token service; (3) credentials are reset and re-encryption of any account-level secrets is triggered; (4) a recovery audit record is written and an out-of-band notification (email plus SMS if available) is sent to the account holder.
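Step (2), revocation by session generation counter, can be sketched as follows. The in-memory store and the names SessionGenerationStore, issue_session, and is_session_valid are hypothetical; the design only states that the counter lives in the auth token service.

```python
from dataclasses import dataclass


class SessionGenerationStore:
    """In-memory stand-in for the counter kept by the auth token service."""

    def __init__(self):
        self._generation = {}

    def current(self, account_id: str) -> int:
        return self._generation.get(account_id, 0)

    def revoke_all(self, account_id: str) -> int:
        # Incrementing the counter invalidates every session issued
        # under an earlier generation, without enumerating them.
        self._generation[account_id] = self.current(account_id) + 1
        return self._generation[account_id]


@dataclass(frozen=True)
class SessionToken:
    account_id: str
    generation: int  # generation that was current when the token was issued


def issue_session(store: SessionGenerationStore, account_id: str) -> SessionToken:
    return SessionToken(account_id, store.current(account_id))


def is_session_valid(store: SessionGenerationStore, token: SessionToken) -> bool:
    return token.generation == store.current(token.account_id)
```

A single counter write revokes all sessions at once, at the cost of one counter lookup per token validation.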

API Design

EvaluateAccessEvent(AccessEvent) → RiskDecision: called by the auth service on every login, sensitive action, and session token refresh; returns allow, challenge, or deny with risk metadata.
IssueStepUpChallenge(AccountId, ChallengeType) → ChallengeToken: issues a signed step-up challenge token.
VerifyStepUpResponse(ChallengeToken, Response) → VerificationResult: validates the user response to the step-up challenge.
InitiateRecovery(RecoveryRequest) → RecoverySessionId: starts a recovery session; returns a session ID for subsequent verification steps.
GetAuditLog(AccountId, TimeRange) → AuditEntries: returns the tamper-evident audit history for an account.

Scalability and Fault Tolerance

The risk evaluation path is stateless beyond the Redis account state read, enabling horizontal scaling. Redis is configured with read replicas in each region; the primary handles writes, replicas serve the hot read path. On Redis unavailability, the service falls back to a conservative policy: new devices trigger a step-up challenge, known devices are allowed with an elevated-risk log entry.
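The Redis fallback can be sketched as a wrapper around the state read. The design does not say how a device is recognized as known while Redis is down; this sketch assumes a locally available set of known device IDs (for example a process-level cache), and every name in it is hypothetical.

```python
from enum import Enum
from typing import Callable, List, Set


class Decision(Enum):
    ALLOW = "allow"
    CHALLENGE = "challenge"


class StateStoreUnavailable(Exception):
    """Raised by the (hypothetical) state reader when Redis is unreachable."""


def unavailable_reader() -> dict:
    """Stand-in reader that simulates a Redis outage."""
    raise StateStoreUnavailable("redis unreachable")


def evaluate(device_id: str,
             read_state: Callable[[], dict],
             decide_from_state: Callable[[dict], Decision],
             locally_known_devices: Set[str],
             risk_log: List[str]) -> Decision:
    try:
        state = read_state()
    except StateStoreUnavailable:
        # Conservative fallback: known devices pass with an elevated-risk
        # log entry, new devices must complete a step-up challenge.
        if device_id in locally_known_devices:
            risk_log.append(
                f"elevated-risk: state store down, allowed known device {device_id}"
            )
            return Decision.ALLOW
        return Decision.CHALLENGE
    # Normal path: the decision comes from the stored account risk state.
    return decide_from_state(state)
```

Only the store-unavailable error selects the fallback; any other exception propagates, so an unrelated bug is not silently treated as an outage.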

The ATO detection model is loaded from object storage at startup. Hot-swap follows the pattern common to ML serving systems: a new artifact triggers a background load; once validation passes (test inputs produce expected outputs), the model is atomically promoted. The previous model remains in memory for 60 seconds as a fallback in case the promoted model misbehaves.
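A minimal sketch of validate-then-promote with a time-limited fallback is shown below. Models are represented as plain callables rather than real XGBoost artifacts, and the ModelHolder class, its method names, and the reading of "fallback" as a rollback window are assumptions.

```python
import threading
import time
from typing import Callable, Optional, Sequence, Tuple

Model = Callable[[Sequence[float]], float]  # stand-in for a loaded classifier


class ModelHolder:
    def __init__(self, model: Model, fallback_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self._lock = threading.Lock()
        self._active = model
        self._previous: Optional[Model] = None
        self._previous_expires = 0.0
        self._fallback_seconds = fallback_seconds
        self._clock = clock

    def predict(self, features: Sequence[float]) -> float:
        # Take a reference under the lock, then score outside it so
        # a swap never blocks behind an in-flight prediction.
        with self._lock:
            model = self._active
        return model(features)

    def promote(self, candidate: Model,
                validation_cases: Sequence[Tuple[Sequence[float], float]],
                tolerance: float = 1e-6) -> bool:
        # Validation: known inputs must reproduce their expected outputs.
        for features, expected in validation_cases:
            if abs(candidate(features) - expected) > tolerance:
                return False
        with self._lock:
            self._previous = self._active
            self._previous_expires = self._clock() + self._fallback_seconds
            self._active = candidate
        return True

    def rollback(self) -> bool:
        """Restore the previous model if it is still inside the fallback window."""
        with self._lock:
            if self._previous is None or self._clock() > self._previous_expires:
                self._previous = None  # window elapsed: release the old model
                return False
            self._active, self._previous = self._previous, None
            return True
```

The clock is injectable so the 60-second window can be exercised in tests without sleeping.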

Monitoring

Track step-up challenge issuance rate, completion rate, and abandonment rate; high abandonment may indicate friction or false positives.
Alert when account lockout rates spike above baseline; a spike could signal a credential stuffing wave or a model miscalibration.
Monitor breach database query latency and alert if p99 exceeds 20 ms, since this lookup consumes part of the 50 ms risk evaluation budget.
Publish weekly integrity reports: ATOs detected, step-up challenges issued, accounts recovered, and false-positive estimates from analyst reviews.

Frequently Asked Questions

How is XGBoost used for account takeover (ATO) detection?

An XGBoost classifier is trained on labeled login events (legitimate vs. ATO) using features like device fingerprint match, IP geolocation velocity, typing cadence, and time-since-last-login. The model outputs a real-time risk score at each authentication event. Scores above a threshold trigger a step-up challenge; very high scores block the attempt and alert the account owner.

What is a step-up authentication challenge in account integrity?

A step-up challenge requires the user to prove identity through an additional factor (SMS OTP, authenticator app TOTP, email link, or passkey) when the risk score exceeds a configured threshold. It is 'step-up' because it only activates for risky sessions, avoiding friction for low-risk logins. Challenge outcomes feed back into the risk model as a training signal.

How is a cryptographically chained audit log implemented for account actions?

Each audit log entry includes a hash of the previous entry's content, forming a chain similar to a blockchain. Any tampering with a historical record breaks the chain and is detectable during verification. Entries are append-only, stored in a write-once store, and periodically checkpointed with a signature from a hardware security module (HSM).

How does session revocation work in an account integrity system?

When an ATO is confirmed, all active sessions for the account are invalidated by publishing their session tokens to a blocklist (e.g., Redis set with TTL matching token lifetime). At each API request, the auth middleware checks the blocklist before honoring the token. The account owner is notified and guided through a credential reset flow before new sessions can be created.