Account Integrity Service Overview
The Account Integrity Service protects user accounts from takeover attacks, credential stuffing, and unauthorized access. It combines real-time behavioral signals, device fingerprinting, and credential breach data to detect anomalous access events, triggers step-up authentication challenges when risk is elevated, and provides a secure, audited recovery flow when accounts are compromised.
Requirements
Functional Requirements
- Evaluate account takeover (ATO) risk for every login, sensitive action, and session token refresh event.
- Trigger step-up authentication (SMS OTP, TOTP, passkey assertion) when risk exceeds a configurable threshold.
- Lock accounts and initiate a recovery flow when high-confidence ATO signals are detected.
- Provide a secure account recovery path: identity verification, credential reset, active session revocation, and notification.
- Emit a tamper-evident audit log of all integrity events for security and compliance teams.
Non-Functional Requirements
- ATO risk evaluation within 50 ms at p99 on the critical login path.
- No single point of failure; the service degrades gracefully to a conservative allow-with-logging policy if the risk model is unavailable.
- Audit log immutability enforced via append-only storage with cryptographic chaining.
Data Model
The AccessEvent record captures: event_id UUID, account_id, event_type ENUM (login, password-change, mfa-add, session-refresh), ip_address, device_fingerprint, user_agent, geo_country, timestamp, and auth_method ENUM.
The AccountRiskState record (stored in Redis, TTL 30 days) holds: account_id, risk_score FLOAT, risk_factors LIST, last_known_device_ids SET, last_known_ips SET, lockout_state ENUM (none, soft-locked, hard-locked), and step_up_challenge_pending BOOL.
The AuditEntry table is append-only: entry_id UUID, account_id, event_type, actor, payload JSON, timestamp, prev_hash CHAR(64), entry_hash CHAR(64). Each entry hashes its predecessor, forming a tamper-evident chain.
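The chaining described above can be sketched in a few lines. This is a minimal illustration, not the service's actual implementation: it assumes SHA-256 (consistent with the CHAR(64) hex columns), a canonical JSON encoding of the entry, and an all-zero sentinel as the first entry's prev_hash. The function names and the in-memory list standing in for the append-only table are hypothetical.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # assumed sentinel used as prev_hash of the first entry


def compute_entry_hash(entry: dict, prev_hash: str) -> str:
    """SHA-256 over the predecessor hash plus a canonical JSON encoding of the entry."""
    canonical = json.dumps(entry, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + canonical).encode("utf-8")).hexdigest()


def append_entry(chain: list, entry: dict) -> dict:
    """Append an entry, linking it to the current tail of the chain."""
    prev_hash = chain[-1]["entry_hash"] if chain else GENESIS_HASH
    record = dict(entry, prev_hash=prev_hash,
                  entry_hash=compute_entry_hash(entry, prev_hash))
    chain.append(record)
    return record


def verify_chain(chain: list) -> bool:
    """Recompute every hash; an edited or reordered entry breaks the chain."""
    prev_hash = GENESIS_HASH
    for record in chain:
        body = {k: v for k, v in record.items()
                if k not in ("prev_hash", "entry_hash")}
        if record["prev_hash"] != prev_hash:
            return False
        if record["entry_hash"] != compute_entry_hash(body, prev_hash):
            return False
        prev_hash = record["entry_hash"]
    return True
```

Because each entry_hash depends on its predecessor, changing any historical row forces every later hash to change as well. A periodic verify_chain pass, or an externally anchored copy of the latest hash, is what would surface the tampering.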
Core Algorithms
Account Takeover Detection
The risk model combines four signals: (1) a device and IP novelty score, indicating whether the fingerprint and IP have been seen for this account before; (2) a velocity score, measured as login attempts per hour from new locations; (3) a credential breach signal, indicating whether the submitted credential pair appears in a known breach database (checked via a k-anonymity prefix query against a breach API); and (4) a behavioral biometrics anomaly score for flows where keystroke or mouse dynamics data is available. The signals feed a gradient-boosted classifier (XGBoost) that produces a score in [0, 1]. The model is evaluated in-process from a preloaded artifact to meet the 50 ms budget.
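The k-anonymity prefix query for the breach signal can be sketched as follows. Everything here is an assumption made for illustration: the hashing scheme (SHA-256 over a normalized username and password), the 5-character prefix length, and the fetch_suffixes callable that stands in for whatever breach API the service uses. The point is only that the full digest never leaves the process.

```python
import hashlib
from typing import Callable, Iterable

PREFIX_LEN = 5  # assumed prefix length, in hex characters


def credential_digest(username: str, password: str) -> str:
    """Hex digest of the normalized credential pair (hashing scheme is an assumption)."""
    material = f"{username.strip().lower()}\x00{password}".encode("utf-8")
    return hashlib.sha256(material).hexdigest().upper()


def is_breached(username: str, password: str,
                fetch_suffixes: Callable[[str], Iterable[str]]) -> bool:
    """Send only the digest prefix upstream; match the returned suffixes locally."""
    digest = credential_digest(username, password)
    prefix, suffix = digest[:PREFIX_LEN], digest[PREFIX_LEN:]
    # fetch_suffixes is a hypothetical client call that returns every known
    # breached suffix sharing this prefix.
    return suffix in set(fetch_suffixes(prefix))
```

The breach API sees only a short prefix shared by many unrelated digests, so it cannot tell which credential was checked. The boolean result would then be one input feature to the classifier.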
Step-Up Authentication Trigger
A threshold policy maps risk score ranges to challenge types: scores above 0.4 and up to 0.7 trigger an SMS OTP; scores above 0.7 require a TOTP code or passkey assertion. The challenge is issued by returning a 401 response with a WWW-Authenticate: StepUp challenge_token=... header. The challenge token is a signed JWT (HMAC-SHA256) encoding the required factor and a short TTL (5 minutes). Successful completion of the challenge reduces the account risk score and marks the device as trusted.
Recovery Flow
Account recovery follows a multi-step protocol: (1) identity verification via a recovery code, backup email OTP, or identity document upload; (2) all active sessions are revoked by incrementing a session generation counter in the auth token service; (3) credentials are reset and re-encryption of any account-level secrets is triggered; (4) a recovery audit record is written and an out-of-band notification (email plus SMS if available) is sent to the account holder.
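Step (2) relies on a per-account generation counter: every session token records the generation it was issued under, and a token is valid only while that number matches the current counter. The sketch below uses an in-memory dictionary as a stand-in for the auth token service. The class and method names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class SessionToken:
    account_id: str
    generation: int  # counter value at the time the token was issued


@dataclass
class SessionRegistry:
    """In-memory stand-in for the auth token service's generation counters."""
    _generations: Dict[str, int] = field(default_factory=dict)

    def issue(self, account_id: str) -> SessionToken:
        # New tokens carry the account's current generation.
        return SessionToken(account_id, self._generations.setdefault(account_id, 0))

    def is_valid(self, token: SessionToken) -> bool:
        # A token from an older generation is treated as revoked.
        return token.generation == self._generations.get(token.account_id, 0)

    def revoke_all(self, account_id: str) -> int:
        # One increment invalidates every outstanding token for the account.
        new_generation = self._generations.get(account_id, 0) + 1
        self._generations[account_id] = new_generation
        return new_generation
```

The appeal of this design is that revocation is a single write regardless of how many sessions exist; there is no need to enumerate or delete individual tokens.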
API Design
- EvaluateAccessEvent(AccessEvent) → RiskDecision: called by the auth service on every login, sensitive action, and session token refresh; returns allow, challenge, or deny with risk metadata.
- IssueStepUpChallenge(AccountId, ChallengeType) → ChallengeToken: issues a signed step-up challenge token.
- VerifyStepUpResponse(ChallengeToken, Response) → VerificationResult: validates the user's response to the step-up challenge.
- InitiateRecovery(RecoveryRequest) → RecoverySessionId: starts a recovery session; returns a session ID for subsequent verification steps.
- GetAuditLog(AccountId, TimeRange) → AuditEntries: returns the tamper-evident audit history for an account.
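As a rough illustration of the request and response shapes, the first call might be typed as below. The field selection is a subset of the AccessEvent record; the Verdict enum, the RiskDecision fields, and the Protocol name are assumptions, since the text specifies only the three outcomes and "risk metadata".

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Protocol


class Verdict(Enum):
    ALLOW = "allow"
    CHALLENGE = "challenge"
    DENY = "deny"


@dataclass(frozen=True)
class AccessEvent:
    event_id: str
    account_id: str
    event_type: str          # login, password-change, mfa-add, session-refresh
    ip_address: str
    device_fingerprint: str


@dataclass(frozen=True)
class RiskDecision:
    verdict: Verdict
    risk_score: float                                      # in [0, 1]
    risk_factors: List[str] = field(default_factory=list)  # assumed metadata


class AccountIntegrityApi(Protocol):
    """Hypothetical interface; only the first call from the list is shown."""

    def evaluate_access_event(self, event: AccessEvent) -> RiskDecision:
        ...
```

Keeping the decision immutable makes it safe to attach unchanged to the audit entry that records the evaluation.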
Scalability and Fault Tolerance
The risk evaluation path is stateless beyond the Redis account state read, enabling horizontal scaling. Redis is configured with read replicas in each region; the primary handles writes, replicas serve the hot read path. On Redis unavailability, the service falls back to a conservative policy: new devices trigger a step-up challenge, known devices are allowed with an elevated-risk log entry.
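The fallback policy can be sketched as a wrapper around the normal scoring path. One assumption needs stating up front: with Redis down, the service cannot read last_known_device_ids, so this sketch assumes that known-device status arrives from another source (for example a signed trusted-device cookie) and is passed in as device_trusted_hint. That mechanism, the exception type, and the function names are all hypothetical.

```python
from enum import Enum
from typing import Callable


class Decision(Enum):
    ALLOW = "allow"
    CHALLENGE = "challenge"
    DENY = "deny"


class StateStoreUnavailable(Exception):
    """Hypothetical error raised when the account risk state cannot be read."""


def evaluate_with_fallback(account_id: str,
                           device_trusted_hint: bool,
                           score_with_state: Callable[[str], Decision],
                           log: Callable[[str], None]) -> Decision:
    """Use the normal path when possible; otherwise apply the conservative policy."""
    try:
        return score_with_state(account_id)
    except StateStoreUnavailable:
        if device_trusted_hint:
            # Known device: allow, but leave an elevated-risk trail for review.
            log(f"elevated-risk: state store unavailable, allowed known device for {account_id}")
            return Decision.ALLOW
        # New device: never allow silently while state is unreadable.
        return Decision.CHALLENGE
```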
The ATO detection model is loaded from object storage at startup. Hot-swapping follows the pattern common to ML serving systems: the arrival of a new artifact triggers a background load; once validation passes (a set of test inputs produces the expected outputs), the new model is atomically promoted. The previous model remains in memory as a fallback for 60 seconds.
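One way to sketch validate-then-promote with a timed fallback is shown below. Models are represented as plain callables from a feature vector to a score; the golden-case format, the tolerance, and the class and method names are assumptions. Loading from object storage and running the load in a background thread are omitted.

```python
import threading
import time
from typing import Callable, Optional, Sequence, Tuple

Model = Callable[[Sequence[float]], float]
GoldenCase = Tuple[Sequence[float], float]  # (features, expected score)


class ModelHolder:
    FALLBACK_SECONDS = 60.0  # previous model is retained this long

    def __init__(self, model: Model, clock: Callable[[], float] = time.monotonic):
        self._lock = threading.Lock()
        self._active: Model = model
        self._previous: Optional[Model] = None
        self._promoted_at = 0.0
        self._clock = clock

    def try_promote(self, candidate: Model, golden_cases: Sequence[GoldenCase],
                    tolerance: float = 1e-6) -> bool:
        """Validate the candidate on known inputs, then swap it in under the lock."""
        for features, expected in golden_cases:
            if abs(candidate(features) - expected) > tolerance:
                return False  # validation failed; the active model is untouched
        with self._lock:
            self._previous, self._active = self._active, candidate
            self._promoted_at = self._clock()
        return True

    def rollback(self) -> bool:
        """Restore the previous model if the fallback window is still open."""
        with self._lock:
            expired = self._clock() - self._promoted_at > self.FALLBACK_SECONDS
            if self._previous is None or expired:
                self._previous = None  # release the old model's memory
                return False
            self._active, self._previous = self._previous, None
            return True

    def score(self, features: Sequence[float]) -> float:
        with self._lock:
            model = self._active  # take a reference, then score outside the lock
        return model(features)
```

Holding the lock only for the reference swap keeps promotion off the scoring hot path, which matters given the 50 ms budget.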
Monitoring
- Track step-up challenge issuance rate, completion rate, and abandonment rate; high abandonment may indicate friction or false positives.
- Alert when account lockout rates spike above baseline; a spike could signal a credential stuffing wave or a model miscalibration.
- Monitor breach database query latency; alert if p99 exceeds 20 ms, because the lookup feeds the risk model and counts against the 50 ms risk evaluation budget.
- Publish weekly integrity reports: ATOs detected, step-up challenges issued, accounts recovered, and false-positive estimates from analyst reviews.