Requirements and Constraints
A risk scoring service aggregates signals from multiple data sources, runs them through an ensemble of models, produces a calibrated composite score, and generates a human-readable explanation of the score's key drivers. It is consumed by credit underwriting, fraud review, and onboarding flows.
Functional requirements: multi-signal feature aggregation (bureau data, behavioral data, device signals, transactional history), model ensemble with configurable weights, calibrated probability output, and explainability export.
Non-functional requirements: P99 latency under 500ms, support for 500 concurrent scoring requests, deterministic scores for identical inputs, and a full audit log per score for regulatory lookups.
Core Data Model
- score_requests(request_id PK UUID, entity_type ENUM('user','business'), entity_id, purpose ENUM('credit','fraud','onboarding'), requested_by, created_at)
- score_results(request_id FK PK, composite_score FLOAT, risk_tier ENUM('low','medium','high','very_high'), model_scores JSONB, feature_values JSONB, explanation JSONB, model_ensemble_version, created_at)
- model_registry(model_id PK, name, version, type ENUM('lgbm','logistic','neural'), purpose, onnx_artifact_path, weight FLOAT, calibration_params JSONB, deployed_at, active BOOL)
- feature_sources(source_id PK, name, signal_type, fetch_method ENUM('sync_http','cache','batch'), timeout_ms, required BOOL)
- entity_feature_cache(entity_id, feature_source_id, feature_values JSONB, computed_at, expires_at)
- calibration_models(model_id FK, method ENUM('platt','isotonic'), params JSONB, trained_at)
Multi-Signal Feature Aggregation
Feature signals are fetched in parallel with a configurable timeout per source. Signal types include: credit bureau trade lines and derogatory marks (synchronous HTTP to bureau API), internal transaction behavioral features (pre-computed, served from the feature cache), device reputation and identity signals (sync HTTP to device intelligence vendor), application data submitted by the user (passed inline in the request), and social/network graph features (async pre-computed batch). The aggregation layer uses a scatter-gather pattern: all required sources are fetched concurrently; the response waits for required sources (up to their timeout) and proceeds with available optional sources.
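A minimal asyncio sketch of the scatter-gather step described above. The source descriptors, names, and `fetch_source` stub are hypothetical stand-ins for rows in feature_sources and their real fetchers; the point is the per-source timeout handling, where required sources propagate a timeout while optional sources are skipped.

```python
import asyncio
from typing import Any

# Hypothetical source descriptors mirroring the feature_sources table.
SOURCES = [
    {"name": "bureau", "timeout_ms": 300, "required": True},
    {"name": "device_intel", "timeout_ms": 150, "required": True},
    {"name": "graph_features", "timeout_ms": 100, "required": False},
]

async def fetch_source(name: str) -> dict[str, Any]:
    """Placeholder for a real fetch (sync HTTP call, cache read, etc.)."""
    await asyncio.sleep(0.01)
    return {f"{name}_signal": 1.0}

async def gather_features(sources: list[dict]) -> dict[str, Any]:
    """Scatter-gather: fan out all fetches concurrently, enforce each
    source's own timeout. A required source that times out raises; an
    optional one just contributes nothing."""
    async def guarded(src: dict) -> dict[str, Any]:
        try:
            return await asyncio.wait_for(
                fetch_source(src["name"]), timeout=src["timeout_ms"] / 1000
            )
        except asyncio.TimeoutError:
            if src["required"]:
                raise
            return {}  # optional source: proceed without it

    slices = await asyncio.gather(*(guarded(s) for s in sources))
    merged: dict[str, Any] = {}
    for s in slices:
        merged.update(s)
    return merged

features = asyncio.run(gather_features(SOURCES))
```

Because the gather waits on all sources concurrently, worst-case latency is bounded by the slowest per-source timeout, not the sum of all fetch times.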
Fetched features are normalized to a canonical schema defined per model. Feature engineering transformations (log scaling, binning, one-hot encoding) are defined as a versioned transformation pipeline co-deployed with each model, ensuring the feature representation at inference matches what the model was trained on.
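One way to sketch a versioned transformation pipeline co-deployed with a model, assuming hypothetical raw feature names (`balance`, `utilization`) and a pipeline of named pure functions:

```python
import math

# Hypothetical pipeline version shipped alongside model v3; each step
# is a pure function over the raw (post-aggregation) feature dict.
PIPELINE_V3 = [
    ("log_balance", lambda f: math.log1p(f["balance"])),   # log scaling
    ("util_bucket", lambda f: min(int(f["utilization"] * 10), 9)),  # binning
]

def transform(raw: dict) -> dict:
    """Apply the versioned pipeline so the inference-time feature
    representation matches what the model was trained on."""
    return {name: fn(raw) for name, fn in PIPELINE_V3}
```

Pinning the pipeline to the model version means a model rollback automatically rolls back its feature engineering too.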
Model Ensemble
Multiple models are combined using a weighted average ensemble. Each model in model_registry has an assigned weight; weights sum to 1.0. Each model receives the same feature vector and produces a raw probability output. The ensemble output is: composite_raw = SUM(model.weight * model.raw_score). Weights are managed in the database and can be adjusted without code deployment, enabling gradual rollout of new models by increasing their weight incrementally (shadow mode at weight=0, canary at weight=0.1, then ramp).
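The weighted-average combination above reduces to a few lines; this sketch uses hypothetical model names and weights to illustrate the canary rollout, with a guard for the weights-sum-to-1.0 invariant:

```python
def ensemble_score(model_scores: dict[str, float],
                   weights: dict[str, float]) -> float:
    """composite_raw = SUM(model.weight * model.raw_score).

    Weights come from model_registry and must sum to 1.0. A shadow-mode
    model carries weight 0.0: its scores are still produced and logged
    but contribute nothing to the composite.
    """
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1.0"
    return sum(weights[m] * s for m, s in model_scores.items())

# Canary rollout: hypothetical new neural model ramped in at weight 0.1.
composite = ensemble_score(
    {"lgbm_v3": 0.72, "logistic_v1": 0.65, "neural_v1": 0.80},
    {"lgbm_v3": 0.6, "logistic_v1": 0.3, "neural_v1": 0.1},
)
```

Since weights live in the database, the ramp from 0.0 to 0.1 to full weight is a data change, not a deployment.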
Calibration corrects for systematic over- or under-confidence in raw model outputs. Platt scaling (logistic regression on model output vs. true label) and isotonic regression are the two supported methods, with calibration parameters stored per model. The calibration_models table stores the fitted parameters; calibration is applied in the scoring service after raw inference, before ensemble aggregation.
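A sketch of applying stored Platt parameters at inference time. The `(a, b)` parameterization here (sigmoid of an affine transform of the raw score) is one common form; the actual parameters would be fitted offline against labeled outcomes and read from calibration_models.params:

```python
import math

def apply_platt(raw_score: float, a: float, b: float) -> float:
    """Platt scaling: squash the raw model output through a fitted
    logistic curve, sigmoid(a * raw + b), to get a calibrated
    probability. Parameters a, b are hypothetical fitted values."""
    return 1.0 / (1.0 + math.exp(-(a * raw_score + b)))
```

Per the design above, this runs per model after raw inference and before the weighted ensemble combination, so the ensemble averages calibrated probabilities.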
Score Explanation
Explanation is generated using SHAP (SHapley Additive exPlanations) values computed at inference time for tree models, or approximated using LIME for neural models. The explanation JSONB field stores the top-10 features by absolute SHAP value, including the feature name, the entity's value for that feature, the population median for context, and the SHAP contribution (signed float). This structure supports: adverse action notices (which negative features most impacted the score), analyst review panels, and regulatory inquiry responses. Explanation generation adds approximately 10-20ms overhead for tree-based SHAP — acceptable within the 500ms budget.
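Given per-feature SHAP values from the explainer, assembling the stored explanation structure is a ranking step. This sketch assumes the SHAP values, entity feature values, and population medians arrive as plain dicts (feature names here are illustrative):

```python
def build_explanation(shap_values: dict[str, float],
                      feature_values: dict[str, float],
                      population_medians: dict[str, float],
                      top_n: int = 10) -> list[dict]:
    """Rank features by absolute SHAP contribution and emit the
    records stored in score_results.explanation: feature name, the
    entity's value, the population median for context, and the
    signed contribution."""
    ranked = sorted(shap_values, key=lambda f: abs(shap_values[f]),
                    reverse=True)
    return [
        {
            "feature": f,
            "value": feature_values[f],
            "population_median": population_medians[f],
            "shap_contribution": shap_values[f],
        }
        for f in ranked[:top_n]
    ]
```

Keeping the contribution signed matters for adverse action notices: only negative contributors are candidates for "reasons the score was worse".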
Calibration and Monitoring
Score calibration quality degrades over time as population distributions shift (model drift). A daily calibration job evaluates the Brier score and Expected Calibration Error (ECE) on the rolling 30-day labeled dataset. If ECE exceeds a threshold, an alert fires and the calibration parameters are retrained on the latest data. The model_ensemble_version field in score_results allows post-hoc analysis of score quality by ensemble cohort.
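The ECE check the daily job runs can be sketched directly from its definition: bucket predictions by confidence, compare each bucket's mean predicted probability to its observed positive rate, and average the gaps weighted by bucket size.

```python
def expected_calibration_error(probs: list[float],
                               labels: list[int],
                               n_bins: int = 10) -> float:
    """ECE over equal-width confidence bins; 0.0 means the predicted
    probabilities match observed outcome rates in every bin."""
    bins: list[list[tuple[float, int]]] = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # clamp p == 1.0 into top bin
        bins[idx].append((p, y))
    ece, total = 0.0, len(probs)
    for bucket in bins:
        if not bucket:
            continue
        avg_p = sum(p for p, _ in bucket) / len(bucket)
        frac_pos = sum(y for _, y in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_p - frac_pos)
    return ece
```

The alert threshold itself is a policy choice; the 30-day rolling window from the design bounds how stale the labels feeding this metric can be.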
Scalability Considerations
- Feature cache: Pre-computing expensive features (bureau queries cost $0.05-$0.50 each) and caching them for 24-48 hours for the same entity reduces both latency and cost. Cache population is triggered on first miss or via a nightly batch refresh for high-activity entities.
- Model artifact loading: ONNX models are loaded into memory once at process startup and shared across request threads. For large ensembles, use a process-level model pool; ONNX Runtime InferenceSession objects support concurrent Run() calls, so sessions need not be duplicated per thread.
- Determinism: Fixed random seeds, deterministic ONNX execution providers, and feature snapshot storage (feature_values in score_results) ensure the score can be exactly reproduced for audit purposes.
- Horizontal scaling: The scoring service is stateless; scale horizontally. Route by entity_id for cache locality if the feature cache is in-process rather than external.
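The determinism bullet above implies an audit replay path: hash the stored feature snapshot so its integrity is checkable, then re-score it through the pinned model version. A minimal sketch, with `score_fn` standing in for the pinned ensemble:

```python
import hashlib
import json

def snapshot_hash(feature_values: dict) -> str:
    """Content-addressed hash of the feature snapshot stored in
    score_results.feature_values; key order must not affect it."""
    canonical = json.dumps(feature_values, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

def replay(snapshot: dict, score_fn) -> float:
    """Re-score the pinned snapshot through the pinned model version.
    With fixed seeds and deterministic ONNX execution providers, the
    result must match the originally recorded composite score."""
    return score_fn(snapshot)
```

During a regulatory lookup, a mismatch between the replayed and recorded score is itself a finding: it means determinism was violated somewhere in the stack.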
API Design
- POST /scores — primary scoring endpoint; accepts entity_id, entity_type, purpose, and inline features; returns composite score, tier, model breakdown, and explanation
- GET /scores/{request_id} — retrieve a stored score result with the full feature snapshot for audit
- GET /entities/{id}/score-history — time series of scores for an entity, useful for trend analysis
- POST /models/{id}/activate — add a model to the active ensemble with a specified weight
- GET /models/{id}/calibration-report — current Brier score and ECE for a model over the rolling window
- POST /features/precompute — batch endpoint to trigger feature pre-computation for a list of entity IDs