Capacity Planning Service: Overview and Requirements
A capacity planning service continuously monitors infrastructure resource consumption, projects future demand using historical time-series data, alerts when projected usage will breach headroom thresholds, and triggers automated provisioning workflows before a shortage occurs. It serves platform and infrastructure engineering teams who need to stay ahead of growth without over-provisioning.
Functional Requirements
- Ingest raw resource metrics (CPU, memory, disk, network, custom) from monitoring systems via push or pull.
- Compute demand projections for configurable horizons such as 7, 30, and 90 days.
- Evaluate headroom thresholds: alert when projected usage will exceed a fraction of available capacity within the planning horizon.
- Trigger provisioning workflows automatically when thresholds are breached, with human approval gates configurable per resource type.
- Expose a dashboard API for current utilization, projections, and historical capacity events.
Non-Functional Requirements
- Metric ingestion must handle up to 100,000 time series at 60-second resolution without data loss.
- For every monitored resource, projection computation must complete within 60 seconds of new data arriving.
- Alert delivery must occur within 5 minutes of a threshold breach being detected.
- The system must retain raw metrics for 90 days and downsampled data for 2 years.
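The ingestion and retention targets above imply a concrete write rate and raw storage volume. The arithmetic below is a rough sizing sketch; the 16 bytes per point is an assumption (an 8-byte timestamp plus an 8-byte float, before any compression), not a figure from the requirements.

```python
SERIES = 100_000          # time series, from the ingestion requirement
RESOLUTION_S = 60         # one point per series every 60 seconds
RAW_RETENTION_DAYS = 90   # raw retention, from the retention requirement
BYTES_PER_POINT = 16      # assumption: 8-byte timestamp + 8-byte float, uncompressed

points_per_second = SERIES / RESOLUTION_S
points_per_day = SERIES * (86_400 // RESOLUTION_S)
raw_points_retained = points_per_day * RAW_RETENTION_DAYS
raw_bytes = raw_points_retained * BYTES_PER_POINT

print(f"{points_per_second:,.0f} points/s sustained")
print(f"{points_per_day:,} points/day")
print(f"{raw_points_retained:,} raw points retained")
print(f"{raw_bytes / 1e9:,.1f} GB uncompressed")
```

Under these assumptions the store absorbs roughly 1,667 points per second and holds about 13 billion raw points, which is why bulk inserts and downsampling matter later in the design.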
Data Model
- MetricSeries: series_id, resource_id, resource_type (cpu | memory | disk | network | custom), unit, labels (key-value map), retention_policy, created_at.
- MetricPoint: series_id, timestamp, value. Stored in a time-series database with columnar storage or compression, such as ClickHouse or TimescaleDB, partitioned by time.
- CapacityConfig: config_id, resource_id, planning_horizon_days, headroom_threshold (fraction 0-1), projection_model (linear | exponential | seasonal), approval_required (bool), provisioning_workflow_id.
- Projection: projection_id, series_id, computed_at, horizon_days, model_type, projected_values (time-series JSON), confidence_interval_lower, confidence_interval_upper, breach_date (nullable).
- Alert: alert_id, series_id, config_id, triggered_at, projected_breach_date, current_utilization, threshold, severity (warning | critical), status (open | acknowledged | resolved), resolved_at.
- ProvisioningEvent: event_id, alert_id, workflow_id, status (pending_approval | approved | executing | completed | failed), requested_at, completed_at, provisioned_units.
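As an illustration of how one of these records might look in code, here is a minimal sketch of CapacityConfig as a Python dataclass. The field names come from the data model above; the validation rules (threshold within 0 to 1, a positive horizon, a known model name) are assumptions about what a reasonable implementation would enforce.

```python
from dataclasses import dataclass
from typing import Optional

PROJECTION_MODELS = {"linear", "exponential", "seasonal"}

@dataclass(frozen=True)
class CapacityConfig:
    config_id: str
    resource_id: str
    planning_horizon_days: int
    headroom_threshold: float          # fraction of available capacity, 0-1
    projection_model: str              # one of PROJECTION_MODELS
    approval_required: bool
    provisioning_workflow_id: Optional[str] = None

    def __post_init__(self):
        # Assumed validation rules; the source only states the field types.
        if not 0.0 <= self.headroom_threshold <= 1.0:
            raise ValueError("headroom_threshold must be between 0 and 1")
        if self.planning_horizon_days <= 0:
            raise ValueError("planning_horizon_days must be positive")
        if self.projection_model not in PROJECTION_MODELS:
            raise ValueError(f"unknown projection_model: {self.projection_model}")

cfg = CapacityConfig("cfg-1", "disk-pool-a", 30, 0.8, "linear", True, "wf-expand-disk")
```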
Metric Ingestion Pipeline
Metrics arrive via two paths: a pull collector that scrapes Prometheus-compatible endpoints on a configurable interval, and a push receiver that accepts OpenTelemetry OTLP payloads. Both paths write to a Kafka topic partitioned by series_id, providing backpressure and durability.
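Partitioning by series_id keeps every point of a series on one partition, so per-series ordering is preserved. The sketch below shows the idea with a stable hash; it is a stand-in for illustration and not the partitioner a Kafka client actually uses. The function name and the partition count of 12 are hypothetical.

```python
import hashlib

def partition_for(series_id: str, num_partitions: int) -> int:
    """Map a series_id to a partition deterministically (illustrative stand-in)."""
    digest = hashlib.sha256(series_id.encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer, then reduce to a partition.
    return int.from_bytes(digest[:8], "big") % num_partitions

# The same series always lands on the same partition.
p1 = partition_for("host-42/cpu", 12)
p2 = partition_for("host-42/cpu", 12)
```

Python's built-in hash() is deliberately avoided here because string hashing is randomized per process, which would break the same-key, same-partition guarantee across producers.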
A stream processor consumes from Kafka and writes batches to the time-series store using bulk insert APIs. Downsampling runs as a continuous aggregate job that computes hourly and daily rollups from the raw 60-second data, reducing storage and accelerating historical queries used by projection models.
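The hourly rollup can be pictured as a group-by on the hour boundary. The following is a minimal in-memory sketch of that aggregation, assuming points arrive as (unix_seconds, value) pairs; a real deployment would more likely express this as a continuous aggregate inside the time-series store. Which statistics to keep (average, max, count here) is also an assumption.

```python
from collections import defaultdict

def hourly_rollup(points):
    """Aggregate (unix_seconds, value) pairs into per-hour (start, avg, max, count)."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % 3600].append(value)   # truncate to the hour start
    return [
        (hour, sum(vals) / len(vals), max(vals), len(vals))
        for hour, vals in sorted(buckets.items())
    ]

rollup = hourly_rollup([(0, 1.0), (60, 3.0), (3600, 5.0)])
```

Keeping the max alongside the average is useful for capacity work, since averaging alone hides the peaks that actually exhaust headroom.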
Projection Models
Linear Regression
For resources with steady growth trends, the service fits a least-squares linear model over a training window of 30 days. The slope gives the daily growth rate. The projection extrapolates this slope over the planning horizon with a confidence interval derived from the residual standard error.
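A minimal sketch of this model in plain Python follows, assuming one value per day. The band here is a simplification: it uses a constant width of z times the residual standard error and ignores the widening that a full prediction interval has further from the training data. The function names and the default z of 1.96 are illustrative choices.

```python
import math

def fit_linear(values):
    """Least-squares fit of daily values at x = 0..n-1.
    Returns (slope, intercept, residual_standard_error)."""
    n = len(values)
    if n < 3:
        raise ValueError("need at least 3 points")
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in enumerate(values))
    return slope, intercept, math.sqrt(sse / (n - 2))

def project(values, horizon_days, z=1.96):
    """Return (lower, mid, upper) for each of the next horizon_days days."""
    slope, intercept, rse = fit_linear(values)
    last_x = len(values) - 1
    out = []
    for h in range(1, horizon_days + 1):
        mid = intercept + slope * (last_x + h)
        out.append((mid - z * rse, mid, mid + z * rse))
    return out

history = [10 + 2 * day for day in range(30)]   # 30-day window, +2 units/day
slope, intercept, rse = fit_linear(history)
week_ahead = project(history, 7)
```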
Exponential Smoothing
For resources with accelerating growth such as user data storage, an exponential smoothing model applies higher weight to recent observations. The smoothing factor alpha is tuned per series using cross-validation on held-out recent data.
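The alpha tuning step can be sketched as a small grid search: run the smoother over the series and score each candidate alpha by its one-step-ahead squared error on the most recent points. This uses simple exponential smoothing, and the candidate grid and 7-point holdout are assumptions.

```python
def holdout_mse(values, alpha, holdout):
    """Mean squared one-step-ahead error over the last `holdout` points."""
    level = values[0]
    split = len(values) - holdout
    squared_error = 0.0
    for i in range(1, len(values)):
        if i >= split:
            # `level` is the forecast for values[i], made before seeing it.
            squared_error += (values[i] - level) ** 2
        level = alpha * values[i] + (1 - alpha) * level
    return squared_error / holdout

def tune_alpha(values, holdout=7,
               candidates=(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)):
    if len(values) <= holdout:
        raise ValueError("series too short for the requested holdout")
    return min(candidates, key=lambda a: holdout_mse(values, a, holdout))

growing = [float(day) for day in range(30)]
best_alpha = tune_alpha(growing)
```

Note that simple exponential smoothing forecasts a flat level, so on a steadily growing series the search favors a high alpha that tracks recent values closely. Extrapolating the growth itself would need a trend-aware variant such as Holt's method, which is outside this sketch.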
Seasonal Decomposition
Many infrastructure resources exhibit weekly seasonality: higher load on weekdays, lower on weekends. The service applies STL decomposition (Seasonal and Trend decomposition using Loess) to separate trend, seasonal, and residual components. Projections are made on the trend component and then seasonal patterns are re-added, producing more accurate short-term forecasts for cyclical workloads.
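The project-the-trend, re-add-the-seasonality idea can be shown without a statistics library. The sketch below is not STL: it swaps in a classical additive decomposition (centered moving average for the trend, per-weekday mean of the detrended values for the seasonal component) and extrapolates the trend linearly. It assumes daily data, an odd period, and at least two full periods of history.

```python
def decompose(values, period=7):
    """Classical additive decomposition (a simplified stand-in for STL)."""
    if period % 2 == 0 or len(values) < 2 * period:
        raise ValueError("need an odd period and at least two full periods")
    half = period // 2
    # Centered moving average; defined only where a full window fits.
    trend = {
        i: sum(values[i - half:i + half + 1]) / period
        for i in range(half, len(values) - half)
    }
    sums, counts = [0.0] * period, [0] * period
    for i, t in trend.items():
        sums[i % period] += values[i] - t
        counts[i % period] += 1
    seasonal = [s / c for s, c in zip(sums, counts)]
    offset = sum(seasonal) / period          # center the seasonal component
    return trend, [s - offset for s in seasonal]

def forecast(values, horizon, period=7):
    trend, seasonal = decompose(values, period)
    first, last = min(trend), max(trend)
    slope = (trend[last] - trend[first]) / (last - first)
    n = len(values)
    return [
        trend[last] + slope * (x - last) + seasonal[x % period]
        for x in range(n, n + horizon)
    ]

# Four weeks of data: +1 unit/day growth, weekdays +10, weekends -25.
history = [100 + d + (10 if d % 7 < 5 else -25) for d in range(28)]
next_week = forecast(history, 7)
```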
Threshold Alerting
After each projection run, the service evaluates CapacityConfig records to find the date at which the projected upper confidence bound will reach the headroom threshold, expressed as a fraction of available capacity. If that breach date falls within the planning horizon, an Alert record is created and a notification is dispatched via the alerting pipeline.
- Alert deduplication suppresses repeated alerts for the same series and config if an open alert already exists with a breach date within 7 days of the new one.
- Alert severity is set to warning when the breach date is more than 14 days out and to critical when it is within 14 days.
- On-call routing sends critical alerts to PagerDuty; warning alerts go to Slack channels subscribed to the resource group.
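The breach check, severity rule, and deduplication window described above can be sketched as three small functions. The function names are illustrative, and the sketch assumes the projection supplies one upper-bound value per day of the planning horizon, so any breach it finds is within the horizon by construction.

```python
from datetime import date, timedelta

def find_breach_date(upper_bounds, capacity, headroom_threshold, today):
    """upper_bounds[i] is the projected upper bound for today + (i + 1) days."""
    limit = capacity * headroom_threshold
    for i, usage in enumerate(upper_bounds):
        if usage >= limit:
            return today + timedelta(days=i + 1)
    return None   # no breach inside the horizon

def severity(breach_date, today):
    # Critical within 14 days, warning when further out.
    return "critical" if (breach_date - today).days <= 14 else "warning"

def is_duplicate(new_breach, open_alert_breach_dates, window_days=7):
    """True if an open alert already has a breach date within the window."""
    return any(abs((new_breach - b).days) <= window_days
               for b in open_alert_breach_dates)

today = date(2025, 1, 1)
breach = find_breach_date([70, 75, 82, 90], capacity=100,
                          headroom_threshold=0.8, today=today)
```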
Automated Provisioning Triggers
When approval_required is false, a confirmed alert directly enqueues a ProvisioningEvent linked to the configured workflow. Workflows are implemented as steps in an orchestration engine such as Temporal or AWS Step Functions, allowing complex provisioning sequences — for example, requesting cloud instances, waiting for them to be ready, and registering them with the load balancer — to be modeled as durable, retryable workflows.
When approval_required is true, the ProvisioningEvent is created in pending_approval status and a Slack approval request is sent to the owning team. Clicking Approve or Reject on the Slack interactive message transitions the event status and either triggers or cancels the workflow.
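The ProvisioningEvent lifecycle can be modeled as a small state machine. In this sketch, the `rejected` terminal status for declined requests and the choice to start auto-approved events in `approved` are assumptions made for illustration; the other statuses come from the data model.

```python
TRANSITIONS = {
    "pending_approval": {"approved", "rejected"},   # "rejected" is assumed
    "approved": {"executing"},
    "executing": {"completed", "failed"},
    "completed": set(),
    "failed": set(),
    "rejected": set(),
}

def initial_status(approval_required: bool) -> str:
    # Assumption: events that skip the approval gate start as approved.
    return "pending_approval" if approval_required else "approved"

def transition(current: str, new: str) -> str:
    """Return the new status, or raise if the move is not allowed."""
    if new not in TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition: {current} -> {new}")
    return new

status = initial_status(approval_required=True)
status = transition(status, "approved")
status = transition(status, "executing")
```

Rejecting illegal moves in one place makes duplicate Slack button clicks harmless: a second "approve" on an event that is already executing simply fails validation.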
API Design
- GET /series: list monitored metric series with labels and retention policies.
- GET /series/{id}/metrics?from=&to=&step=: query raw or downsampled metric data.
- GET /series/{id}/projection?horizon_days=: retrieve the latest projection with confidence intervals and breach date.
- GET /alerts: list open and recent alerts with filters on severity and resource group.
- PATCH /alerts/{id}/acknowledge: acknowledge an alert with a comment.
- GET /provisioning-events: view provisioning event history and current status.
- POST /configs: create or update a CapacityConfig for a resource series.
Scalability and Observability
Projection computation is embarrassingly parallel across series. A worker pool processes series in batches, prioritizing series whose last projection is oldest. The time-series store handles fan-out read queries for projection training windows efficiently via its columnar storage format.
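The oldest-first batching can be sketched with the standard library. This assumes a mapping from series_id to the time of its last projection and a per-series `project_fn` supplied by the caller; both names are hypothetical. A thread pool is used for brevity, whereas CPU-bound model fitting would more likely run in separate processes or on separate workers.

```python
from concurrent.futures import ThreadPoolExecutor

def batches_oldest_first(last_projected_at, batch_size):
    """Order series by last projection time (oldest first) and chunk them."""
    ordered = sorted(last_projected_at, key=last_projected_at.get)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]

def run_projections(last_projected_at, project_fn, batch_size=100, workers=4):
    """Run project_fn over every series, stalest batches first."""
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for batch in batches_oldest_first(last_projected_at, batch_size):
            # pool.map preserves input order, so zip pairs ids with results.
            for series_id, result in zip(batch, pool.map(project_fn, batch)):
                results[series_id] = result
    return results

last_run = {"a": 30, "b": 10, "c": 20}   # series_id -> last projection timestamp
order = batches_oldest_first(last_run, batch_size=2)
done = run_projections(last_run, str.upper, batch_size=2)
```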
Key internal metrics: metric ingestion lag per Kafka partition, projection computation duration per model type, alert-to-notification latency, provisioning workflow success rate, and model accuracy measured as mean absolute percentage error of 7-day-ahead projections against actuals on a rolling basis.
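The accuracy metric above, mean absolute percentage error, is straightforward to compute once 7-day-ahead projections are paired with the actuals that later arrive. In this sketch, skipping points whose actual value is zero is an assumption made to avoid division by zero.

```python
def mape(actuals, forecasts):
    """Mean absolute percentage error, in percent."""
    # Assumption: points with a zero actual are skipped rather than scored.
    pairs = [(a, f) for a, f in zip(actuals, forecasts) if a != 0]
    if not pairs:
        raise ValueError("no non-zero actuals to score against")
    return 100.0 * sum(abs((a - f) / a) for a, f in pairs) / len(pairs)

error_pct = mape([100, 200], [110, 180])   # each forecast is off by 10%
```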