Question 1

Why partition CloudCostLineItem by billing_date instead of a single table?

Accepted Answer

AWS billing data grows by roughly 30 line items per resource per day. A company with 5,000 EC2 instances and hundreds of S3 buckets can accumulate 500,000+ rows per day, or about 180M rows per year. A single unpartitioned table holding 2 years of history (about 360M rows) makes anomaly detection queries, which scan 28-day windows, extremely slow even with indexes.

With monthly partitions, each partition covers one calendar month. A 28-day anomaly lookback falls within the current partition and, at most, the previous one, so a query that filters on billing_date touches a maximum of 2 partitions. Old partitions (12+ months) are DETACHed and moved to S3 as Parquet for Athena queries. Partition DETACH is effectively O(1): it is a metadata operation that removes the partition from the parent table without touching the data. The active table stays lean; historical data remains queryable via Athena.
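The two-partition bound can be checked with a short sketch. The `YYYY_MM` partition key format and the function name `partitions_touched` are assumptions made for illustration; they do not come from the schema described above.

```python
from datetime import date, timedelta


def partitions_touched(end_date: date, lookback_days: int = 28) -> list[str]:
    """Return the monthly partition keys (hypothetical YYYY_MM format)
    that an inclusive lookback window ending on end_date overlaps."""
    start = end_date - timedelta(days=lookback_days - 1)
    keys = []
    year, month = start.year, start.month
    # Walk month by month from the window start to the window end.
    while (year, month) <= (end_date.year, end_date.month):
        keys.append(f"{year:04d}_{month:02d}")
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return keys


# A window ending mid-month reaches back into the previous partition.
print(partitions_touched(date(2024, 3, 10)))
# A window ending late in a long month stays inside one partition.
print(partitions_touched(date(2024, 3, 31)))
```

Spanning three monthly partitions would need a window of at least 30 days (one day, a full 28-day February, one more day), which is why a 28-day lookback never exceeds two.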

Question 2

How does the proportional pass work for untagged shared costs like NAT gateway or load balancer?

Accepted Answer

Shared infrastructure (NAT gateways, load balancers, VPC endpoints, data transfer) serves all teams but carries no single team's tag. Without allocation, it lands as "untagged" overhead that never appears on any team's dashboard. The proportional pass distributes untagged costs according to each team's share of total tagged spend: if the payments team spent $800 and the identity team spent $200 (total tagged = $1,000), the payments team absorbs 80% of untagged costs and identity absorbs 20%. This is a "you use more infrastructure, you pay more shared overhead" model.

A fairer alternative for specific shared resources is to tag the load balancer with team=shared and split its cost by request count per team, which requires application-level metrics. The simple proportional-by-spend model is accurate enough for most cost reviews and requires no additional instrumentation.
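A minimal sketch of the proportional pass, using the $800 / $200 split from the answer. The function name, the $150 untagged total, and the choice to give any rounding remainder to the largest spender are assumptions for illustration, not part of the described system.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")


def allocate_untagged(tagged_spend: dict[str, Decimal],
                      untagged_total: Decimal) -> dict[str, Decimal]:
    """Split untagged cost across teams in proportion to their tagged spend."""
    total_tagged = sum(tagged_spend.values(), Decimal("0"))
    if total_tagged <= 0:
        raise ValueError("no tagged spend to use as an allocation key")
    shares = {
        team: (untagged_total * spend / total_tagged).quantize(CENT, ROUND_HALF_UP)
        for team, spend in tagged_spend.items()
    }
    # Assumed policy: rounding leftovers go to the largest spender so the
    # allocated shares always sum back to the untagged total.
    remainder = untagged_total - sum(shares.values(), Decimal("0"))
    if remainder:
        largest = max(tagged_spend, key=tagged_spend.get)
        shares[largest] += remainder
    return shares


tagged = {"payments": Decimal("800"), "identity": Decimal("200")}
print(allocate_untagged(tagged, Decimal("150.00")))  # hypothetical $150 of untagged cost
```

Decimal is used rather than float so that the allocated shares reconcile to the cent against the billing total.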

Question 3

How do you handle cost spikes that are legitimate, not anomalies?

Accepted Answer

Z-score anomaly detection fires on any statistically unusual day, including legitimate spikes like "we run a batch ML training job on the 15th of every month" or "Black Friday traffic tripled our database costs." Two approaches reduce the noise:

(1) Acknowledged anomalies: engineers mark resolved anomalies as acknowledged=TRUE in CostAnomaly. The weekly anomaly report shows only unacknowledged rows. After "monthly ML training job" has been acknowledged, that signature no longer alerts.

(2) Expected events: maintain a CostExpectedEvent table (date, service, reason, expected_multiplier). Before inserting an anomaly, check whether the billing date has a CostExpectedEvent row for this service; if it does, skip the z-score check. Teams register planned batch jobs, migrations, or load tests in advance.

This shifts anomaly detection from reactive (alert on anything unusual) to contextual (alert only on unexplained unusual spend).
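The expected-events check in approach (2) might look like the sketch below. The threshold of 3.0, the function names, and the representation of CostExpectedEvent rows as a set of (date, service) pairs are all assumptions; the answer does not specify them.

```python
from datetime import date
from statistics import mean, stdev

Z_THRESHOLD = 3.0  # assumed cutoff; the answer does not state one


def z_score(history: list[float], today_cost: float) -> float:
    """Standard score of today's cost against the lookback window."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # Perfectly flat history: any deviation is infinitely unusual.
        return 0.0 if today_cost == mu else float("inf")
    return (today_cost - mu) / sigma


def should_alert(service: str, billing_date: date, history: list[float],
                 today_cost: float,
                 expected_events: set[tuple[date, str]]) -> bool:
    """Skip the z-score check when a planned event is registered."""
    if (billing_date, service) in expected_events:
        return False
    if len(history) < 2:
        return False  # not enough data to estimate a spread
    return abs(z_score(history, today_cost)) > Z_THRESHOLD


history = [100.0] * 14 + [110.0] * 14  # 28 days of hypothetical daily cost
black_friday = date(2024, 11, 29)
print(should_alert("rds", black_friday, history, 300.0, set()))
print(should_alert("rds", black_friday, history, 300.0, {(black_friday, "rds")}))
```

The same spike alerts in the first call and is suppressed in the second because the event was registered in advance.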

Question 4

How do you allocate costs to individual customers for SaaS unit economics?

Accepted Answer

Per-customer cost attribution adds a customer dimension to allocation. Where dedicated resources exist, tag them directly: for example, tag EC2 instances or RDS clusters with "customer=acme-corp". For shared infrastructure serving multiple customers, use usage-based splitting, with API calls, storage bytes, or compute minutes per customer (from application metrics) as the allocation key.

Schema addition: a customer_id column in CostAllocation. Query: SELECT c.customer_name, SUM(ca.allocated_cost) AS monthly_infra_cost FROM CostAllocation ca JOIN Customer c USING (customer_id) WHERE ca.billing_date >= $month_start AND ca.billing_date < $next_month_start GROUP BY c.customer_name ORDER BY monthly_infra_cost DESC.

Compare the result with customer MRR: if ACME pays $10K/month but costs $8K to serve, gross margin is ($10K - $8K) / $10K = 20%, well below target. This per-customer P&L drives pricing decisions and identifies customers to upsell to dedicated tiers.
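As a rough sketch, usage-based splitting and the margin check can be combined as follows. The function names, the second customer "globex", and the individual dollar figures other than ACME's $10K MRR and $8K total cost are hypothetical.

```python
def split_by_usage(shared_cost: float, usage: dict[str, float]) -> dict[str, float]:
    """Split a shared cost across customers by a usage key (e.g. API calls)."""
    total = sum(usage.values())
    if total <= 0:
        raise ValueError("usage key sums to zero; cannot allocate")
    return {customer: shared_cost * units / total
            for customer, units in usage.items()}


def gross_margin(mrr: float, cost_to_serve: float) -> float:
    """Gross margin as a fraction of monthly recurring revenue."""
    if mrr <= 0:
        raise ValueError("MRR must be positive")
    return (mrr - cost_to_serve) / mrr


# Hypothetical month: $4,000 of shared infrastructure split by API calls,
# plus $5,000 of infrastructure tagged customer=acme-corp.
shared = split_by_usage(4000.0, {"acme-corp": 750_000, "globex": 250_000})
acme_cost = 5000.0 + shared["acme-corp"]
print(acme_cost, gross_margin(10_000.0, acme_cost))
```

With these inputs ACME's cost to serve comes to $8,000, matching the 20% margin example in the answer.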

Question 5

How do you forecast next month's cloud spend to set accurate budgets?

Accepted Answer

Simple run-rate forecast: compute average daily spend over the last 90 days per team, then multiply by the number of days in next month. The query below uses 30 as an approximation and groups by team only; add service to both GROUP BY clauses for a per-service forecast. SELECT team, AVG(daily_cost) * 30 AS forecast_30d FROM (SELECT team, billing_date, SUM(allocated_cost) AS daily_cost FROM CostAllocation WHERE billing_date > NOW() - INTERVAL '90 days' GROUP BY team, billing_date) sub GROUP BY team.

Adjust for known growth: if active users are growing 10%/month and compute scales linearly with users, multiply the forecast by 1.1. Budget alert threshold: take the forecast plus a 10% buffer as the budget and alert when spend reaches 90% of it. Because the forecast is recomputed from a rolling window, the threshold adjusts automatically as spend grows.

For capital planning, use the slope of the 90-day trend to project 12-month spend. Alert finance when the slope implies exceeding the annual cloud budget by more than 15%, early enough to either optimize or renegotiate contracts.
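The forecast can also be computed in application code, which makes it easy to use the real number of days in next month instead of a fixed 30. This is a sketch under stated assumptions: the function name and its parameters are hypothetical, and `daily_costs` stands for one team's daily totals over the lookback window.

```python
import calendar
from datetime import date


def forecast_next_month(daily_costs: list[float], today: date,
                        growth: float = 1.0) -> float:
    """Average daily spend times days in next month, scaled by a growth factor."""
    if not daily_costs:
        raise ValueError("need at least one day of history")
    avg_daily = sum(daily_costs) / len(daily_costs)
    # Roll over to January of the following year after December.
    if today.month == 12:
        year, month = today.year + 1, 1
    else:
        year, month = today.year, today.month + 1
    days_next_month = calendar.monthrange(year, month)[1]
    return avg_daily * days_next_month * growth


history = [100.0] * 90  # hypothetical: a flat $100/day for 90 days
print(forecast_next_month(history, date(2024, 1, 15)))       # February 2024 has 29 days
print(forecast_next_month(history, date(2024, 1, 15), 1.1))  # with 10%/month growth
```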

Cost Allocation System Low-Level Design: Cloud Billing Ingestion, Team Attribution, and Anomaly Detection

Core Data Model

Ingestion Pipeline

Allocation Engine

Anomaly Detection

Budget Alert Check

Key Design Decisions