Cost Allocation System: Low-Level Design
A cost allocation system attributes infrastructure spend to the teams, products, or customers that generated it. Without it, a single AWS bill lands on a shared cost center and engineers have no incentive to optimize. With it, teams see their cloud spend in dashboards, anomalies surface within hours, and chargeback to customers enables accurate SaaS unit economics. This design covers the data pipeline from cloud billing APIs to per-team attribution, anomaly detection, and budget alerting.
Core Data Model
CREATE TABLE CloudCostLineItem (
line_item_id BIGSERIAL PRIMARY KEY,
cloud_provider VARCHAR(20) NOT NULL, -- aws, gcp, azure
billing_date DATE NOT NULL,
service VARCHAR(100) NOT NULL, -- EC2, S3, BigQuery
resource_id VARCHAR(255), -- i-0abc123, arn:aws:s3:::my-bucket
tags JSONB NOT NULL DEFAULT '{}',-- {"team":"payments","env":"prod","customer":"acme"}
usage_amount NUMERIC(18,6) NOT NULL,
usage_unit VARCHAR(50), -- Hrs, GB-Mo, Requests
unblended_cost NUMERIC(12,6) NOT NULL, -- USD
imported_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
) PARTITION BY RANGE (billing_date);
CREATE TABLE CostAllocation (
allocation_id BIGSERIAL PRIMARY KEY,
billing_date DATE NOT NULL,
team VARCHAR(100) NOT NULL,
product VARCHAR(100),
customer_id BIGINT,
service VARCHAR(100) NOT NULL,
allocated_cost NUMERIC(12,6) NOT NULL,
allocation_method VARCHAR(30) NOT NULL, -- tag, proportion, manual
source_line_items BIGINT[], -- IDs of contributing CloudCostLineItem rows
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
) PARTITION BY RANGE (billing_date);
CREATE TABLE CostBudget (
budget_id SERIAL PRIMARY KEY,
team VARCHAR(100) NOT NULL,
period_type VARCHAR(10) NOT NULL DEFAULT 'monthly', -- monthly, quarterly
period_start DATE NOT NULL,
budget_usd NUMERIC(12,2) NOT NULL,
alert_pct SMALLINT NOT NULL DEFAULT 80, -- alert when this % consumed
alerted_at TIMESTAMPTZ,
UNIQUE (team, period_start)
);
CREATE TABLE CostAnomaly (
anomaly_id BIGSERIAL PRIMARY KEY,
billing_date DATE NOT NULL,
team VARCHAR(100),
service VARCHAR(100) NOT NULL,
expected_cost NUMERIC(12,6),
actual_cost NUMERIC(12,6),
z_score NUMERIC(6,2),
acknowledged BOOLEAN NOT NULL DEFAULT FALSE,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
-- Expression index serves the allocation passes' tags->>'team' filters
-- (a plain btree on the whole JSONB column would not be used by those queries)
CREATE INDEX ON CloudCostLineItem(billing_date, (tags->>'team'));
-- Natural key so nightly re-ingestion is idempotent via ON CONFLICT DO NOTHING
CREATE UNIQUE INDEX ON CloudCostLineItem(cloud_provider, billing_date, service, COALESCE(resource_id, ''));
CREATE INDEX ON CostAllocation(billing_date, team);
CREATE INDEX ON CostAnomaly(billing_date, acknowledged);
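Monthly partitions for the two partitioned tables must exist before rows arrive. A minimal sketch of a helper that emits the DDL for one month; the `<parent>_YYYY_MM` naming scheme and the `IF NOT EXISTS` guard are assumptions, not part of the schema above:

```python
import datetime

def monthly_partition_ddl(parent: str, month_start: datetime.date) -> str:
    """Emit DDL for one monthly range partition of `parent`.

    Naming scheme (an assumption): <parent>_YYYY_MM.
    """
    # First day of the following month is the exclusive upper bound.
    if month_start.month == 12:
        next_month = month_start.replace(year=month_start.year + 1, month=1, day=1)
    else:
        next_month = month_start.replace(month=month_start.month + 1, day=1)
    name = f"{parent}_{month_start:%Y_%m}"
    return (
        f"CREATE TABLE IF NOT EXISTS {name} PARTITION OF {parent} "
        f"FOR VALUES FROM ('{month_start.isoformat()}') TO ('{next_month.isoformat()}')"
    )
```

A nightly job can run this for the current and next month; after 12 months a partition is DETACHed and exported to Parquet, per the design notes below.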
Ingestion Pipeline
import boto3, json, datetime
from decimal import Decimal
def ingest_aws_cost_explorer(billing_date: datetime.date):
"""
Pull daily costs from AWS Cost Explorer. Run nightly after billing data is finalized (~24h lag).
"""
ce = boto3.client('ce', region_name='us-east-1')
date_str = billing_date.isoformat()
paginator_token = None
while True:
kwargs = {
'TimePeriod': {'Start': date_str, 'End': (billing_date + datetime.timedelta(1)).isoformat()},
'Granularity': 'DAILY',
'GroupBy': [
{'Type': 'DIMENSION', 'Key': 'SERVICE'},
{'Type': 'DIMENSION', 'Key': 'RESOURCE_ID'},
],
'Metrics': ['UnblendedCost', 'UsageQuantity'],
'Filter': {'Not': {'Dimensions': {'Key': 'RECORD_TYPE', 'Values': ['Credit', 'Refund']}}},
}
if paginator_token:
kwargs['NextPageToken'] = paginator_token
resp = ce.get_cost_and_usage_with_resources(**kwargs)
rows = []
for result_by_time in resp['ResultsByTime']:
for group in result_by_time['Groups']:
service = group['Keys'][0]
resource = group['Keys'][1] if len(group['Keys']) > 1 else None
cost = Decimal(group['Metrics']['UnblendedCost']['Amount'])
usage = Decimal(group['Metrics']['UsageQuantity']['Amount'])
usage_unit = group['Metrics']['UsageQuantity']['Unit']
if cost == 0:
continue
tags = _fetch_resource_tags(resource) if resource else {}
rows.append((
'aws', billing_date, service, resource,
json.dumps(tags), usage, usage_unit, cost  # keep Decimal precision; the driver adapts Decimal to NUMERIC
))
if rows:
db.executemany("""
INSERT INTO CloudCostLineItem
(cloud_provider, billing_date, service, resource_id, tags,
usage_amount, usage_unit, unblended_cost)
VALUES (%s,%s,%s,%s,%s,%s,%s,%s)
ON CONFLICT DO NOTHING
""", rows)
paginator_token = resp.get('NextPageToken')
if not paginator_token:
break
def _fetch_resource_tags(resource_id: str) -> dict:
"""
Fetch resource tags from AWS Resource Groups Tagging API.
Tags like {"team":"payments","env":"prod"} drive allocation.
Cache results in Redis (1-hour TTL) — tag calls are rate-limited.
"""
import redis
r = redis.Redis(decode_responses=True)
cache_key = f"tags:{resource_id}"
cached = r.get(cache_key)
if cached:
return json.loads(cached)
tagging = boto3.client('resourcegroupstaggingapi', region_name='us-east-1')
try:
resp = tagging.get_resources(ResourceARNList=[resource_id])
tags = {t['Key']: t['Value'] for t in resp['ResourceTagMappingList'][0].get('Tags', [])} if resp['ResourceTagMappingList'] else {}
except Exception:
tags = {}
r.setex(cache_key, 3600, json.dumps(tags))
return tags
Allocation Engine
def allocate_costs(billing_date: datetime.date):
"""
    Allocation passes:
    1. Tagged resources → direct allocation to the team in the tag
    2. Shared resources (no team tag) → proportional to each team's share of tagged spend
    3. Fallback (not shown): days with zero tagged spend → equal split across a
       team roster, or held for manual allocation
"""
# Pass 1: direct tag-based allocation
tagged_rows = db.fetchall("""
SELECT line_item_id, service, tags->>'team' AS team,
tags->>'product' AS product, unblended_cost
FROM CloudCostLineItem
WHERE billing_date=%s AND tags->>'team' IS NOT NULL
""", (billing_date,))
for row in tagged_rows:
db.execute("""
INSERT INTO CostAllocation
(billing_date, team, product, service, allocated_cost, allocation_method, source_line_items)
VALUES (%s, %s, %s, %s, %s, 'tag', %s)
""", (billing_date, row['team'], row['product'], row['service'],
row['unblended_cost'], [row['line_item_id']]))
# Pass 2: untagged shared costs → split by each team's proportion of tagged spend
total_tagged = db.fetchone(
"SELECT COALESCE(SUM(allocated_cost),0) AS total FROM CostAllocation WHERE billing_date=%s",
(billing_date,)
)['total']
untagged_rows = db.fetchall("""
SELECT line_item_id, service, unblended_cost
FROM CloudCostLineItem
WHERE billing_date=%s AND tags->>'team' IS NULL
""", (billing_date,))
if untagged_rows and total_tagged > 0:
team_shares = db.fetchall("""
SELECT team, SUM(allocated_cost)/(%s) AS share
FROM CostAllocation WHERE billing_date=%s GROUP BY team
""", (total_tagged, billing_date))
for row in untagged_rows:
for ts in team_shares:
db.execute("""
INSERT INTO CostAllocation
(billing_date, team, service, allocated_cost, allocation_method, source_line_items)
VALUES (%s, %s, %s, %s, 'proportion', %s)
""", (billing_date, ts['team'], row['service'],
float(row['unblended_cost']) * float(ts['share']),
[row['line_item_id']]))
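The share arithmetic in pass 2 is worth isolating, because the key invariant (the per-team pieces sum back to the original untagged cost) is easy to assert on a pure function. A sketch under the same proportional-by-tagged-spend rule; the function name is illustrative:

```python
def proportional_split(untagged_cost: float, tagged_by_team: dict) -> dict:
    """Split one untagged cost across teams in proportion to their tagged spend."""
    total = sum(tagged_by_team.values())
    if total <= 0:
        return {}  # nothing to proportion against; falls through to the manual pass
    return {team: untagged_cost * spend / total
            for team, spend in tagged_by_team.items()}
```

For example, `proportional_split(100.0, {'payments': 800.0, 'identity': 200.0})` assigns $80 to payments and $20 to identity, the same 80/20 split the SQL pass produces.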
Anomaly Detection
def detect_anomalies(billing_date: datetime.date, lookback_days: int = 28, z_threshold: float = 3.0):
"""
For each (team, service) pair, compute z-score vs. trailing 28-day window.
|z| ≥ 3.0 → anomaly (spike or drop). Fires an alert to Slack.
"""
import numpy as np
    combos = db.fetchall("""
        SELECT DISTINCT team, service FROM CostAllocation WHERE billing_date=%s
    """, (billing_date,))
    for combo in combos:
        team, service = combo['team'], combo['service']
        history = db.fetchall("""
            SELECT billing_date, SUM(allocated_cost) AS daily_cost
            FROM CostAllocation
            WHERE team=%s AND service=%s
              AND billing_date >= %s AND billing_date < %s
            GROUP BY billing_date ORDER BY billing_date
        """, (team, service,
              billing_date - datetime.timedelta(lookback_days), billing_date))
        if len(history) < 7:
            continue  # new (team, service) pair: not enough baseline
        costs = [float(r['daily_cost']) for r in history]
        mean, std = np.mean(costs), np.std(costs)
        if std == 0:
            continue  # flat history — skip rather than divide by zero
        today_cost = db.fetchone("""
            SELECT COALESCE(SUM(allocated_cost),0) AS daily_cost
            FROM CostAllocation
            WHERE billing_date=%s AND team=%s AND service=%s
        """, (billing_date, team, service))['daily_cost']
        z = (float(today_cost) - mean) / std
        if abs(z) >= z_threshold:
            db.execute("""
                INSERT INTO CostAnomaly
                  (billing_date, team, service, expected_cost, actual_cost, z_score)
                VALUES (%s,%s,%s,%s,%s,%s)
            """, (billing_date, team, service, round(mean, 4), today_cost, round(z, 2)))
            _send_anomaly_alert(team, service, mean, float(today_cost), z)
def _send_anomaly_alert(team, service, expected, actual, z_score):
direction = "spike" if z_score > 0 else "drop"
    pct_change = abs(actual - expected) / max(expected, 0.01) * 100  # guard zero baseline
    message = (
        f":warning: Cost anomaly detected — {team} / {service}\n"
        f"Expected: ${expected:.2f} | Actual: ${actual:.2f} | {pct_change:.0f}% {direction} (z={z_score:.1f})"
    )
)
slack_client.post_message(channel=f"#cost-alerts-{team}", text=message)
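The scoring logic inside `detect_anomalies` reduces to a small pure function, shown here with the same guards (minimum history, zero variance). This sketch uses the stdlib `statistics` module rather than numpy; `pstdev` matches numpy's default population `np.std`:

```python
import statistics

def cost_z_score(history: list, today_cost: float, min_history: int = 7):
    """Z-score of today's cost vs. a trailing window of daily costs.

    Returns None when the pair is not scorable (too little history,
    or a perfectly flat baseline where any change looks infinite).
    """
    if len(history) < min_history:
        return None  # new service: not enough baseline
    mean = statistics.fmean(history)
    std = statistics.pstdev(history)  # population std, like np.std
    if std == 0:
        return None  # flat history
    return (today_cost - mean) / std
```

Keeping this pure makes the 7-day and zero-variance edge cases trivially testable, independent of the database.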
Budget Alert Check
def check_budget_alerts():
"""Run daily after allocation. Alert when a team's month-to-date spend crosses alert_pct."""
today = datetime.date.today()
month_start = today.replace(day=1)
    import calendar
    budgets = db.fetchall("""
        SELECT b.*, COALESCE(SUM(a.allocated_cost),0) AS mtd_spend
        FROM CostBudget b
        LEFT JOIN CostAllocation a
          ON a.team = b.team
         AND a.billing_date >= b.period_start
         AND a.billing_date <= %s
        WHERE b.period_start = %s AND b.alerted_at IS NULL
        GROUP BY b.budget_id
    """, (today, month_start))
    for budget in budgets:
        pct_used = float(budget['mtd_spend']) / float(budget['budget_usd']) * 100
        if pct_used >= budget['alert_pct']:
            days_left = calendar.monthrange(today.year, today.month)[1] - today.day + 1
            slack_client.post_message(
                channel=f"#cost-alerts-{budget['team']}",
                text=(
                    f":money_with_wings: Budget alert — {budget['team']}\n"
                    f"${budget['mtd_spend']:.0f} of ${budget['budget_usd']:.0f} "
                    f"({pct_used:.0f}%) used with {days_left} days left in the month"
                )
            )
            db.execute(
                "UPDATE CostBudget SET alerted_at=NOW() WHERE budget_id=%s",
                (budget['budget_id'],)
            )
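A natural extension of the threshold check is a linear run-rate projection: scale month-to-date spend to a full month and warn when the projection alone would exceed the budget, even before alert_pct is crossed. A sketch (this helper is an illustrative addition, not part of the job above):

```python
import calendar
import datetime

def projected_month_spend(mtd_spend: float, today: datetime.date) -> float:
    """Linear projection: month-to-date spend scaled to the full month."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    return mtd_spend / today.day * days_in_month
```

If `projected_month_spend(mtd, today)` exceeds `budget_usd`, the team is on track to blow the budget at its current run rate regardless of where alert_pct sits.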
Key Design Decisions
- Tag-first allocation: teams that tag their resources properly get exact attribution. The proportion pass for untagged resources is deliberately visible — engineers can see “15% of your allocation comes from untagged shared resources” and are incentivized to add tags. Report untagged fraction in dashboards to drive tagging compliance.
- Partition by billing_date: cost data grows ~30 rows per resource per day. Monthly partitions let old data be archived to cold storage (S3 via partition DETACH) without touching the live table. Anomaly detection uses the last 28 days — always within the current + prior partition.
- Z-score over fixed thresholds: a fixed threshold ($50/day anomaly) misses a service that normally costs $5,000/day spiking to $7,500 (50% jump, z=1.5 — not anomalous) while alerting on a $0 → $50 jump in a new service. Z-score correctly normalizes by historical variance. Require at least 7 days of history to avoid false positives for new services.
- 24-hour billing lag: AWS Cost Explorer finalizes daily costs ~24 hours after the billing day ends. Ingest billing_date T at the end of day T+1. Never ingest the current day’s partial costs — they skew anomaly baselines.
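The z-score bullet's numbers can be made concrete. With an assumed daily standard deviation of about $1,667 for the mature service and $5 for the small one (both values are assumptions for illustration), the 50% jump scores below the 3.0 threshold while the small service's jump scores far above it:

```python
def z_score(actual: float, mean: float, std: float) -> float:
    """Z-score of a single day against its trailing baseline."""
    return (actual - mean) / std

# Mature service: $5,000/day baseline, assumed std of ~$1,667.
big_jump = z_score(7500, 5000, 1666.67)   # ≈ 1.5 — below the 3.0 threshold
# Small new service: assumed $10/day baseline with std of $5.
small_jump = z_score(50, 10, 5)           # 8.0 — well above the threshold
```

A fixed dollar threshold would invert both outcomes, which is exactly the failure mode the bullet describes.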
Frequently Asked Questions
Why partition CloudCostLineItem by billing_date instead of a single table?
AWS billing data grows at approximately 30 line items per resource per day. A company with 5,000 EC2 instances and hundreds of S3 buckets can accumulate 500,000+ rows per day — 180M rows per year. A single unpartitioned table with 2 years of history (360M rows) makes anomaly detection queries (which need 28-day windows) extremely slow even with indexes. With monthly partitions, each partition covers one calendar month, so the 28-day anomaly lookback always spans the current and previous partitions — at most 2 partitions touched per query. Old partitions (12+ months) are DETACHed and moved to S3 as Parquet for Athena queries. Partition DETACH is O(1) — it removes the partition from the parent table without touching the data. The active table stays lean; historical data remains queryable via Athena.
How does the proportional pass work for untagged shared costs like NAT gateways or load balancers?
Shared infrastructure (NAT gateways, load balancers, VPC endpoints, data transfer) serves all teams but belongs to no single team tag. Without allocation, it lands as "untagged" overhead that disappears from team dashboards. The proportional pass distributes untagged costs based on each team's share of total tagged spend: if the payments team spent $800 and the identity team spent $200 (total tagged = $1,000), the payments team gets 80% of untagged costs and identity gets 20%. This is a "you use more infrastructure, you pay more shared overhead" model. A fairer alternative for specific shared resources: tag the load balancer with team=shared and split it proportionally by request count per team (this requires application-level metrics). The simple proportional-by-spend model is accurate enough for most cost reviews and requires no additional instrumentation.
How do you handle cost spikes that are legitimate, not anomalies?
Z-score anomaly detection fires on any statistically unusual day — including legitimate spikes like "we run a batch ML training job on the 15th of every month" or "Black Friday traffic tripled our database costs." Two approaches reduce the noise. (1) Acknowledged anomalies: engineers mark resolved anomalies as acknowledged=TRUE in CostAnomaly, and a weekly anomaly report shows only unacknowledged rows. After acknowledging the monthly ML training job, it won't alert again for that signature. (2) Expected events: maintain a CostExpectedEvent table (date, service, reason, expected_multiplier). Before inserting an anomaly, check whether today appears in expected events for this service; if so, skip the z-score check. Teams register planned batch jobs, migrations, or load tests in advance. This shifts anomaly detection from reactive (alert on anything unusual) to contextual (alert on unexplained unusual spend).
How do you allocate costs to individual customers for SaaS unit economics?
Per-customer cost attribution adds a customer dimension to allocation. Tag customer-specific infrastructure (EC2 instances or RDS clusters) with customer=acme-corp where dedicated resources exist. For shared infrastructure serving multiple customers, use usage-based splitting: API calls, storage bytes, or compute minutes per customer (from application metrics) as the allocation key. The schema already carries a customer_id column in CostAllocation, so the per-customer roll-up is: SELECT c.customer_name, SUM(ca.allocated_cost) AS monthly_infra_cost FROM CostAllocation ca JOIN Customer c USING (customer_id) WHERE ca.billing_date >= $month_start GROUP BY c.customer_name ORDER BY monthly_infra_cost DESC. Compare with customer MRR: if ACME pays $10K/month but costs $8K to serve, gross margin is 20% — well below target. This per-customer P&L drives pricing decisions and identifies customers to upsell to dedicated tiers.
How do you forecast next month's cloud spend to set accurate budgets?
Start with a simple linear forecast: compute average daily spend over the last 90 days per team and service, then multiply by the days in the next month. SELECT team, AVG(daily_cost)*30 AS forecast_30d FROM (SELECT team, billing_date, SUM(allocated_cost) AS daily_cost FROM CostAllocation WHERE billing_date > NOW()-INTERVAL '90 days' GROUP BY team, billing_date) sub GROUP BY team. Adjust for known growth: if active users are growing 10%/month and compute scales linearly with users, multiply the forecast by 1.1. Set the budget alert threshold at 90% of the 90-day forecast plus a 10% buffer; this automatically adjusts as spend grows. For capital planning, use the 90-day trend slope to project 12-month spend, and alert finance when the slope implies exceeding the annual cloud budget by more than 15% — early enough to either optimize or renegotiate contracts.
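The forecast described above (average daily spend over a trailing window, scaled to the target month, times a growth multiplier) can be written as a pure function; the name and signature are illustrative:

```python
import calendar

def forecast_month_spend(daily_costs: list, year: int, month: int,
                         growth_multiplier: float = 1.0) -> float:
    """Forecast a calendar month's spend from trailing daily costs.

    daily_costs: per-day allocated spend for one team (e.g. last 90 days).
    growth_multiplier: e.g. 1.1 if usage is growing 10%/month.
    """
    avg_daily = sum(daily_costs) / len(daily_costs)
    days = calendar.monthrange(year, month)[1]  # correct length for short/long months
    return avg_daily * days * growth_multiplier
```

Using the actual day count of the target month, rather than a flat 30, avoids a systematic ~3% error in 31-day months and ~7% in February.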