LLM cost anomaly detection
LLM cost anomalies often stay invisible until the invoice arrives because spend can rise even when request volume does not. A deployment may add retrieved context, change a router threshold, reduce cache hits, trigger retries, or move traffic to a more expensive model. Traditional traffic alerts miss all of those changes.
A useful anomaly detector watches cost drivers, not just total cost. Total daily spend is a lagging indicator. The earlier signals are tokens per request, model mix, fallback rate, cache hit rate, batch share, and spend per product surface.
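As a concrete illustration of why spend can move while volume stays flat, here is a minimal sketch of per-request cost attribution. The model names, prices, discount, and field names are assumptions for illustration, not any provider's actual pricing or API.

```python
# Hypothetical per-token prices (USD per 1M tokens); real prices vary by provider.
PRICES = {
    "small-model":    {"input": 0.15, "output": 0.60},
    "frontier-model": {"input": 3.00, "output": 15.00},
}

def request_cost(model: str, input_tokens: int, output_tokens: int,
                 cached_input_tokens: int = 0, cache_discount: float = 0.5) -> float:
    """Cost of one request; cached input tokens are billed at a discounted rate."""
    p = PRICES[model]
    billable_input = input_tokens - cached_input_tokens
    return (
        billable_input * p["input"]
        + cached_input_tokens * p["input"] * cache_discount
        + output_tokens * p["output"]
    ) / 1_000_000

# Same request count, very different bill: more retrieved context, a pricier
# model, or a lost cache hit each moves cost without touching volume.
print(request_cost("small-model", 2_000, 300, cached_input_tokens=1_500))
print(request_cost("frontier-model", 6_000, 300, cached_input_tokens=0))
```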
Signals to monitor
- Hourly spend per endpoint, team, and customer.
- Input and output tokens per request at p50, p90, and p99.
- Retry rate, fallback rate, and provider error rate.
- Cache hit rate and cache-read token share.
- Model mix changes after deploys or prompt releases.
- Batchable work accidentally running synchronously (see the rollup sketch after this list).
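A minimal sketch of an hourly rollup over the signals above, assuming a hypothetical per-request log; the record fields (`endpoint`, `hour`, `input_tokens`, `cache_hit`, and so on) are illustrative, not a standard schema.

```python
from collections import defaultdict

def percentile(values, p):
    """Nearest-rank percentile over a non-empty list of numbers."""
    s = sorted(values)
    return s[min(len(s) - 1, int(p / 100 * len(s)))]

def hourly_rollup(requests):
    """Aggregate per-request records into per-(endpoint, hour) driver metrics.
    Assumed record fields: endpoint, hour, input_tokens, output_tokens,
    cache_hit, fallback, retried, batchable, ran_in_batch."""
    groups = defaultdict(list)
    for r in requests:
        groups[(r["endpoint"], r["hour"])].append(r)

    metrics = {}
    for key, rs in groups.items():
        n = len(rs)
        metrics[key] = {
            "requests": n,
            "input_tokens_p50": percentile([r["input_tokens"] for r in rs], 50),
            "input_tokens_p90": percentile([r["input_tokens"] for r in rs], 90),
            "input_tokens_p99": percentile([r["input_tokens"] for r in rs], 99),
            "output_tokens_p90": percentile([r["output_tokens"] for r in rs], 90),
            "cache_hit_rate": sum(r["cache_hit"] for r in rs) / n,
            "fallback_rate": sum(r["fallback"] for r in rs) / n,
            "retry_rate": sum(r["retried"] for r in rs) / n,
            "sync_batchable_share": sum(
                1 for r in rs if r["batchable"] and not r["ran_in_batch"]) / n,
        }
    return metrics
```

Each metric maps directly to one bullet above, so a jump in any single value points at a specific driver rather than an undifferentiated spend increase.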
Baselines
Use multiple baselines. A support bot has daily and weekly cycles. A nightly enrichment job has a batch window. An eval suite spikes around releases. A single global threshold will either miss real anomalies or wake people up constantly.
The best baseline is per owner and workload class. Alert when a feature deviates from its own normal pattern, not when the entire company crosses a generic spend limit.
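One way to implement a per-owner baseline, sketched under the assumption that each (owner, workload class) series keeps a few weeks of hourly values: compare the current hour to the median and MAD of the same hour-of-week, and alert only on a large robust deviation. The threshold and window are placeholders to tune.

```python
from statistics import median

def robust_anomaly(history, current, threshold=4.0):
    """history: past values from the same hour-of-week for one
    (owner, workload_class) series; current: this hour's value.
    Flags when the deviation exceeds `threshold` robust z-scores."""
    if len(history) < 4:
        return False  # too little history to call anything anomalous
    med = median(history)
    mad = median([abs(x - med) for x in history]) or 1e-9  # avoid division by zero
    robust_z = 0.6745 * (current - med) / mad  # 0.6745 makes MAD comparable to a stddev
    return robust_z > threshold  # one-sided: only cost increases should page anyone
```

Because the comparison is against the series' own hour-of-week history, a support bot's weekday peak or a nightly batch window does not trip the detector, while the same absolute jump on a normally quiet feature does.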
Actionable alerts
An alert should explain the driver: “p90 input tokens doubled on endpoint X after deploy Y,” “cache hit rate dropped from 74% to 18%,” or “fallback to frontier model increased by 31%.” If the alert cannot name a likely owner and driver, it will be ignored.
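A sketch of what "name the driver" can look like in practice; the function, owner lookup, and message format are assumptions, and the sample values in the comment are invented for illustration.

```python
def describe_driver(metric, baseline, current, endpoint, owner, deploy=None):
    """Build an alert that names the driver, the magnitude, and a likely owner."""
    change = (current - baseline) / baseline if baseline else float("inf")
    msg = (f"{metric} on {endpoint} moved from {baseline:.2f} to {current:.2f} "
           f"({change:+.0%})")
    if deploy:
        msg += f" after deploy {deploy}"
    return {"owner": owner, "message": msg}

# Hypothetical output:
# {"owner": "search-team",
#  "message": "input_tokens_p90 on /answer moved from 2100.00 to 4300.00 (+105%) after deploy abc123"}
```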
Response loop
Attach deploy context, route-policy diffs, and the top affected customers or workloads. Then offer the next move: roll back a prompt, lower retrieval count, restore a router threshold, disable a fallback, or move a job to batch. Anomaly detection is only valuable when it shortens the time from invoice surprise to engineering action.
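One way to shorten that loop is a small driver-to-playbook mapping attached to each alert. The driver keys echo the rollup metrics sketched earlier, and the suggested actions mirror the moves listed above; all names here are illustrative.

```python
# Illustrative mapping from detected driver to a suggested first move.
PLAYBOOK = {
    "input_tokens_p90_up":     "Roll back the prompt or lower the retrieval count.",
    "cache_hit_rate_down":     "Check cache keys and TTLs changed by the last deploy.",
    "fallback_rate_up":        "Restore the router threshold or disable the fallback.",
    "retry_rate_up":           "Inspect provider errors and cap retries.",
    "sync_batchable_share_up": "Move the work back to the batch window.",
}

def enrich_alert(alert, driver, recent_deploys, top_customers):
    """Attach deploy context, affected customers, and a suggested next move."""
    alert["recent_deploys"] = recent_deploys
    alert["top_affected_customers"] = top_customers
    alert["suggested_action"] = PLAYBOOK.get(driver, "Investigate manually.")
    return alert
```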