LLM cost anomaly detection
LLM cost anomalies often stay invisible until the invoice arrives because spend can rise even when request volume does not. A deployment may add retrieved context, change a router threshold, reduce cache hits, trigger retries, or move traffic to a more expensive model. Traditional traffic alerts miss all of those changes.
A useful anomaly detector watches cost drivers, not just total cost. Total daily spend is a lagging indicator. The earlier signals are tokens per request, model mix, fallback rate, cache hit rate, batch share, and spend per product surface.
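As a concrete illustration of why spend can move while volume stays flat, here is a minimal sketch of per-request cost attribution. The model names, prices, discount, and field names are assumptions for illustration, not any provider's actual pricing or API.

```python
# Hypothetical per-token prices (USD per 1M tokens); real prices vary by provider.
PRICES = {
    "small-model":    {"input": 0.15, "output": 0.60},
    "frontier-model": {"input": 3.00, "output": 15.00},
}

def request_cost(model: str, input_tokens: int, output_tokens: int,
                 cached_input_tokens: int = 0, cache_discount: float = 0.5) -> float:
    """Cost of one request; cached input tokens are billed at a discounted rate."""
    p = PRICES[model]
    billable_input = input_tokens - cached_input_tokens
    return (
        billable_input * p["input"]
        + cached_input_tokens * p["input"] * cache_discount
        + output_tokens * p["output"]
    ) / 1_000_000

# Same request count, very different bill: more retrieved context, a pricier
# model, or a lost cache hit each moves cost without touching volume.
print(request_cost("small-model", 2_000, 300, cached_input_tokens=1_500))
print(request_cost("frontier-model", 6_000, 300, cached_input_tokens=0))
```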
Signals to monitor
- Hourly spend per endpoint, team, and customer.
- Input and output tokens per request at p50, p90, and p99.
- Retry rate, fallback rate, and provider error rate.
- Cache hit rate and cache-read token share.
- Model mix changes after deploys or prompt releases.
- Batchable work accidentally running synchronously (see the rollup sketch after this list).
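A minimal sketch of an hourly rollup over the signals above, assuming a hypothetical per-request log; the record fields (`endpoint`, `hour`, `input_tokens`, `cache_hit`, and so on) are illustrative, not a standard schema.

```python
from collections import defaultdict

def percentile(values, p):
    """Nearest-rank percentile over a non-empty list of numbers."""
    s = sorted(values)
    return s[min(len(s) - 1, int(p / 100 * len(s)))]

def hourly_rollup(requests):
    """Aggregate per-request records into per-(endpoint, hour) driver metrics.
    Assumed record fields: endpoint, hour, input_tokens, output_tokens,
    cache_hit, fallback, retried, batchable, ran_in_batch."""
    groups = defaultdict(list)
    for r in requests:
        groups[(r["endpoint"], r["hour"])].append(r)

    metrics = {}
    for key, rs in groups.items():
        n = len(rs)
        metrics[key] = {
            "requests": n,
            "input_tokens_p50": percentile([r["input_tokens"] for r in rs], 50),
            "input_tokens_p90": percentile([r["input_tokens"] for r in rs], 90),
            "input_tokens_p99": percentile([r["input_tokens"] for r in rs], 99),
            "output_tokens_p90": percentile([r["output_tokens"] for r in rs], 90),
            "cache_hit_rate": sum(r["cache_hit"] for r in rs) / n,
            "fallback_rate": sum(r["fallback"] for r in rs) / n,
            "retry_rate": sum(r["retried"] for r in rs) / n,
            "sync_batchable_share": sum(
                1 for r in rs if r["batchable"] and not r["ran_in_batch"]) / n,
        }
    return metrics
```

Each metric maps directly to one bullet above, so a jump in any single value points at a specific driver rather than an undifferentiated spend increase.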
Baselines
Use multiple baselines. A support bot has daily and weekly cycles. A nightly enrichment job has a batch window. An eval suite spikes around releases. A single global threshold will either miss real anomalies or wake people up constantly.
The best baseline is per owner and workload class. Alert when a feature deviates from its own normal pattern, not when the entire company crosses a generic spend limit.
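One way to implement a per-owner baseline, sketched under the assumption that each (owner, workload class) series keeps a few weeks of hourly values: compare the current hour to the median and MAD of the same hour-of-week, and alert only on a large robust deviation. The threshold and window are placeholders to tune.

```python
from statistics import median

def robust_anomaly(history, current, threshold=4.0):
    """history: past values from the same hour-of-week for one
    (owner, workload_class) series; current: this hour's value.
    Flags when the deviation exceeds `threshold` robust z-scores."""
    if len(history) < 4:
        return False  # too little history to call anything anomalous
    med = median(history)
    mad = median([abs(x - med) for x in history]) or 1e-9  # avoid division by zero
    robust_z = 0.6745 * (current - med) / mad  # 0.6745 makes MAD comparable to a stddev
    return robust_z > threshold  # one-sided: only cost increases should page anyone
```

Because the comparison is against the series' own hour-of-week history, a support bot's weekday peak or a nightly batch window does not trip the detector, while the same absolute jump on a normally quiet feature does.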
Actionable alerts
An alert should explain the driver: “p90 input tokens doubled on endpoint X after deploy Y,” “cache hit rate dropped from 74% to 18%,” or “fallback to frontier model increased by 31%.” If the alert cannot name a likely owner and driver, it will be ignored.
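A sketch of what "name the driver" can look like in practice; the function, owner lookup, and message format are assumptions, and the sample values in the comment are invented for illustration.

```python
def describe_driver(metric, baseline, current, endpoint, owner, deploy=None):
    """Build an alert that names the driver, the magnitude, and a likely owner."""
    change = (current - baseline) / baseline if baseline else float("inf")
    msg = (f"{metric} on {endpoint} moved from {baseline:.2f} to {current:.2f} "
           f"({change:+.0%})")
    if deploy:
        msg += f" after deploy {deploy}"
    return {"owner": owner, "message": msg}

# Hypothetical output:
# {"owner": "search-team",
#  "message": "input_tokens_p90 on /answer moved from 2100.00 to 4300.00 (+105%) after deploy abc123"}
```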
Response loop
Attach deploy context, route-policy diffs, and the top affected customers or workloads. Then offer the next move: roll back a prompt, lower retrieval count, restore a router threshold, disable a fallback, or move a job to batch. Anomaly detection is only valuable when it shortens the time from invoice surprise to engineering action.
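One way to shorten that loop is a small driver-to-playbook mapping attached to each alert. The driver keys echo the rollup metrics sketched earlier, and the suggested actions mirror the moves listed above; all names here are illustrative.

```python
# Illustrative mapping from detected driver to a suggested first move.
PLAYBOOK = {
    "input_tokens_p90_up":     "Roll back the prompt or lower the retrieval count.",
    "cache_hit_rate_down":     "Check cache keys and TTLs changed by the last deploy.",
    "fallback_rate_up":        "Restore the router threshold or disable the fallback.",
    "retry_rate_up":           "Inspect provider errors and cap retries.",
    "sync_batchable_share_up": "Move the work back to the batch window.",
}

def enrich_alert(alert, driver, recent_deploys, top_customers):
    """Attach deploy context, affected customers, and a suggested next move."""
    alert["recent_deploys"] = recent_deploys
    alert["top_affected_customers"] = top_customers
    alert["suggested_action"] = PLAYBOOK.get(driver, "Investigate manually.")
    return alert
```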