LLM cost anomaly detection

LLM cost anomalies are often invisible until the invoice arrives, because spend can rise even when request volume does not. A deployment may add retrieved context, change a router threshold, reduce cache hits, trigger retries, or move traffic to a more expensive model. Traditional traffic alerts miss those changes.

A useful anomaly detector watches cost drivers, not just total cost. Total daily spend is a lagging indicator. The earlier signals are tokens per request, model mix, fallback rate, cache hit rate, batch share, and spend per product surface.

Signals to monitor

Track each driver as its own time series rather than folding everything into total spend: tokens per request (input and output separately), model mix, fallback rate, cache hit rate, batch share, and spend per product surface and per owner.
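These driver signals can be rolled up from per-request logs. A minimal sketch, assuming a hypothetical log record with illustrative field names (`input_tokens`, `cache_hit`, `fell_back`, `batched`), not any real provider schema:

```python
from dataclasses import dataclass

@dataclass
class Request:
    # Hypothetical per-request log record; field names are illustrative.
    model: str
    input_tokens: int
    output_tokens: int
    cache_hit: bool
    fell_back: bool
    batched: bool

def driver_metrics(requests: list[Request]) -> dict:
    """Roll per-request logs up into the driver signals named above."""
    n = len(requests)
    if n == 0:
        return {}
    by_model: dict[str, int] = {}
    for r in requests:
        by_model[r.model] = by_model.get(r.model, 0) + 1
    return {
        "tokens_per_request": sum(r.input_tokens + r.output_tokens for r in requests) / n,
        "model_mix": {m: c / n for m, c in by_model.items()},
        "fallback_rate": sum(r.fell_back for r in requests) / n,
        "cache_hit_rate": sum(r.cache_hit for r in requests) / n,
        "batch_share": sum(r.batched for r in requests) / n,
    }
```

Computing these per endpoint and per owner, not just globally, is what makes the later attribution step possible.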

Baselines

Use multiple baselines. A support bot has daily and weekly cycles. A nightly enrichment job has a batch window. An eval suite spikes around releases. A single global threshold will either miss real anomalies or wake people up constantly.

The best baseline is per owner and workload class. Alert when a feature deviates from its own normal pattern, not when the entire company crosses a generic spend limit.
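One way to encode a per-workload weekly cycle is to compare the current value only against the same hour-of-week slots in that workload's own history, using a robust z-score so past spikes don't inflate the baseline. A sketch under assumed parameters (hourly buckets, a 168-hour week, a threshold of 4):

```python
import statistics

def weekly_baseline_alert(history: list[float], current: float,
                          hour_of_week: int, period: int = 168,
                          threshold: float = 4.0) -> bool:
    """Alert when `current` deviates from the same hour-of-week slots in
    this workload's own history. Uses median/MAD instead of mean/stddev
    so a single past spike does not widen the baseline. The period and
    threshold values here are illustrative assumptions."""
    slots = [v for i, v in enumerate(history) if i % period == hour_of_week]
    if len(slots) < 4:
        return False  # not enough history for this slot yet
    med = statistics.median(slots)
    mad = statistics.median(abs(v - med) for v in slots) or 1e-9
    z = 0.6745 * (current - med) / mad  # 0.6745 scales MAD to approximate sigma
    return z > threshold
```

Running one such detector per owner and workload class is what lets a nightly batch job and a daytime support bot each keep their own notion of normal.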

Actionable alerts

An alert should explain the driver: “p90 input tokens doubled on endpoint X after deploy Y,” “cache hit rate dropped from 74% to 18%,” or “fallback to frontier model increased by 31%.” If the alert cannot name a likely owner and driver, it will be ignored.

Response loop

Attach deploy context, route-policy diffs, and the top affected customers or workloads. Then offer the next move: roll back a prompt, lower retrieval count, restore a router threshold, disable a fallback, or move a job to batch. Anomaly detection is only valuable when it shortens the time from invoice surprise to engineering action.
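The "next move" step can be made concrete as a small playbook keyed by driver. The driver names and remediations below mirror the prose and are illustrative, not a complete runbook:

```python
# Assumed mapping from detected driver to a first remediation step;
# driver keys and suggested actions are illustrative.
PLAYBOOK = {
    "tokens_per_request_up": "roll back the prompt or lower the retrieval count",
    "cache_hit_rate_down": "check cache keys for a prompt or version change",
    "fallback_rate_up": "restore the router threshold or disable the fallback",
    "batch_share_down": "move eligible jobs back to the batch tier",
}

def next_move(driver: str) -> str:
    """Suggest the first remediation for a named cost driver."""
    return PLAYBOOK.get(driver, "page the owning team with the driver attached")
```

Even a default of "page the owner with the driver attached" shortens the loop compared with a bare spend number.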
