Why does my LLM bill spike?

Updated 31 May 2026 · first published 31 May 2026

A spike in an LLM bill almost never comes from a price change. It comes from a change in how your code uses the models - a change that often ships in an ordinary pull request, with no line item that says "this will double our token cost." Because providers bill per token and per model, a single deploy can move the bill sharply while request volume stays flat.

Here are the causes that show up again and again, roughly in order of how often they catch teams out, and how to catch each one before the invoice does.

1. Model-mix drift

Someone changes a default, raises a router threshold, or "temporarily" points a feature at the flagship model to fix a quality complaint - and forgets to point it back. Because the gap between model tiers can be more than 10× per token, shifting even a fraction of traffic up the ladder moves the bill hard. This is the single most common spike.

Catch it: track the share of requests per model over time, not just total spend. A change in the mix is the earliest signal, and it shows up before the monthly invoice.
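A minimal sketch of the mix metric, assuming request logs can be reduced to (day, model) pairs. The model names, the log shape, and the 10-point alert threshold are illustrative assumptions, not values from any provider.

```python
from collections import Counter, defaultdict

def model_share_by_day(requests):
    """Return {day: {model: share of that day's requests}}."""
    counts = defaultdict(Counter)
    for day, model in requests:
        counts[day][model] += 1
    return {
        day: {model: n / sum(c.values()) for model, n in c.items()}
        for day, c in counts.items()
    }

def mix_shift(shares, before, after, model):
    """Change in one model's request share between two days, in share points."""
    return shares[after].get(model, 0.0) - shares[before].get(model, 0.0)

# Hypothetical log: volume is flat at 100 requests a day,
# but "flagship" moves from 10% to 40% of traffic.
log = (
    [("mon", "small")] * 90 + [("mon", "flagship")] * 10
    + [("tue", "small")] * 60 + [("tue", "flagship")] * 40
)
shares = model_share_by_day(log)
shift = mix_shift(shares, "mon", "tue", "flagship")
alert = shift > 0.10  # assumed threshold: flag a move of more than 10 points
```

Total request count is identical on both days here, which is why a volume dashboard alone would show nothing.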

2. A broken cache

If you rely on prompt caching, your discount depends on a stable prompt prefix. Change a system prompt, reorder context, or inject a timestamp into the cached region, and the cache-hit rate collapses. Those tokens that used to be billed at a steep discount are now billed at full price - same traffic, much higher cost.

Catch it: monitor cache-read token share as a first-class metric. A sudden drop in cache hits is a spike in disguise.
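One way to compute that metric, as a sketch. The field names `cached_input_tokens` and `uncached_input_tokens` are hypothetical; providers report cached reads under their own names, so map them from whatever your usage records contain. The 50% relative-drop threshold is also an assumption.

```python
def cache_read_share(usage_records):
    """Fraction of input tokens served from cache across a batch of requests."""
    cached = sum(r["cached_input_tokens"] for r in usage_records)
    total = cached + sum(r["uncached_input_tokens"] for r in usage_records)
    return cached / total if total else 0.0

def cache_collapsed(previous_share, current_share, max_relative_drop=0.5):
    """True when the share fell by more than max_relative_drop of its old value."""
    if previous_share == 0:
        return False
    return (previous_share - current_share) / previous_share > max_relative_drop

# Hypothetical data: the same traffic before and after a deploy
# that put a timestamp inside the cached prefix.
before = [{"cached_input_tokens": 8000, "uncached_input_tokens": 2000}] * 100
after = [{"cached_input_tokens": 500, "uncached_input_tokens": 9500}] * 100

share_before = cache_read_share(before)
share_after = cache_read_share(after)
alarm = cache_collapsed(share_before, share_after)
```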

3. Reasoning tokens you can't see

Adopting a reasoning model, or raising its reasoning budget, can multiply the cost of a request without changing the visible output at all. The reasoning tokens are billed but typically not shown in the response, so the spike has no obvious cause in the UI.

Catch it: separate reasoning tokens from visible output in your cost data, and treat a reasoning-model rollout as a budgeted change, not a drop-in swap.
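A rough sketch of that split. It assumes the usage record gives a total `output_tokens` count that already includes `reasoning_tokens`, and that both are billed at the same output rate; those field names and the $10 per million price are assumptions to check against your provider's usage data.

```python
def split_output_cost(output_tokens, reasoning_tokens, price_per_million_output):
    """Split output cost into reasoning and visible parts.

    Assumes output_tokens already includes reasoning_tokens and that both
    are billed at the same output rate.
    """
    visible_tokens = output_tokens - reasoning_tokens
    rate = price_per_million_output / 1_000_000
    return {
        "reasoning_cost": reasoning_tokens * rate,
        "visible_cost": visible_tokens * rate,
        "reasoning_share": reasoning_tokens / output_tokens if output_tokens else 0.0,
    }

# Hypothetical request: 300 visible tokens and 2,700 hidden reasoning tokens.
cost = split_output_cost(
    output_tokens=3000,
    reasoning_tokens=2700,
    price_per_million_output=10.0,  # assumed price, dollars per 1M output tokens
)
```

In this made-up request, nine tenths of the output cost comes from tokens the user never sees.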

4. Retrieval depth crept up

In a RAG system, the number and size of retrieved documents go straight into the prompt as input tokens. A relevance fix that bumps top-k from 5 to 20, or a chunk-size change, raises input cost on every single request. It looks like a quality improvement in code review; it looks like a spike on the invoice.

Catch it: track average input tokens per request per feature. Retrieval changes show up there immediately.
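A sketch of the per-feature metric, with the top-k change from the example modeled directly. The feature names, the 400-token chunk size, and the 500-token fixed prompt are hypothetical numbers chosen only to make the arithmetic visible.

```python
from collections import defaultdict

def avg_input_tokens_by_feature(records):
    """records: iterable of (feature, input_tokens). Returns {feature: average}."""
    totals = defaultdict(lambda: [0, 0])  # feature -> [token sum, request count]
    for feature, input_tokens in records:
        totals[feature][0] += input_tokens
        totals[feature][1] += 1
    return {f: tokens / n for f, (tokens, n) in totals.items()}

def prompt_tokens(top_k, chunk_tokens=400, fixed_tokens=500):
    """Assumed prompt size: a fixed instruction block plus top_k retrieved chunks."""
    return fixed_tokens + top_k * chunk_tokens

# Hypothetical traffic before and after top-k goes from 5 to 20.
before = avg_input_tokens_by_feature(
    [("search_answers", prompt_tokens(5))] * 50 + [("summarizer", 1200)] * 50
)
after = avg_input_tokens_by_feature(
    [("search_answers", prompt_tokens(20))] * 50 + [("summarizer", 1200)] * 50
)
growth = after["search_answers"] / before["search_answers"]
```

Breaking the average out per feature matters: blended across both features, the jump would look smaller and harder to place.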

5. Retries, timeouts, and fallback chains

Aggressive retry logic and fallback chains are good for reliability and dangerous for cost. If a primary model starts timing out and every failed call retries on a more expensive fallback, you can pay two or three times for each user request during an incident - and the cost outlives the incident if no one notices.

Catch it: count retries and fallback firings, and alert when they rise. A reliability problem and a cost problem are often the same event.
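One possible shape for those counters, assuming every model call is logged as an attempt tagged with the user request it belongs to. The `Attempt` record, its fields, and the 1.2 alert threshold are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class Attempt:
    request_id: str   # the user request this model call served
    model: str
    is_retry: bool
    is_fallback: bool

def amplification(attempts):
    """Model calls per user request; 1.0 means no retries or fallbacks fired."""
    request_ids = {a.request_id for a in attempts}
    return len(attempts) / len(request_ids) if request_ids else 0.0

def fallback_rate(attempts):
    """Share of user requests that reached the fallback at least once."""
    request_ids = {a.request_id for a in attempts}
    fell_back = {a.request_id for a in attempts if a.is_fallback}
    return len(fell_back) / len(request_ids) if request_ids else 0.0

# Hypothetical incident: 4 of 10 requests time out, retry once, then fall back.
attempts = []
for i in range(10):
    rid = f"req-{i}"
    attempts.append(Attempt(rid, "primary", is_retry=False, is_fallback=False))
    if i < 4:
        attempts.append(Attempt(rid, "primary", is_retry=True, is_fallback=False))
        attempts.append(Attempt(rid, "fallback", is_retry=False, is_fallback=True))

amp = amplification(attempts)
fb_rate = fallback_rate(attempts)
alert = amp > 1.2  # assumed threshold
```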

6. A runaway agent loop

Agentic workloads call models in loops. A planning bug, a tool that returns noise, or a missing stop condition can send an agent into dozens of model calls for a task that should have taken three. Multiply that across production traffic and the bill moves fast. This is the spike most likely to be a genuine bug rather than a tuning choice.

Catch it: cap and monitor calls-per-task, and set spend guardrails on agent workloads specifically.
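A minimal sketch of a calls-per-task cap. `CallBudget`, `run_agent`, and the `step` callable are hypothetical names; in a real agent, `step` would wrap one model call and return whether the task is finished.

```python
class CallBudgetExceeded(RuntimeError):
    """Raised when a task tries to make more model calls than its cap allows."""

class CallBudget:
    def __init__(self, max_calls):
        self.max_calls = max_calls
        self.calls = 0

    def charge(self):
        """Record one model call, or refuse it if the cap is already reached."""
        if self.calls >= self.max_calls:
            raise CallBudgetExceeded(f"task exceeded {self.max_calls} model calls")
        self.calls += 1

def run_agent(step, budget):
    """Call step() until it returns True, charging the budget before each call."""
    while True:
        budget.charge()
        if step():
            return budget.calls

# A well-behaved task that finishes on its third call.
steps = iter([False, False, True])
normal_calls = run_agent(lambda: next(steps), CallBudget(max_calls=5))

# A task with a missing stop condition: the cap ends it after five calls.
runaway_budget = CallBudget(max_calls=5)
try:
    run_agent(lambda: False, runaway_budget)
    stopped_by_cap = False
except CallBudgetExceeded:
    stopped_by_cap = True
```

Logging `budget.calls` per task also gives the calls-per-task distribution to monitor, so the cap and the metric come from the same counter.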

7. Output length and format changes

Raising a max-tokens limit, switching to a more verbose response format, or prompting for "detailed" answers increases output tokens, which providers typically price well above input tokens. A prompt tweak that improves answer quality can quietly raise per-request cost across the board.

Catch it: track average output tokens per request and treat prompt changes as cost changes.
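To see why a prompt change is a cost change, here is a small worked example. The prices are hypothetical, picked so that output costs five times as much per token as input; the token counts are also made up.

```python
def request_cost(input_tokens, output_tokens, input_price, output_price):
    """Cost in dollars for one request; prices are dollars per 1M tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

# Assumed prices, dollars per 1M tokens.
INPUT_PRICE = 2.0
OUTPUT_PRICE = 10.0

# Same 1,000-token prompt; the "detailed" variant produces 800 output tokens
# instead of 200.
concise = request_cost(1000, 200, INPUT_PRICE, OUTPUT_PRICE)
detailed = request_cost(1000, 800, INPUT_PRICE, OUTPUT_PRICE)
increase = detailed / concise
```

Under these assumptions the input side is unchanged, yet the cost per request is 2.5 times higher.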

8. Real growth (the good kind)

Sometimes the bill is up because usage is up - more customers, more requests, a successful launch. That is not a problem to fix; it is a number to attribute. The goal is to be able to say "spend rose 30% because feature X served 40% more requests at a stable cost per request," rather than guessing. A spike you can explain is a spike you can defend.
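That sentence can be backed by a simple decomposition: the change in spend splits into a volume effect (more requests at the old cost per request) and a rate effect (a different cost per request at the new volume). The sketch below uses made-up figures in which feature X is three quarters of the original spend, so its 40% request growth yields a 30% rise in the total.

```python
def decompose(before, after):
    """before/after are (requests, spend). Returns (volume_effect, rate_effect).

    The two effects sum to the total change in spend.
    """
    r0, s0 = before
    r1, s1 = after
    c0, c1 = s0 / r0, s1 / r1          # cost per request in each period
    volume_effect = (r1 - r0) * c0     # extra requests at the old cost per request
    rate_effect = r1 * (c1 - c0)       # change in cost per request at the new volume
    return volume_effect, rate_effect

# Hypothetical periods: (requests, spend in dollars).
features = {
    "feature_x": ((300_000, 3000.0), (420_000, 4200.0)),
    "everything_else": ((100_000, 1000.0), (100_000, 1000.0)),
}

total_before = sum(b[1] for b, _ in features.values())
total_after = sum(a[1] for _, a in features.values())
total_growth = (total_after - total_before) / total_before

volume_effect, rate_effect = decompose(*features["feature_x"])
```

A rate effect near zero is what "stable cost per request" means in practice; a large one points back to the earlier causes.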

The common thread: you cannot diagnose what you cannot attribute

Every cause above is invisible on a dashboard that shows only total spend. Each becomes obvious the moment spend is broken out by feature, model, and token type, with cache-hit rate tracked alongside. That breakdown is cost attribution, and it is the difference between "the bill went up and we don't know why" and "the checkout summarizer shifted to the flagship model on Tuesday - here's the fix." Pair attribution with anomaly detection and most spikes are caught within hours, not at month-end.
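As a sketch of what that breakdown can look like in code, assuming each cost record already carries `feature`, `model`, and `cost` fields (hypothetical names, as are the figures):

```python
from collections import defaultdict

def spend_by(records, *keys):
    """Sum cost grouped by the given record fields, e.g. ("feature", "model")."""
    out = defaultdict(float)
    for r in records:
        out[tuple(r[k] for k in keys)] += r["cost"]
    return dict(out)

def biggest_mover(before, after):
    """Group whose spend rose the most between two periods."""
    keys = set(before) | set(after)
    return max(keys, key=lambda k: after.get(k, 0.0) - before.get(k, 0.0))

# Hypothetical daily cost records.
monday = [
    {"feature": "checkout_summarizer", "model": "small", "cost": 20.0},
    {"feature": "search", "model": "small", "cost": 30.0},
]
tuesday = [
    {"feature": "checkout_summarizer", "model": "flagship", "cost": 200.0},
    {"feature": "search", "model": "small", "cost": 30.0},
]

mon = spend_by(monday, "feature", "model")
tue = spend_by(tuesday, "feature", "model")
culprit = biggest_mover(mon, tue)
```

Grouping by total alone would only report that spend went from 50 to 230; grouping by feature and model names the line that moved.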

How do LLM providers charge? - the pricing mechanics behind every spike.
What is LLM cost attribution? - how to trace a spike to its cause.
LLM cost anomaly detection - catching spikes before the invoice.
How to budget for AI spend - setting envelopes so spikes are bounded.

