How to budget for AI spend

Updated 11 June 2026 · first published 31 May 2026

Budgeting for LLM spend feels harder than budgeting for SaaS because it is. A SaaS line item is a fixed number you negotiate once a year. LLM spend is variable, per-request, and driven by code paths that change every week. But the discipline is not exotic - it is the same envelope-and-forecast loop finance already runs for cloud compute, applied to a different cost driver. This is a starter framework for teams setting their first real AI budget.

Why a flat number doesn't work

The instinct is to set one company-wide AI budget - "we'll spend $40K a month on OpenAI" - and watch the total. It fails for the same reason a single cloud budget fails: when the number is breached, you have no idea which workload caused it, so you have no idea what to cut. A useful AI budget is not one number. It is a set of envelopes, one per workload, each with a named owner.

Step 1: Attribute before you budget

You cannot budget spend you cannot trace. The prerequisite for every step below is cost attribution - the ability to see spend broken out by feature, team, environment, and customer. A budget set on unallocated spend is a guess, and a guess cannot be enforced. If you do nothing else first, get attribution working.
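
As a minimal sketch of what attribution looks like in practice, assume every LLM call is logged with tags and an estimated cost. The record fields (`feature`, `team`, `env`, `cost_usd`) and the helpers `spend_by` and `unallocated_share` are hypothetical names chosen for illustration, not part of any provider's API.

```python
from collections import defaultdict

# Hypothetical usage log: one record per LLM call, tagged at the gateway.
# Field names and costs are illustrative assumptions.
USAGE = [
    {"feature": "search", "team": "growth", "env": "prod", "cost_usd": 0.50},
    {"feature": "search", "team": "growth", "env": "staging", "cost_usd": 0.25},
    {"feature": "summaries", "team": "docs", "env": "prod", "cost_usd": 1.25},
    {"team": "docs", "env": "prod", "cost_usd": 0.75},  # missing feature tag
]


def spend_by(records, tag):
    """Total cost per value of one tag; untagged calls go to 'unallocated'."""
    totals = defaultdict(float)
    for record in records:
        totals[record.get(tag, "unallocated")] += record["cost_usd"]
    return dict(totals)


def unallocated_share(records, tag):
    """Fraction of total spend that cannot be traced along this tag."""
    total = sum(record["cost_usd"] for record in records)
    if total == 0:
        return 0.0
    return spend_by(records, tag).get("unallocated", 0.0) / total


print(spend_by(USAGE, "feature"))
print(f"unallocated by feature: {unallocated_share(USAGE, 'feature'):.0%}")
```

Tracking the unallocated share is one way to tell whether attribution is good enough to budget against: the larger it is, the more of the budget is still a guess.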

Step 2: Forecast from token shapes, not last month's total

To project next quarter's spend, build it up from the cost drivers rather than extrapolating the invoice. For each significant workload, estimate:

Request volume - how many calls per month, and how it tracks with a business driver (active users, transactions, documents processed).
Token shape - average input, output, cache-read, and reasoning tokens per request. This is where the cost actually lives. (See how providers charge.)
Model mix - which model tiers handle the traffic, since the per-token price varies by more than 10× across the ladder.

Multiply those out and you have a forecast that responds to reality: if volume doubles, you know the cost; if a feature moves to a cheaper model, you can see the saving before you ship it. Tie the forecast to a business metric so finance can sanity-check it against the plan.
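
The multiplication can be sketched as follows. Everything here is an assumption made for illustration: the prices are placeholders rather than any provider's real rates, reasoning tokens are assumed to be billed at the output rate, and the same token shape is assumed to apply on every model tier. Check each of these against your provider's pricing page before relying on the result.

```python
from dataclasses import dataclass


@dataclass
class TokenShape:
    """Average tokens per request for one workload."""
    input: float
    output: float
    cache_read: float = 0.0
    reasoning: float = 0.0


@dataclass
class TierPrice:
    """USD per 1M tokens. Placeholder numbers, not real provider prices."""
    input: float
    output: float
    cache_read: float


def cost_per_request(shape, price):
    # Assumption: reasoning tokens are billed at the output rate.
    billable_output = shape.output + shape.reasoning
    return (
        shape.input * price.input
        + billable_output * price.output
        + shape.cache_read * price.cache_read
    ) / 1_000_000


def monthly_forecast(requests, shape, model_mix):
    """model_mix is a list of (traffic_share, TierPrice); shares must sum to 1."""
    if abs(sum(share for share, _ in model_mix) - 1.0) > 1e-9:
        raise ValueError("model mix shares must sum to 1")
    blended = sum(share * cost_per_request(shape, price) for share, price in model_mix)
    return requests * blended


shape = TokenShape(input=2000, output=400, cache_read=1000, reasoning=100)
large = TierPrice(input=2.00, output=8.00, cache_read=0.20)
small = TierPrice(input=0.20, output=0.80, cache_read=0.02)

today = monthly_forecast(1_000_000, shape, [(1.0, large)])
after = monthly_forecast(1_000_000, shape, [(0.25, large), (0.75, small)])
print(f"all traffic on large tier: ${today:,.0f}")
print(f"75% moved to small tier:   ${after:,.0f}")
```

With these placeholder inputs the forecast comes to roughly $8,200 a month on the large tier and roughly $2,665 after moving three quarters of the traffic to the small tier, which is the kind of saving you want to see before shipping the change.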

Step 3: Set an envelope per workload

Give each workload a monthly envelope derived from the forecast plus a deliberate margin for growth. The envelope is not aspirational - it is the number the owning team is accountable to. Express it in dollars, but track it in the underlying drivers (cost per request × volume) so a breach can be diagnosed, not just noticed.
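
One way to make a breach diagnosable is to split the gap between planned and actual spend into a volume effect and a cost-per-request effect. The sketch below assumes a 15% growth margin as a default and uses made-up plan and actual figures; `envelope` and `diagnose` are hypothetical helper names.

```python
def envelope(forecast_usd, growth_margin=0.15):
    """Monthly envelope: forecast plus a deliberate margin (15% is an assumed default)."""
    return forecast_usd * (1 + growth_margin)


def diagnose(plan_volume, plan_cpr, actual_volume, actual_cpr):
    """Split actual-minus-planned spend into its two drivers.

    cpr is cost per request in USD. The two effects add up exactly to
    the difference between actual and planned spend.
    """
    return {
        "planned": plan_volume * plan_cpr,
        "actual": actual_volume * actual_cpr,
        # Extra requests, priced at the planned cost per request.
        "volume_effect": (actual_volume - plan_volume) * plan_cpr,
        # Change in cost per request, applied to the actual volume.
        "rate_effect": (actual_cpr - plan_cpr) * actual_volume,
    }


report = diagnose(
    plan_volume=1_000_000, plan_cpr=0.004,
    actual_volume=1_100_000, actual_cpr=0.005,
)
limit = envelope(report["planned"])
for name, value in report.items():
    print(f"{name:>14}: ${value:,.0f}")
print(f"{'envelope':>14}: ${limit:,.0f}")
```

In this made-up case the rate effect is much larger than the volume effect, which points at a change in token shape or model mix rather than at growth in traffic.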

Step 4: Choose soft and hard thresholds

A budget without a response to a breach is just a chart. Define two thresholds per envelope:

Soft threshold (e.g. 80% of envelope) - alert the owning team and trigger a review. Nothing changes automatically; a human decides.
Hard threshold (e.g. 100%) - the system degrades behaviour rather than letting spend run unbounded: drop to a cheaper model, reduce retrieval depth, move non-urgent work to a batch API, or queue lower-priority traffic.

The point of the hard threshold is not to punish the team that exceeded it. It is to make sure a runaway cost - a broken cache, a spike, an agent loop - is bounded by design instead of discovered at month-end.
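
A minimal sketch of the two thresholds, assuming month-to-date spend and the envelope are already known. The 80% and 100% defaults come from the examples above; the model names and the `choose_model` routing rule are hypothetical, and switching tiers is only one of the degradation paths listed.

```python
from enum import Enum


class BudgetState(Enum):
    OK = "ok"
    SOFT = "soft"  # alert the owning team and trigger a review; a human decides
    HARD = "hard"  # degrade behaviour automatically


def budget_state(spend_usd, envelope_usd, soft=0.80, hard=1.00):
    """Classify month-to-date spend against the envelope."""
    if envelope_usd <= 0:
        raise ValueError("envelope must be positive")
    used = spend_usd / envelope_usd
    if used >= hard:
        return BudgetState.HARD
    if used >= soft:
        return BudgetState.SOFT
    return BudgetState.OK


def choose_model(state, default_model, cheaper_model):
    """One possible degradation: route to a cheaper tier past the hard threshold."""
    return cheaper_model if state is BudgetState.HARD else default_model


for spend in (700, 850, 1200):
    state = budget_state(spend, envelope_usd=1000)
    model = choose_model(state, "large-model", "small-model")  # hypothetical names
    print(f"${spend}: {state.value}, route to {model}")
```

Note that the soft state leaves routing untouched on purpose: it only signals that someone should look.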

Step 5: Reconcile, then adjust

At month-end, compare your internal estimate to the actual provider invoice. Reconciliation does two things: it keeps your forecast honest (a recurring variance usually means a stale price assumption or a misclassified token type), and it feeds the next cycle's envelopes with real numbers. A budget that is never reconciled drifts away from the invoice until no one trusts it.
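
The reconciliation check itself is small. The sketch below assumes a 2% tolerance, which is an arbitrary default rather than a recommendation, and flags a recurring variance when every month misses in the same direction by more than that tolerance. The function names and figures are illustrative.

```python
def reconcile(estimated_usd, invoiced_usd, tolerance=0.02):
    """Compare the internal estimate to the provider invoice for one month."""
    if invoiced_usd <= 0:
        raise ValueError("invoice total must be positive")
    variance = (estimated_usd - invoiced_usd) / invoiced_usd
    return {
        "variance_pct": variance * 100,
        "within_tolerance": abs(variance) <= tolerance,
    }


def recurring_bias(monthly_variances_pct, tolerance_pct=2.0):
    """True if every month misses in the same direction by more than the tolerance."""
    if not monthly_variances_pct:
        return False
    always_over = all(v > tolerance_pct for v in monthly_variances_pct)
    always_under = all(v < -tolerance_pct for v in monthly_variances_pct)
    return always_over or always_under


result = reconcile(estimated_usd=9_500, invoiced_usd=10_000)
print(f"variance: {result['variance_pct']:.1f}%, ok: {result['within_tolerance']}")
print("recurring bias:", recurring_bias([-5.0, -4.2, -6.1]))
```

A consistent underestimate like the one in this example is the pattern that usually points to a stale price assumption or a misclassified token type, rather than to noise.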

Step 6: Give the budget an owner at three levels

Budgets fail organizationally more often than technically. The pattern that holds up has three explicit roles:

A finance owner (FP&A or a FinOps lead) who owns the model, the forecast, and the monthly close.
A platform owner (the engineer who runs the gateway) who owns the tagging that makes attribution reliable and enforces the hard thresholds.
An executive sponsor (CFO or CTO) who chairs the review and resolves disputes when a team wants more envelope.

Smaller teams collapse the first two roles into one FinOps-minded engineer, but the sponsor is non-negotiable - without one, budget breaches become political fights instead of operational decisions.

A realistic first 30 days

Week 1: get attribution working - tag traffic by feature and environment, even crudely.
Week 2: measure current token shapes and model mix per workload; build the bottom-up forecast.
Week 3: set envelopes and soft thresholds; wire up alerts.
Week 4: add hard-threshold degradation on the highest-spend workloads; run a first reconciliation against the latest invoice.

That is enough to turn "we'll find out at month-end" into "we know where we are today, and the worst case is bounded." Everything after that is refinement.

LLM Cost Per User Benchmarks - compare your per-user spending against industry benchmarks by role.
Token budget implementation - turning budgets into runtime enforcement.
Eval cost allocation - handling evals as a separate budget category.
Cost per request as a product KPI - making cost visible to owners.
LLM budget governance - budgets, quotas, and degradation paths in depth.
What is LLM cost attribution? - the prerequisite for any budget.
