Token budget implementation

Updated 21 May 2026 · first published 21 May 2026

A token budget that lives in a spreadsheet but nowhere in the call path is a wish, not a control. Every FinOps program eventually runs into this wall: finance has approved a per-feature envelope, the owning team has agreed to it on paper, and a month later the invoice has overshot by some embarrassing multiple because nothing in the runtime ever read the number. This page is about the unglamorous engineering work that turns a budget line into an enforcement point - where to put it, what it should do when a feature blows through it, and how to report the burn back to the owner in time for them to act.

The audience here is the engineer who has been handed a budget by a finance partner and asked to make it stick. The work splits into three concerns: where enforcement lives in the request path, what happens on overrun, and how the owning team sees their position before the month closes.

Why budgets fail at the policy layer

The most common failure mode is a budget defined in a governance document that no service reads. A team writes down "the contract-redline feature gets ten million tokens per month," circulates it for approval, and never instruments the counter that would actually enforce it. Finance reviews invoices after the fact and discovers the overrun two weeks too late. The fix is not a better policy document. The fix is moving the budget number into a runtime store that the call path consults before every model call, so that the policy and the enforcement are the same artifact.

This is also why budgets expressed only in dollars tend to drift. Provider rates change, model mixes change inside a single feature, and the cost-per-call moves underneath the budget. Tokens are the stable unit because the application controls them directly. Dollar conversion belongs in the reporting layer, not in the enforcement layer.
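A minimal sketch of that split, assuming a hypothetical rate card in dollars per million tokens. The rates, the Usage type, and the function names are illustrative placeholders, not any provider's pricing or API.

```python
from dataclasses import dataclass

# Hypothetical rate card: dollars per million tokens, by token class.
# The numbers are placeholders, not real provider prices.
RATE_CARD_USD_PER_MTOK = {"input": 2.0, "output": 8.0}


@dataclass
class Usage:
    input: int
    output: int


def within_budget(used_tokens: int, budget_tokens: int) -> bool:
    """Enforcement compares tokens to tokens; no prices are involved."""
    return used_tokens <= budget_tokens


def report_cost_usd(usage: Usage, rates: dict) -> float:
    """Reporting applies whichever rate card is current at report time."""
    return (usage.input * rates["input"] + usage.output * rates["output"]) / 1_000_000


usage = Usage(input=3_000_000, output=500_000)
cost = report_cost_usd(usage, RATE_CARD_USD_PER_MTOK)
```

If the rate card changes, only report_cost_usd sees the difference; the enforcement check keeps comparing the same token counts.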

Three places to enforce

Enforcement has to attach to something the application cannot easily bypass. There are essentially three options, and most production stacks end up using two of them in combination.

The gateway

A gateway sits between every caller and every provider. Each request carries a feature tag in a header. The gateway reads the tag, looks up the remaining budget for that feature, and decides to forward, throttle, or deny. This is the strongest pattern because there is one place to read, one place to log, and one place to change policy. The cost is that the gateway becomes critical path: if it is slow, every feature is slow, and if it is down, every feature is down. Teams that adopt this pattern invest in gateway redundancy and a local fail-open mode for non-critical traffic.
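The forward, throttle, or deny decision can be sketched in a few lines. This is an illustration under stated assumptions: the header name, the in-memory budget table, and the throttle threshold are all hypothetical, and a real gateway would read remaining budget from a shared store.

```python
from enum import Enum


class Decision(Enum):
    FORWARD = "forward"
    THROTTLE = "throttle"
    DENY = "deny"


# Hypothetical header name and in-memory table standing in for a budget store.
FEATURE_HEADER = "x-feature-tag"
REMAINING_TOKENS = {"contract-redline": 50_000, "summarize": 0, "search": 5_000_000}
THROTTLE_BELOW = 100_000  # assumed threshold for slowing traffic down


def decide(headers: dict, estimated_tokens: int) -> Decision:
    """Read the feature tag, look up remaining budget, pick an action."""
    tag = headers.get(FEATURE_HEADER)
    if tag is None or tag not in REMAINING_TOKENS:
        return Decision.DENY  # untagged or unknown traffic is not forwarded
    remaining = REMAINING_TOKENS[tag]
    if estimated_tokens > remaining:
        return Decision.DENY
    if remaining < THROTTLE_BELOW:
        return Decision.THROTTLE
    return Decision.FORWARD
```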

The SDK wrapper

An SDK wrapper is a thin library in each language the company uses that wraps the provider client. Feature owners import it, pass their feature tag at construction, and call it the way they would call the underlying client. The wrapper records spend against the budget store and refuses calls that would exceed the remaining envelope. This pattern is easier to roll out incrementally - one team at a time, one codebase at a time - but it has no central choke point, so a team that bypasses the wrapper to call the provider directly bypasses the budget. Most organizations start here and add a gateway once enough services exist that per-codebase rollout stops scaling.
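A wrapper of this kind might look like the sketch below. BudgetedClient, BudgetExceeded, and fake_model are hypothetical names; call_model stands in for whatever provider client the team actually uses, and the four-characters-per-token estimate is a rough assumption rather than a real tokenizer.

```python
from typing import Callable, Tuple


class BudgetExceeded(Exception):
    """Raised when a call would exceed the feature's remaining envelope."""


class BudgetedClient:
    """Wraps a provider call; `call_model` stands in for the real client."""

    def __init__(self, feature_tag: str, budget_tokens: int,
                 call_model: Callable[[str, int], Tuple[str, int]]):
        self.feature_tag = feature_tag
        self.remaining = budget_tokens
        self._call_model = call_model

    def complete(self, prompt: str, max_tokens: int) -> str:
        # Crude worst-case estimate: about 4 characters per token (assumption).
        worst_case = len(prompt) // 4 + max_tokens
        if worst_case > self.remaining:
            raise BudgetExceeded(self.feature_tag)
        text, tokens_used = self._call_model(prompt, max_tokens)
        self.remaining -= tokens_used  # record actual spend, not the estimate
        return text


def fake_model(prompt: str, max_tokens: int) -> Tuple[str, int]:
    # Stand-in provider that reports 30 tokens billed per call.
    return "ok", 30


client = BudgetedClient("contract-redline", budget_tokens=100, call_model=fake_model)
```

The remaining counter here is local to the process for brevity; in a shared deployment it would live in the budget store the paragraph above describes.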

Prompt-level limits

The third option is not really budget enforcement. Input truncation, conversation-window trimming, and a hard max_tokens on every call are cost controls per request, not per feature. They keep a single runaway prompt from eating a quarter of the day's budget in one shot, but they do not stop a feature from making a million well-shaped calls. Treat prompt-level limits as a last-resort safety net underneath the gateway or wrapper, not as the budget itself.
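For completeness, a sketch of what the per-request safety net amounts to. The ceilings and the character-based token estimate are assumptions chosen for illustration; a production version would use the provider's tokenizer.

```python
MAX_OUTPUT_TOKENS = 1024  # assumed hard ceiling applied to every call
MAX_INPUT_TOKENS = 8000   # assumed input ceiling


def estimate_tokens(text: str) -> int:
    # Rough heuristic of about 4 characters per token (assumption).
    return max(1, len(text) // 4)


def trim_window(messages: list, max_input_tokens: int = MAX_INPUT_TOKENS) -> list:
    """Keep the most recent messages that fit under the input ceiling."""
    kept = []
    total = 0
    for message in reversed(messages):
        cost = estimate_tokens(message)
        if total + cost > max_input_tokens:
            break
        kept.append(message)
        total += cost
    return list(reversed(kept))


def clamp_max_tokens(requested: int) -> int:
    """Never let a caller ask for more output than the ceiling."""
    return min(requested, MAX_OUTPUT_TOKENS)
```

Nothing in this code knows which feature is calling or how much that feature has already spent, which is exactly why it cannot serve as the budget.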

What overrun should actually do

The most consequential design decision is what happens at the moment of overrun, because the answer has to differ by traffic class. A single global behavior - "block everything past the cap" - guarantees a customer-visible outage the first time a feature misforecasts. A single soft behavior - "always allow, just notify" - guarantees the budget is decorative.

Three patterns cover almost every case. A soft cap allows the call, increments an overrun counter, notifies the owning team, and marks the feature as over-budget on the dashboard. A hard cap blocks the call and returns a structured error the caller can handle gracefully. A degraded cap routes the call to a cheaper model, a smaller context window, or a cached response, and tags the response so the owner can see how much traffic was served degraded. The right default depends on the traffic class: customer-facing interactive traffic usually starts soft and graduates to degraded; internal tooling can go straight to hard; eval and batch traffic should be hard by default because there is no user waiting.
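The three behaviors and their per-class defaults can be written down as a small dispatch. This is a sketch, not a prescribed design: the Outcome fields, the traffic-class keys, the error code, and the model names are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Cap(Enum):
    SOFT = "soft"
    HARD = "hard"
    DEGRADED = "degraded"


# Starting defaults by traffic class, following the text above.
DEFAULT_CAP = {
    "customer_interactive": Cap.SOFT,
    "internal_tooling": Cap.HARD,
    "batch": Cap.HARD,
}


@dataclass
class Outcome:
    allowed: bool
    model: Optional[str]
    degraded: bool = False
    notify_owner: bool = False
    error: Optional[str] = None


def on_overrun(cap: Cap, primary_model: str, cheaper_model: str) -> Outcome:
    """Decide what a call does once the feature is past its budget."""
    if cap is Cap.SOFT:
        # Allow the call, but tell the owner.
        return Outcome(allowed=True, model=primary_model, notify_owner=True)
    if cap is Cap.HARD:
        # Block with a structured error the caller can handle.
        return Outcome(allowed=False, model=None, error="budget_exceeded")
    # Degraded: serve from the cheaper model and tag the response.
    return Outcome(allowed=True, model=cheaper_model, degraded=True)
```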

A degraded path is the most operator-friendly of the three because it preserves the user experience while sending a clear signal upstream. It also requires the most engineering, since the application has to be able to actually run on the cheaper model without falling over. Teams that build this once tend to keep it as the default response to overrun forever after.

Reporting the budget back to the owner

The dashboard finance reads is not the dashboard the owning team needs. The owner is trying to decide whether to ship a fix, throttle a customer, or ask for more budget, and the data has to answer those three questions directly.

Daily burn rate is the first number. Days-remaining-at-current-pace is the second, because that is the unit the owner thinks in. The top requesters within the feature - by tenant, by endpoint, by code path - surface whether the overrun is broad-based or driven by one heavy user. The last overrun event with timestamp and action taken closes the loop so the owner can verify that the policy fired when it was supposed to. Finance sees these numbers rolled up into the chargeback report; the owner sees them per-feature, on the surface they already use to debug their service.
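The first three numbers are simple arithmetic once the ledger exists. A minimal sketch, assuming spend is available as a running total plus a list of (tenant, tokens) pairs; the function names and input shapes are illustrative.

```python
from collections import Counter


def daily_burn(tokens_used: int, days_elapsed: int) -> float:
    """Average tokens consumed per day so far in the period."""
    return tokens_used / days_elapsed


def days_remaining(budget: int, tokens_used: int, days_elapsed: int) -> float:
    """Days until the envelope is exhausted at the current pace."""
    burn = daily_burn(tokens_used, days_elapsed)
    if burn == 0:
        return float("inf")
    return max(budget - tokens_used, 0) / burn


def top_requesters(calls: list, n: int = 3) -> list:
    """`calls` is a list of (tenant, tokens) pairs; returns the n largest totals."""
    totals = Counter()
    for tenant, tokens in calls:
        totals[tenant] += tokens
    return totals.most_common(n)
```

For example, a feature with a ten-million-token envelope that has used four million tokens in ten days is burning 400,000 tokens a day and has fifteen days left at that pace.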

What goes on the per-feature dashboard

Budget set - the token envelope agreed for the period, with the date it was last changed.
Spend MTD - tokens consumed so far, broken out by input, output, reasoning, and cache.
Projected EOM - straight-line extrapolation from current burn, plus a 7-day rolling estimate.
Last overrun event - when the cap last fired, which policy applied, and how many calls were affected.
Current cap policy - soft, hard, or degraded, and which conditions trigger which.
Top requesters - the tenants or endpoints driving the largest share of the burn.

Anything else is decoration. The owner reads these six tiles, decides, and moves on.
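The Projected EOM tile carries two estimates, and a sketch makes the difference concrete. The exact formula behind the rolling estimate is an assumption here: month-to-date spend plus the trailing seven-day average applied to the days left.

```python
def straight_line_eom(tokens_mtd: int, day_of_month: int, days_in_month: int) -> float:
    """Extrapolate the month-to-date average across the whole month."""
    return tokens_mtd / day_of_month * days_in_month


def rolling_eom(tokens_mtd: int, daily_tokens: list, day_of_month: int,
                days_in_month: int, window: int = 7) -> float:
    """Month-to-date spend plus the recent daily rate for the remaining days."""
    recent = daily_tokens[-window:]
    recent_rate = sum(recent) / len(recent)
    return tokens_mtd + recent_rate * (days_in_month - day_of_month)


# Ten days in: three quiet days followed by seven heavy ones.
daily = [100_000] * 3 + [500_000] * 7
mtd = sum(daily)
```

When burn has accelerated recently, as in the sample data, the rolling estimate comes out higher than the straight-line one, which is the early warning the owner needs.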

Common implementation traps

Most token-budget rollouts trip on the same handful of issues, and they all come from counting the wrong thing or counting it at the wrong moment.

Using cost as the budget unit. Provider rates change. A budget in dollars looks stable but quietly redefines itself every time a price card updates. Budget in tokens and convert to cost in the reporting layer.

Forgetting reasoning tokens. Reasoning models emit tokens the user never sees but that show up on the invoice. A counter that only watches input and visible output undercounts spend for any feature that uses a reasoning model, often by a large margin.

Forgetting cache-write tokens. Cache reads are cheap, but the write that primed the cache was not. A budget that only tracks input and output misses the entire cache-population cost and looks artificially healthy.

Forgetting tool-call output. When an agent emits a long tool call as part of its response, that output is billable. Counters that only watch the final user-visible answer miss it.

Not accounting for retries. A naive wrapper counts the successful call. A correct wrapper counts every attempt, because every attempt was billed. Retries on transient errors are a quiet source of overrun.
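A sketch of attempt-level counting, under the assumption that a failed attempt can report how many tokens it was billed for. TransientError and its tokens_billed field are hypothetical; how a real client exposes usage on a failed call varies.

```python
class TransientError(Exception):
    """Hypothetical retryable failure that still carries billed tokens."""

    def __init__(self, tokens_billed: int):
        super().__init__("transient failure")
        self.tokens_billed = tokens_billed


def call_with_retries(call, ledger: dict, feature: str, max_attempts: int = 3):
    """Run `call`, charging the ledger for every attempt, not just the last one.

    `call` returns (text, tokens_billed) or raises TransientError.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            text, tokens = call()
        except TransientError as err:
            # The failed attempt still counts against the budget.
            ledger[feature] = ledger.get(feature, 0) + err.tokens_billed
            if attempt == max_attempts:
                raise
            continue
        ledger[feature] = ledger.get(feature, 0) + tokens
        return text
```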

Race conditions in the counter. Two concurrent calls that each read the remaining budget, decide they fit, and then both commit will both succeed and push the counter past zero. The fix is either an atomic decrement against a central store or a small slack buffer that absorbs the race without changing the user-visible policy.
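The atomic variant is easy to illustrate in a single process. In the sketch below an in-process lock stands in for whatever atomic decrement the central store provides; AtomicBudget and try_reserve are hypothetical names.

```python
import threading


class AtomicBudget:
    """Check and decrement happen in one critical section."""

    def __init__(self, remaining: int):
        self._remaining = remaining
        self._lock = threading.Lock()

    def try_reserve(self, tokens: int) -> bool:
        # No other caller can read the balance between the check and the write.
        with self._lock:
            if tokens > self._remaining:
                return False
            self._remaining -= tokens
            return True

    @property
    def remaining(self) -> int:
        with self._lock:
            return self._remaining


budget = AtomicBudget(remaining=1000)
results = []
threads = [threading.Thread(target=lambda: results.append(budget.try_reserve(300)))
           for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Eight concurrent callers each ask for 300 tokens against a balance of 1000; exactly three reservations succeed and the balance never goes below zero.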

First week of implementation

The shortest path from "we agreed on a budget" to "the budget is enforced" is roughly a week of focused work, and the order matters.

Day one: tag the requests. Every model call carries a feature tag at the call site. No tag, no call in production.
Day two: instrument the counter. Tokens flow into a per-feature ledger keyed by tag and period. Input, output, reasoning, and cache-write tokens are each tracked separately.
Day three: surface the dashboard. Owners see their burn, projected EOM, and top requesters on a page they can bookmark.
Day four: set the soft cap. The cap exists, the policy fires, the owner gets notified, no traffic is blocked. The point is to validate that the data is right before any user is affected.
Day five and after: switch to hard or degraded once owners trust the data. Do not graduate features whose owners still dispute the burn number. Trust in the counter is the prerequisite for any blocking policy.
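The first two days of that plan reduce to a small data structure. A minimal in-memory sketch, assuming a period key such as "2026-05"; the Ledger class and its method names are illustrative, and a real ledger would persist to the budget store.

```python
from collections import defaultdict

TOKEN_CLASSES = ("input", "output", "reasoning", "cache_write")


class Ledger:
    """Per-feature ledger keyed by (feature tag, period)."""

    def __init__(self):
        # (feature_tag, period) -> token class -> count
        self._rows = defaultdict(lambda: dict.fromkeys(TOKEN_CLASSES, 0))

    def record(self, feature_tag: str, period: str, **tokens: int) -> None:
        if not feature_tag:
            raise ValueError("untagged call")  # no tag, no call
        row = self._rows[(feature_tag, period)]
        for token_class, count in tokens.items():
            if token_class not in row:
                raise KeyError(token_class)
            row[token_class] += count

    def breakdown(self, feature_tag: str, period: str) -> dict:
        """Return a copy of the per-class counts for the dashboard."""
        return dict(self._rows[(feature_tag, period)])

    def total(self, feature_tag: str, period: str) -> int:
        return sum(self._rows[(feature_tag, period)].values())


ledger = Ledger()
ledger.record("contract-redline", "2026-05",
              input=1000, output=200, reasoning=300, cache_write=50)
```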

The discipline is to ship enforcement and ownership at the same pace. A hard cap on a feature whose owner does not yet believe the counter is a guaranteed escalation. A soft cap on a feature whose owner reads the dashboard every morning is the foundation of a working FinOps practice.
