FinOps for LLMs: a practical framework

FinOps for LLMs is the practice of making AI spend visible, attributable, optimizable, and accountable. It borrows from cloud FinOps, but the cost drivers are different. A cloud bill is shaped by instances, storage, transfer, and commitments. An LLM bill is shaped by input tokens, output tokens, cache writes, cache reads, retries, model mix, context length, batch eligibility, and quality requirements.

The first mistake teams make is treating LLM spend as one provider-level number. That hides the real levers. A support summarization endpoint, an agent workflow, a nightly enrichment job, and an eval suite may all use the same model, but they have different latency needs, quality thresholds, cacheability, and ownership. FinOps starts when those workloads are separated.

1. Visibility

Visibility means normalizing provider invoices and gateway logs into a common record. Each request should carry enough metadata to answer: who owns it, which feature generated it, which model handled it, how many tokens of each class were billed, whether the call retried, and whether the output produced user value.
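
As a sketch, one such normalized record might look like the following. The field names are illustrative, not a standard schema; adapt them to whatever your gateway already logs.

```python
from dataclasses import dataclass

@dataclass
class LLMRequestRecord:
    """One normalized row per LLM call; field names are illustrative."""
    request_id: str
    team_owner: str            # who owns it
    feature: str               # which product surface generated it
    provider: str
    model: str                 # which model handled it
    input_tokens: int          # billed token classes, split out
    output_tokens: int
    cache_write_tokens: int
    cache_read_tokens: int
    is_batch: bool             # routed through a discounted batch lane
    retry_count: int           # did the call retry, and how often
    produced_user_value: bool  # did the output complete a task
```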

Good visibility separates input, output, cache-write, cache-read, batch, and retry spend. Without this split, teams accidentally count cache-read discounts as savings, miss retry loops, and compare providers on list price rather than cost per successful task.
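
A minimal cost-split sketch over the record above makes the point concrete. The per-million-token rates and the batch discount here are placeholders, not any provider's actual pricing.

```python
# Illustrative per-million-token rates; placeholders, not real pricing.
RATES = {
    "input": 3.00,
    "output": 15.00,
    "cache_write": 3.75,
    "cache_read": 0.30,
}

def cost_breakdown(rec: LLMRequestRecord) -> dict[str, float]:
    """Split one request's spend by token class so cache reads, retries,
    and batch discounts stay visible instead of netting into one number."""
    def per_m(tokens: int, key: str) -> float:
        return tokens / 1_000_000 * RATES[key]

    split = {
        "input": per_m(rec.input_tokens, "input"),
        "output": per_m(rec.output_tokens, "output"),
        "cache_write": per_m(rec.cache_write_tokens, "cache_write"),
        "cache_read": per_m(rec.cache_read_tokens, "cache_read"),
    }
    if rec.is_batch:  # assume a flat 50% batch discount for this sketch
        split = {k: v * 0.5 for k, v in split.items()}
    return split
```

Retries should land as their own records and be summed per logical task; folding them into the original call is how retry loops go unnoticed.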

2. Attribution

Attribution maps spend to teams, products, customers, and workloads. The point is not blame; the point is decision quality. A finance team cannot govern spend if every line item says “OpenAI.” An engineering leader cannot optimize a product surface if spend is grouped only by provider account.

Minimum useful attribution usually includes environment, endpoint, team owner, customer or tenant where allowed, model, provider, and workload class. Mature programs add business metrics such as resolved ticket, completed analysis, generated report, or successful agent task.
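
One way to enforce that minimum set is to treat the dimensions as required tags on every record and flag the gaps early. The keys below are examples, not a standard.

```python
REQUIRED_TAGS = [
    "environment",     # prod / staging / dev
    "endpoint",
    "team_owner",
    "tenant_id",       # customer or tenant, where allowed
    "model",
    "provider",
    "workload_class",  # e.g. interactive, agent, batch, eval
]

def missing_tags(record_tags: dict[str, str]) -> list[str]:
    """Return the attribution tags a record lacks, so spend can be
    flagged before it becomes unattributable in the monthly view."""
    return [t for t in REQUIRED_TAGS if not record_tags.get(t)]
```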

3. Optimization

Optimization should follow attribution. The common levers are model routing, semantic caching, prompt caching, context compaction, batch routing, provider arbitrage, and fallback policy tuning. Each lever changes the bill differently. Routing changes model mix. Caching changes input-token and latency economics. Batch moves work into discounted asynchronous lanes. Compaction reduces repeated context.
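
As a sketch of the routing lever only: pick the cheapest model tier that clears a per-workload quality floor, and the cheapest lane the workload can tolerate. The tier names and floors are hypothetical; in practice the floors should come from offline evals, not guesswork.

```python
# Hypothetical model tiers, cheapest first; names are placeholders.
TIERS = ["small-model", "mid-model", "frontier-model"]

# Minimum tier index each workload class needs to stay above its
# quality floor. Unknown classes default to the most capable tier.
QUALITY_FLOOR = {"summarize": 0, "agent_step": 1, "legal_review": 2}

def route(workload_class: str, batch_eligible: bool) -> tuple[str, str]:
    """Choose the cheapest model tier meeting the workload's quality
    floor, and the cheapest lane (batch vs. interactive) it tolerates."""
    tier = TIERS[QUALITY_FLOOR.get(workload_class, len(TIERS) - 1)]
    lane = "batch" if batch_eligible else "interactive"
    return tier, lane
```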

The test is not “did token count go down?” The test is “did cost per successful task go down without breaking quality, latency, or reliability?”
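
The metric itself is simple arithmetic; the guardrails around it are what make it honest. A minimal version:

```python
def cost_per_successful_task(total_spend: float, successful_tasks: int) -> float:
    """Spend divided by outcomes, not by tokens. A retry loop raises
    this number even when per-call token counts look flat."""
    if successful_tasks == 0:
        return float("inf")  # all spend, no value: worst possible reading
    return total_spend / successful_tasks

# Example: cutting tokens 30% while task success falls from 95% to 70%
# makes this metric worse (0.70x spend / 0.70x successes = unchanged
# spend per success, vs. the old spend / 0.95x successes), which is
# exactly the failure a raw token-count metric hides.
```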

4. Accountability

Showback should come before chargeback. First give teams a credible monthly view of their spend and drivers. Once the numbers are trusted, chargeback can make sense for mature organizations. The goal is a repeatable operating rhythm: review the largest drivers, agree on optimization candidates, ship changes behind flags, and reconcile against the next invoice.
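
A minimal monthly showback rollup over the normalized records, reusing the record and cost-split sketches from the visibility section, could look like this:

```python
from collections import defaultdict

def monthly_showback(records: list[LLMRequestRecord]) -> dict[str, dict[str, float]]:
    """Roll normalized records up into a per-team view of spend drivers:
    total cost plus the splits teams can actually act on."""
    view: dict[str, dict[str, float]] = defaultdict(lambda: defaultdict(float))
    for rec in records:
        split = cost_breakdown(rec)  # from the visibility section
        team = view[rec.team_owner]
        team["total"] += sum(split.values())
        for token_class, cost in split.items():
            team[token_class] += cost
        if rec.retry_count:  # surface retry-driven spend as its own line
            team["retry_attributed"] += sum(split.values())
    return {team: dict(drivers) for team, drivers in view.items()}
```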
