Prompt cache attribution

Updated 21 May 2026 · first published 21 May 2026

Prompt caching is usually framed as an optimization problem: keep your prefixes stable, get a discount on input tokens, move on. Inside a FinOps function the harder question shows up later. Once a cache is working, it becomes shared infrastructure that several product surfaces depend on. The savings, the breakage, and the warm-up cost all need to land on somebody's ledger. Attribution is what turns "we enabled caching" into a number the CFO can defend per team, per feature, and per customer.

This page is about how to allocate prompt-cache-affected spend. It assumes you have already turned caching on. The question we care about is who gets credit when the cache pays off, who pays the tax when it does not, and how to report both without picking favorites.

Cache as shared infrastructure

A prompt cache is not a per-feature asset. The same cached prefix can be reused by an internal eval suite, a customer-facing copilot, a nightly batch job, and a support workflow. The provider bills you for the underlying tokens at a discounted cache-read rate, but the bill arrives at the account level. If you stop there, every feature looks equally efficient and no one is accountable for the prefix design choices that produced the savings in the first place.

The FinOps move is to treat the cache the way platform teams treat a Kubernetes cluster, a Snowflake warehouse, or a CDN. Someone owns the shared layer. Consumers get a unit-economics view that includes their share of the discount and their share of the overhead. Without that, cache savings drift into "general efficiency" and disappear from the conversation by the next quarter.

What a clean attribution model needs

Three numbers per consumer, per period:

Cache-read tokens used, separated from ordinary input tokens.
The implied savings at the consumer level, calculated as cache-read tokens times the difference between full input price and cache-read price.
The consumer's contribution to cache warmups, invalidations, and misses against shared prefixes.

The first two are straightforward once your usage records carry a cache-read field. The third is the one most teams skip, and it is the one that decides whether attribution feels fair.
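
As a minimal sketch of the second number, the snippet below multiplies each consumer's cache-read tokens by the gap between the full input price and the cache-read price. The record shape, the field names, and the per-million-token prices are illustrative assumptions, not any provider's actual schema or rates.

```python
from dataclasses import dataclass


@dataclass
class UsageRow:
    consumer: str            # hypothetical owner tag, e.g. a team or feature
    input_tokens: int        # ordinary (uncached) input tokens
    cache_read_tokens: int   # input tokens served from the cache


def implied_savings(rows, input_price_per_mtok, cache_read_price_per_mtok):
    """Return {consumer: implied savings}, in the same currency as the prices."""
    delta = input_price_per_mtok - cache_read_price_per_mtok
    totals = {}
    for row in rows:
        saving = row.cache_read_tokens * delta / 1_000_000
        totals[row.consumer] = totals.get(row.consumer, 0.0) + saving
    return totals


# Assumed prices per million tokens: 3.00 full input, 0.30 cache read.
rows = [
    UsageRow("copilot", input_tokens=500_000, cache_read_tokens=2_000_000),
    UsageRow("evals", input_tokens=100_000, cache_read_tokens=0),
]
savings = implied_savings(rows, 3.00, 0.30)
```

A consumer with no cache-read tokens shows zero savings rather than disappearing from the result, which keeps the report comparable across periods.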

Allocating the savings

There are three reasonable allocation methods. Each is defensible. Pick one and document it before anyone argues about a chargeback.

Direct attribution

Credit each consumer with the cache-read tokens actually billed for its requests. Simple, auditable, and aligned with provider invoices. The downside is that whoever rides on a prefix that was warmed by another team gets a windfall, and the team that paid to keep the prefix stable gets nothing extra.

Warmup-adjusted attribution

Identify which team owns the canonical prefix for each cache key and route a portion of every reader's discount back to the owner. This rewards prefix stewardship and discourages the pattern where one team quietly destabilizes a prefix that other teams depend on. It requires a registry of cache prefixes with named owners.
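
One way to sketch this routing, assuming a simple registry that maps each cache prefix to an owning team and a fixed share of each reader's discount that goes back to the owner. The 20 percent share, the prefix name, and the team names are hypothetical; the right share is a policy decision, not something the data can tell you.

```python
def warmup_adjusted(savings_by_reader_prefix, prefix_owner, owner_share=0.2):
    """Split cache savings between readers and prefix owners.

    savings_by_reader_prefix: {(reader, prefix): savings credited by direct attribution}
    prefix_owner: {prefix: owning team} (the prefix registry)
    owner_share: fraction of a non-owner reader's savings routed to the owner
    """
    credits = {}
    for (reader, prefix), saving in savings_by_reader_prefix.items():
        owner = prefix_owner[prefix]
        # An owner reading its own prefix keeps the full discount.
        routed = 0.0 if reader == owner else saving * owner_share
        credits[reader] = credits.get(reader, 0.0) + saving - routed
        credits[owner] = credits.get(owner, 0.0) + routed
    return credits


registry = {"system-prompt-v3": "platform"}  # hypothetical registry entry
direct = {
    ("copilot", "system-prompt-v3"): 100.0,
    ("evals", "system-prompt-v3"): 50.0,
}
credits = warmup_adjusted(direct, registry)
```

The total credited is unchanged by the split, so the adjusted view still reconciles to the same savings figure as direct attribution.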

Platform-tax attribution

Treat the entire cache as a platform overhead. Charge it back to consumers as a flat percentage of their LLM spend and let the platform team report a single net-savings number for the company. The cleanest accounting, the weakest behavioral incentive. Useful when product teams are too small to act on per-prefix data.

Reporting the cache tax

Even when caching is paying off in aggregate, individual teams pay a tax. Cache-friendly architectures cost engineering hours, slow down certain refactors, and force coordination across services that did not previously coordinate. A FinOps report that only shows the upside will eventually lose credibility when an engineering lead says, "we spent six weeks on prefix stability and saw nothing on our line."

The fix is to report the tax explicitly. For each team, show cache-read savings as a credit, and show three categories of cost alongside it: cache misses on prefixes the team owns, invalidations the team caused on prefixes other teams own, and warmup tokens spent re-priming a cache after a deploy. The net of those numbers is the team's real cache contribution. Teams that show a positive net should be the ones held up as examples. Teams that show a negative net deserve a conversation, not a public shaming, but the number needs to exist.
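
The net figure can be sketched as one credit minus the three cost categories. The field names below are illustrative, and all amounts are assumed to be already converted to currency for the reporting period.

```python
from dataclasses import dataclass


@dataclass
class TeamCacheLedger:
    team: str
    read_savings: float              # credit: savings from cache reads
    owned_prefix_miss_cost: float    # misses on prefixes the team owns
    caused_invalidation_cost: float  # invalidations caused on other teams' prefixes
    warmup_cost: float               # tokens spent re-priming after deploys

    def cache_tax(self) -> float:
        return (
            self.owned_prefix_miss_cost
            + self.caused_invalidation_cost
            + self.warmup_cost
        )

    def net_contribution(self) -> float:
        return self.read_savings - self.cache_tax()


# Hypothetical figures for one team in one period.
ledger = TeamCacheLedger("search", 1200.0, 150.0, 300.0, 50.0)
net = ledger.net_contribution()
```

Keeping the three cost categories as separate fields, rather than one pre-summed tax, lets the report show which category drove a negative net.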

Who pays when cache breaks

Cache breakage is rarely a single bad commit. It is usually a chain. A platform team upgrades a tool schema. A retrieval service starts injecting context one block earlier. A prompt template adds a timestamp inside what used to be the stable prefix. By the time the cache hit rate drops from seventy percent to single digits, three teams have touched the request.

An attribution model needs a rule for this. Two options work in practice. The first is last-touch: the team whose deploy correlates with the cache-hit-rate drop owns the incremental spend until hit rate recovers. The second is owner-of-prefix: the team that owns the cache key absorbs the cost regardless of who broke it, on the theory that the owner should have been monitoring drift on their own prefix. Last-touch is easier to defend in incident review. Owner-of-prefix creates better long-term incentives. A common hybrid is to use last-touch for the first seven days and shift to owner-of-prefix if the regression is not fixed by then.
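
The seven-day handoff between the two rules can be sketched as a small function. Identifying the last-touch team (correlating a deploy with the hit-rate drop) is assumed to happen elsewhere; the function and parameter names are hypothetical.

```python
from datetime import date


def responsible_team(regression_start, today, last_touch_team, prefix_owner,
                     grace_days=7):
    """Pick who absorbs the incremental spend for an open cache regression.

    Last-touch applies for the first `grace_days` days of the regression;
    after that the cost shifts to the owner of the prefix.
    """
    days_open = (today - regression_start).days
    if days_open < grace_days:
        return last_touch_team
    return prefix_owner


# Hypothetical incident: hit rate dropped on 1 May after a retrieval deploy.
start = date(2026, 5, 1)
early = responsible_team(start, date(2026, 5, 5), "retrieval", "platform")
late = responsible_team(start, date(2026, 5, 8), "retrieval", "platform")
```

Here day zero is the day the regression starts, so days zero through six fall under last-touch and day seven onward falls to the prefix owner.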

What the data has to look like

Attribution falls apart if your usage records cannot separate cache-read tokens from ordinary input tokens. Most providers expose the field; many pipelines drop it on the way into the warehouse because the original ingest schema predates caching. Before designing any chargeback, audit your token table for these columns: input tokens, cache-read tokens (some providers label these cached input tokens), cache-write tokens where applicable, output tokens, model, owner tag, and request surface. If the cache-read column is missing or always zero, fix the pipeline before promising anyone a report.
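
A rough version of that audit, assuming the token table has been loaded as a list of dictionaries. The column names are placeholders for whatever your warehouse actually uses, and the cache-write column is treated as optional because not every provider reports it.

```python
# Hypothetical column names; map these to your own schema.
REQUIRED_COLUMNS = [
    "input_tokens",
    "cache_read_tokens",
    "output_tokens",
    "model",
    "owner_tag",
    "request_surface",
]


def audit_token_table(rows):
    """Return a list of human-readable problems found in the usage rows."""
    problems = []
    present = set()
    for row in rows:
        present.update(row.keys())
    for column in REQUIRED_COLUMNS:
        if column not in present:
            problems.append(f"missing column: {column}")
    # A column that exists but is never populated is as bad as a missing one.
    if "cache_read_tokens" in present and all(
        not row.get("cache_read_tokens") for row in rows
    ):
        problems.append("cache_read_tokens is always zero or null")
    return problems


sample = [
    {"input_tokens": 900, "cache_read_tokens": 0, "output_tokens": 120,
     "model": "model-a", "owner_tag": "copilot", "request_surface": "chat"},
]
issues = audit_token_table(sample)
```

An empty result only means the columns exist and carry data; it does not confirm the values reconcile to the provider invoice.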

Cache-write tokens deserve their own line. Some providers bill a premium on the first request that writes a new cache entry. If a team's workload pattern forces frequent writes, it is paying a higher effective rate than the unit-price table suggests. That premium needs to land somewhere visible, not buried in "other input tokens."
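
To make the effect visible, a sketch like the following computes a blended input rate across ordinary, cache-read, and cache-write tokens. The prices are illustrative assumptions per million tokens (3.00 input, 0.30 cache read, 3.75 cache write); substitute your provider's actual rates, and note that some providers charge no write premium at all.

```python
def effective_input_rate(input_tokens, cache_read_tokens, cache_write_tokens,
                         input_price, cache_read_price, cache_write_price):
    """Blended price per million input-side tokens for one workload."""
    total_tokens = input_tokens + cache_read_tokens + cache_write_tokens
    if total_tokens == 0:
        return 0.0
    cost = (
        input_tokens * input_price
        + cache_read_tokens * cache_read_price
        + cache_write_tokens * cache_write_price
    )
    return cost / total_tokens


# Assumed prices per million tokens.
PRICES = dict(input_price=3.00, cache_read_price=0.30, cache_write_price=3.75)

# Mostly reads, few writes: the blended rate lands well below the input price.
stable = effective_input_rate(0, 900_000, 100_000, **PRICES)

# Every request writes a new entry: the blended rate exceeds the input price.
churning = effective_input_rate(0, 0, 1_000_000, **PRICES)
```

Comparing the blended rate against the plain input price, per team, is one way to surface the workloads where caching costs more than it saves.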

Customer-level allocation

For products sold to external customers, the same logic extends one layer down. A customer whose traffic rides cleanly on shared prefixes is cheaper to serve than one whose traffic forces unique long-tail prefixes. If you bill per token or per message at a flat rate, you are subsidizing the second customer with the first customer's efficiency. Whether that is a problem depends on your pricing strategy, but the FinOps team should at least be able to surface the gap.

The standard report is gross margin per customer with and without cache benefit. The "without" view shows what each customer would cost you at full input pricing. The "with" view shows what the customer costs at the rates the provider actually billed. The difference tells you how much of your margin depends on continued cache health for that account, which is the same as asking how exposed your unit economics are to one prefix going unstable.
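
The two views can be sketched for a single customer as follows. This is a simplification that ignores cache-write premiums and any non-LLM cost of serving the account; the prices per million tokens and the revenue figure are illustrative assumptions.

```python
def cache_margin_exposure(revenue, input_tokens, cache_read_tokens, output_tokens,
                          input_price, cache_read_price, output_price):
    """Compare a customer's gross margin with and without the cache discount."""
    mtok = 1_000_000
    output_cost = output_tokens * output_price / mtok
    # "With" view: cache-read tokens billed at the discounted rate.
    cost_with = (
        input_tokens * input_price + cache_read_tokens * cache_read_price
    ) / mtok + output_cost
    # "Without" view: every input token repriced at the full input rate.
    cost_without = (
        (input_tokens + cache_read_tokens) * input_price
    ) / mtok + output_cost
    return {
        "margin_with": revenue - cost_with,
        "margin_without": revenue - cost_without,
        "exposure": cost_without - cost_with,
    }


# Hypothetical customer: 100.00 revenue, heavy reuse of a shared prefix.
report = cache_margin_exposure(
    revenue=100.0,
    input_tokens=1_000_000,
    cache_read_tokens=10_000_000,
    output_tokens=1_000_000,
    input_price=3.00,
    cache_read_price=0.30,
    output_price=15.00,
)
```

The exposure figure is the amount of margin that would vanish for this account if its cached traffic fell back to full input pricing.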

Where this fits in a FinOps program

Prompt cache attribution sits between cost allocation and anomaly detection. It uses the same owner tags and consumer dimensions as the rest of the chargeback model, and it feeds the same monthly reports. Where it differs is in the time horizon. Cache hit rate moves on the order of deploys, not invoice cycles, so the attribution view needs a daily or weekly cadence. Monthly is too slow to catch the team that broke a shared prefix in week one.

Done well, cache attribution stops being a footnote in the savings dashboard and starts behaving like a real line item: each team knows what it contributes and what it draws, the platform owner can defend the investment, and the CFO can answer the question, "what happens to our margin if caching stops working next quarter?" That question has a number, and the number is the one that matters.
