Semantic cache economics

Updated 21 May 2026 · first published 21 May 2026

Most semantic-cache write-ups treat the cache as an engineering optimization: stand up pgvector, pick a similarity threshold, ship. From a FinOps seat the picture is different. A semantic cache is a shared cost center that taxes every request, returns savings unevenly across teams, and quietly changes the unit economics of every workload that flows through it. If you do not model it that way, the platform team absorbs the embedding spend, a handful of high-hit-rate features take the savings, and nobody can explain why blended cost-per-call moved.

This page lays out the economic model: how to compute cost-per-lookup, how to allocate the embedding tax fairly, how to use hit rate as a chargeback input, and how to decide when the cache is no longer worth running for a given team. The implementation choices - vector store, threshold tuning, scoping rules - are covered elsewhere. Here the question is who pays, who benefits, and what shows up on the invoice.

Cost-per-lookup as the primary metric

Total cache spend is the wrong number to watch. The cache lives or dies on cost-per-lookup, which has three components: the embedding call on the incoming request, the vector-store query, and the amortized write cost of populating the cache in the first place. For a workload using a small embedding model at roughly two cents per million tokens, an average request of 400 tokens costs around 0.0008 cents to embed. Add the vector lookup and the bookkeeping, and a typical lookup lands in the small fractions of a cent.
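The arithmetic above can be sketched in a few lines. This is a minimal illustration, not a pricing model: the embedding price and request size come from the figures in the text, while the vector-query and amortized-write costs are placeholder assumptions you would replace with your own measurements. All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LookupCostInputs:
    embed_price_cents_per_million_tokens: float  # from the text: roughly 2 cents
    avg_request_tokens: int                      # from the text: 400 tokens
    vector_query_cents: float                    # assumed placeholder value
    amortized_write_cents: float                 # assumed placeholder value


def embed_cost_cents(inputs: LookupCostInputs) -> float:
    """Cost of embedding one incoming request, in cents."""
    return (
        inputs.embed_price_cents_per_million_tokens
        * inputs.avg_request_tokens
        / 1_000_000
    )


def cost_per_lookup_cents(inputs: LookupCostInputs) -> float:
    """Sum of the three components: embed, vector query, amortized write."""
    return (
        embed_cost_cents(inputs)
        + inputs.vector_query_cents
        + inputs.amortized_write_cents
    )


example = LookupCostInputs(
    embed_price_cents_per_million_tokens=2.0,
    avg_request_tokens=400,
    vector_query_cents=0.002,     # illustrative only
    amortized_write_cents=0.001,  # illustrative only
)
print(f"embed: {embed_cost_cents(example):.4f} cents")
print(f"lookup: {cost_per_lookup_cents(example):.4f} cents")
```

Keeping the three components separate makes it easier to see which one moves when cost-per-lookup drifts.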

That sounds negligible until you compare it to the saved frontier call. A cached hit avoids a model invocation worth anywhere from one to twenty cents depending on the model and the response length. The break-even hit rate is the point where expected savings (hit rate times the cost of the avoided call) equal the embedding-plus-lookup cost, which is paid on every request, hits and misses alike. For most stacks that floor sits somewhere between two and five percent. Below it, the cache is taxing everyone to subsidize a tiny number of hits. Above it, every additional point of hit rate is margin.
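One way to make the break-even floor concrete, assuming a simple per-request model in which the lookup cost is paid on every request and the avoided call is only saved on a hit. The function names and the example figures are illustrative assumptions, not measured values.

```python
def break_even_hit_rate(lookup_cost_cents: float, saved_call_cents: float) -> float:
    """Hit rate at which expected savings equal the per-lookup cost.

    Derived from: hit_rate * saved_call_cents == lookup_cost_cents.
    """
    if saved_call_cents <= 0:
        raise ValueError("saved_call_cents must be positive")
    return lookup_cost_cents / saved_call_cents


def net_saving_per_request_cents(
    hit_rate: float, lookup_cost_cents: float, saved_call_cents: float
) -> float:
    """Expected saving per request after paying the lookup cost on every request."""
    return hit_rate * saved_call_cents - lookup_cost_cents


# Illustrative: a 0.05 cent lookup in front of a 1 cent model call.
floor = break_even_hit_rate(lookup_cost_cents=0.05, saved_call_cents=1.0)
print(f"break-even hit rate: {floor:.1%}")
```

Under this model the floor falls as the avoided call gets more expensive, which is why the same cache can clear break-even in front of a large model and fail it in front of a small one.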

The single most useful FinOps view is cost-per-lookup plotted against hit rate per endpoint, refreshed weekly. It exposes endpoints that are paying the embedding tax with nothing to show for it, and it gives you a defensible line for cutting cache scope when a workload drifts.

The embedding tax and how to allocate it

The embedding cost is the awkward part of cache accounting. It is incurred on every request, including misses. If the platform team pays for it out of a shared budget, three things go wrong. Teams that contribute the most miss traffic look free on the cost dashboard. Teams that benefit from high hit rates also look free, because their savings show up as the absence of model spend rather than as cache revenue. And the platform team carries a growing line item whose owner is nominally everyone, which in practice means no one defends it during budget reviews.

A workable allocation has three pieces. First, embed cost is allocated to the requesting workload at the moment of the lookup, the same way model tokens are. Second, the vector store cost is split by share of stored entries, refreshed monthly so a team that has not written anything new for a quarter stops paying for prior tenants' rows. Third, savings from hits are credited back to the workload that hit, so the dashboard shows net cache contribution per team rather than gross spend.
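The three-piece allocation can be sketched as a small aggregation over tagged lookup events. This is a sketch under stated assumptions: the `LookupEvent` shape, the `allocate` function, and the idea that each event already carries its own embed cost and avoided model cost are all hypothetical, chosen only to illustrate the rule.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class LookupEvent:
    workload_id: str           # same identifier the rest of the pipeline uses
    embed_cost: float          # charged on hits and misses alike
    hit: bool
    avoided_model_cost: float  # 0.0 on a miss


def allocate(events, stored_entries, vector_store_cost):
    """Return per-workload embed, store, savings, and net cache contribution."""
    statement = defaultdict(lambda: {"embed": 0.0, "store": 0.0, "savings": 0.0})

    # Pieces one and three: embed cost goes to the requester, savings to the hitter.
    for event in events:
        row = statement[event.workload_id]
        row["embed"] += event.embed_cost
        if event.hit:
            row["savings"] += event.avoided_model_cost

    # Piece two: vector store cost is split by share of stored entries.
    total_entries = sum(stored_entries.values())
    for workload, count in stored_entries.items():
        share = count / total_entries if total_entries else 0.0
        statement[workload]["store"] += vector_store_cost * share

    for row in statement.values():
        row["net"] = row["savings"] - row["embed"] - row["store"]
    return dict(statement)


events = [
    LookupEvent("search", 0.5, True, 8.0),
    LookupEvent("search", 0.5, False, 0.0),
    LookupEvent("docs", 0.5, False, 0.0),
]
result = allocate(events, stored_entries={"search": 30, "docs": 10}, vector_store_cost=4.0)
print(result)
```

Rerunning the store split each month with fresh entry counts is what keeps the second piece current.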

The catch is that this only works if the gateway tags every embed call and every lookup with the same workload identifier the rest of your pipeline uses. If the cache is upstream of your tagging, it is invisible to chargeback, and you will spend months arguing about whether the cache "really" saves money.

Hit rate as a chargeback input

Hit rate is the variable that turns the cache from infrastructure into a finance instrument. A workload running at sixty percent hit rate is paying roughly forty percent of its uncached cost plus a tiny embedding tax. A workload at five percent hit rate is paying ninety-five percent of its uncached cost plus the same tax. Both numbers belong on the chargeback statement.

The statement that resonates with engineering leaders looks like this: this feature spent X dollars on model calls, Y dollars on cache infrastructure, and avoided Z dollars in model calls due to the cache. Net contribution from the cache: Z minus Y. When that number is positive and large, the team is a sponsor of the cache and will defend it. When it is negative, the team should either improve hit rate, scope the cache differently, or drop out of the shared cache and have its embedding tax refunded.
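A minimal sketch of that statement line, assuming X, Y, and Z are already available per feature from your cost pipeline. The `CacheStatement` class and the example figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheStatement:
    feature: str
    model_spend: float    # X: dollars spent on model calls
    cache_spend: float    # Y: dollars spent on cache infrastructure
    avoided_spend: float  # Z: dollars of model calls avoided by cache hits

    @property
    def net_contribution(self) -> float:
        """Net contribution from the cache: Z minus Y."""
        return self.avoided_spend - self.cache_spend

    def render(self) -> str:
        return (
            f"{self.feature}: spent ${self.model_spend:,.2f} on model calls, "
            f"${self.cache_spend:,.2f} on cache infrastructure, "
            f"avoided ${self.avoided_spend:,.2f} in model calls. "
            f"Net contribution from the cache: ${self.net_contribution:,.2f}"
        )


statement = CacheStatement("support-bot", 1200.0, 90.0, 640.0)  # illustrative figures
print(statement.render())
```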

Reporting this way also exposes a quieter problem: workloads that hit the cache heavily can distort the apparent unit economics of the workloads that do not. If a high-hit-rate feature subsidizes the embedding tax for a low-hit-rate feature on the same shared backend, the low-hit-rate feature looks cheaper than it actually is. Per-workload chargeback unwinds that subsidy and forces a real decision about whether the cache scope is correct.

When the embedding overhead stops being worth it

Every cache has a population of workloads for which it is a tax with no return. The signals are blunt and easy to read once the accounting is in place. Hit rate sits under the break-even line for two consecutive weeks. Net cache contribution is negative on the chargeback. Cache writes outpace reads, meaning the workload is filling the store faster than it is querying it. Or the embedding spend for that workload exceeds five to ten percent of its total model spend, a rough ceiling above which the cache is no longer an optimization but a parallel cost surface.
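The four signals can be checked mechanically once the per-workload numbers exist. The sketch below assumes a hypothetical `WorkloadCacheStats` record, and the default ceiling of 5% is just the low end of the five-to-ten-percent range mentioned above, not a recommendation.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class WorkloadCacheStats:
    weekly_hit_rates: List[float]  # oldest first, most recent last
    break_even_hit_rate: float
    net_contribution: float        # from the chargeback statement
    cache_writes: int
    cache_reads: int
    embedding_spend: float
    model_spend: float


def exit_signals(stats: WorkloadCacheStats, embed_share_ceiling: float = 0.05) -> List[str]:
    """Return the signals from the text that currently fire for this workload."""
    signals = []

    last_two = stats.weekly_hit_rates[-2:]
    if len(last_two) == 2 and all(r < stats.break_even_hit_rate for r in last_two):
        signals.append("hit rate under break-even for two consecutive weeks")

    if stats.net_contribution < 0:
        signals.append("negative net cache contribution")

    if stats.cache_writes > stats.cache_reads:
        signals.append("cache writes outpace reads")

    if stats.model_spend > 0 and stats.embedding_spend / stats.model_spend > embed_share_ceiling:
        signals.append("embedding spend above ceiling share of model spend")

    return signals


struggling = WorkloadCacheStats([0.06, 0.02, 0.01], 0.03, -12.0, 900, 400, 8.0, 100.0)
print(exit_signals(struggling))
```

Returning the list of fired signals, rather than a single boolean, leaves the removal decision with the review rather than the script.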

When those signals fire, the right move is rarely to tune the threshold. It is to remove the workload from the cache, return the embedding budget to the team, and let them decide whether to rejoin once their traffic pattern changes. Workloads that produce long, fresh, low-repetition output - document analysis, code generation on private repositories, single-shot agentic plans - almost never recover. Workloads with a long tail of duplicated questions, stable retrieval contexts, or classification flows usually do.

The discipline is to make this a calendared review rather than a one-off cleanup. A quarterly cache review, owned by FinOps and signed off by the platform team, prevents the cache from accumulating dead tenants that quietly drag cost-per-lookup upward for everyone still on it.

Threshold changes are budget changes

Similarity threshold is usually treated as a quality knob. It is also a budget knob. Loosening the threshold raises hit rate and lowers net cost per request for the workloads that match more broadly, while raising the risk of bad answers that downstream teams may not feel until later. Tightening the threshold improves precision but moves spend back into the model column.

From an attribution standpoint, any threshold change should be logged with the cost impact it is expected to produce. A move from 0.95 to 0.92 on a customer-support endpoint is worth a dollar number, not just a quality note. Recording that estimate alongside the change gives finance a way to reconcile next month's invoice against the engineering decisions that drove it, which is the heart of useful LLM FinOps.
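A threshold change log entry might look like the sketch below. The record shape, field names, and every figure other than the 0.95 to 0.92 move are assumptions for illustration; the expected hit-rate change in particular is an estimate the owning team would supply, not something the code can derive.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ThresholdChange:
    endpoint: str
    old_threshold: float
    new_threshold: float
    changed_on: str                  # ISO date string
    expected_hit_rate_delta: float   # estimated by the team, e.g. +0.04
    monthly_requests: int
    avg_saved_call_dollars: float    # average cost of the model call a hit avoids

    def expected_monthly_impact_dollars(self) -> float:
        """Estimated change in monthly model spend avoided by the cache."""
        return (
            self.expected_hit_rate_delta
            * self.monthly_requests
            * self.avg_saved_call_dollars
        )

    def to_log_line(self) -> str:
        record = asdict(self)
        record["expected_monthly_impact_dollars"] = round(
            self.expected_monthly_impact_dollars(), 2
        )
        return json.dumps(record, sort_keys=True)


change = ThresholdChange(
    endpoint="customer-support",
    old_threshold=0.95,
    new_threshold=0.92,
    changed_on="2026-05-21",
    expected_hit_rate_delta=0.04,   # illustrative estimate
    monthly_requests=1_000_000,     # illustrative volume
    avg_saved_call_dollars=0.02,    # illustrative average
)
print(change.to_log_line())
```

Writing the estimate at change time, rather than reconstructing it later, is what lets finance compare the forecast against the following invoice.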

Cache cost is real cost

The shorthand version of this page: treat the semantic cache as a cost center with its own P&L. Allocate the embedding tax at the lookup, credit savings to the workload that hit, expose net contribution per team on the chargeback, and review the tenant list on a calendar. Cache infrastructure that is not allocated this way drifts into the platform budget, gets defended on vibes, and is the first thing cut when costs rise - which is the opposite of what the FinOps function should produce.
