Prompt caching explained

Prompt caching is a provider-native optimization: the LLM API stores the processed prefix of a prompt and bills that prefix at a discount when a later request reuses it. The provider processes the prefix once, keeps the resulting key-value (attention) cache, and reuses it across subsequent requests instead of recomputing it. Cache hits cut input token costs significantly: OpenAI bills cache-read input tokens at roughly 50% off the normal input price, while Anthropic bills them at roughly 90% off. The financial benefit depends on how stable your prompt prefix is and how often the same prefix repeats across requests.
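As a rough planning aid, the dependence on prefix stability and repeat rate can be written as one line of arithmetic. The sketch below is an illustrative estimate, not a provider formula: `prefix_share`, `hit_rate`, and `read_discount` are hypothetical parameter names, the workload numbers are made up, and any cache-write surcharge a provider may apply is ignored.

```python
def input_savings_fraction(prefix_share: float, hit_rate: float, read_discount: float) -> float:
    """Estimate the fraction of input spend saved by caching.

    prefix_share:  share of input tokens that sit in the stable prefix (0..1)
    hit_rate:      share of requests whose prefix is served from cache (0..1)
    read_discount: discount on cache-read tokens, e.g. 0.5 or 0.9
    """
    for name, value in (
        ("prefix_share", prefix_share),
        ("hit_rate", hit_rate),
        ("read_discount", read_discount),
    ):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    # Only prefix tokens on cache hits are discounted.
    return prefix_share * hit_rate * read_discount


# Same hypothetical workload at the two discount levels mentioned above.
at_50 = input_savings_fraction(prefix_share=0.8, hit_rate=0.5, read_discount=0.5)
at_90 = input_savings_fraction(prefix_share=0.8, hit_rate=0.5, read_discount=0.9)
print(f"{at_50:.0%} vs {at_90:.0%} of input spend saved")
```

The product form makes the point that a large discount does little if the prefix is a small share of the prompt or rarely repeats.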

How prompt caching works across providers

Structuring prompts for cache hits

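Cache hits require byte-identical prefixes. Small differences, such as an extra space or a reordered field, break the cache silently: the request still succeeds but is billed at the full input price. Put stable content first and variable content last so the shared prefix stays intact.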
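One way to check byte-identity in practice is to fingerprint the prefix before sending it. The sketch below is a minimal illustration under assumptions: `build_prompt` and `prefix_fingerprint` are hypothetical helpers, the prompt is modeled as a single string rather than any provider's message format, and deterministic JSON serialization is just one possible way to keep tool definitions stable.

```python
import hashlib
import json


def build_prompt(system_text: str, tools: list[dict], user_query: str) -> tuple[str, str]:
    """Return (stable_prefix, variable_suffix) as separate strings."""
    # sort_keys and fixed separators make the serialization deterministic,
    # so the same tool definitions always produce the same bytes.
    tools_json = json.dumps(tools, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    prefix = f"{system_text}\n{tools_json}\n"
    # Anything that changes per request (query, timestamps, IDs) goes last.
    suffix = user_query
    return prefix, suffix


def prefix_fingerprint(prefix: str) -> str:
    """SHA-256 of the prefix bytes; a changed value means the prefix drifted."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()


tools_a = [{"name": "search", "description": "Search docs"}]
tools_b = [{"description": "Search docs", "name": "search"}]  # same content, different key order

p1, _ = build_prompt("You are a support assistant.", tools_a, "Where is my order?")
p2, _ = build_prompt("You are a support assistant.", tools_b, "Cancel my plan.")
p3, _ = build_prompt("You are a support assistant. ", tools_a, "Where is my order?")  # trailing space

same_despite_key_order = prefix_fingerprint(p1) == prefix_fingerprint(p2)
same_despite_extra_space = prefix_fingerprint(p1) == prefix_fingerprint(p3)
print(same_despite_key_order, same_despite_extra_space)
```

Logging the fingerprint alongside each request makes prefix drift visible: if the value changes between deployments, the cached prefix is no longer being reused.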

Mistakes that silently disable caching

FinOps: tracking cache tokens for attribution

Cache-read tokens cost far less than normal input tokens. If your cost model and attribution do not track them separately, your savings estimates become fiction. Take a request with 8k cache-read tokens and 500 new input tokens: pricing all 8.5k tokens at the normal input rate overstates the input cost (by roughly 6.5x at a 90% discount), and that error corrupts your baseline. You need three token counts per request: input (normal), cache-read (cached prefix), and output. Apply the provider's cache-read discount to compute actual spend per request, and have your attribution model report both gross tokens and net cost.
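The three-count calculation can be sketched as follows. This is an illustration under stated assumptions, not a billing implementation: the `Usage` and `Pricing` types are hypothetical, the prices are placeholder values in USD per million tokens, and output tokens are set to zero because the example above does not specify them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Usage:
    input_tokens: int       # normal (uncached) input tokens
    cache_read_tokens: int  # tokens served from the cached prefix
    output_tokens: int


@dataclass(frozen=True)
class Pricing:
    # Hypothetical prices in USD per million tokens; substitute your provider's rates.
    input_per_mtok: float
    output_per_mtok: float
    cache_read_discount: float  # 0.9 means cache reads cost 10% of the input price


def net_cost(u: Usage, p: Pricing) -> float:
    """Actual spend: cache reads are billed at the discounted input price."""
    cache_rate = p.input_per_mtok * (1.0 - p.cache_read_discount)
    total = (
        u.input_tokens * p.input_per_mtok
        + u.cache_read_tokens * cache_rate
        + u.output_tokens * p.output_per_mtok
    )
    return total / 1_000_000


def conflated_cost(u: Usage, p: Pricing) -> float:
    """What you would compute if cache reads were counted as normal input."""
    total = (
        (u.input_tokens + u.cache_read_tokens) * p.input_per_mtok
        + u.output_tokens * p.output_per_mtok
    )
    return total / 1_000_000


# The request from the text: 8k cache-read tokens, 500 new input tokens.
usage = Usage(input_tokens=500, cache_read_tokens=8_000, output_tokens=0)
pricing = Pricing(input_per_mtok=3.0, output_per_mtok=15.0, cache_read_discount=0.9)

report = {
    "gross_tokens": usage.input_tokens + usage.cache_read_tokens + usage.output_tokens,
    "net_cost_usd": net_cost(usage, pricing),
    "conflated_cost_usd": conflated_cost(usage, pricing),
}
print(report)
```

Keeping `gross_tokens` and `net_cost_usd` side by side in the report is the point: the first describes volume, the second describes spend, and only the second should feed a savings baseline.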

Without separate tracking, teams celebrate cost savings that don't exist (because they're measuring gross tokens, not net cost) or fail to see real savings because their baseline is built on incorrect token pricing.

The golden rule

Stable prefix, variable suffix, byte-identical every time. Then measure cache-read tokens separately or your numbers lie.
