Prompt caching explained
Prompt caching is a provider-native optimization where LLM APIs store the processed prefix of a prompt and apply a discount when that prefix is re-read. The model provider parses the prefix once, computes embeddings and key-value caches, and reuses them across subsequent requests. Cache hits reduce token costs significantly—OpenAI discounts cache-read input tokens at roughly 50% of normal input cost, while Anthropic discounts them at roughly 90% of normal input cost. The financial benefit depends on how stable your prompt prefix is and how often the same prefix repeats across requests.
How prompt caching works across providers
- OpenAI. Automatic for sufficiently long prompts (typically ≥1024 tokens). The model tracks the longest common prefix across requests and caches it. No code change required, though you can optimize by putting stable content first. Cache reads cost approximately 50% of standard input tokens; cache writes cost slightly more than normal input to pay for caching overhead.
- Anthropic. Explicit cache control via breakpoints in the message list. Add
cache_control: {"type": "ephemeral"}to the last message block you want cached (usually the system prompt and tools). Cache reads are discounted to approximately 10% of normal input token cost. Cache writes cost 25% more than normal input. Requires structural awareness: you place the cache breakpoint, not the model. - Amazon Bedrock. Supported on select models (Anthropic Claude via Bedrock). Uses the same explicit cache_control breakpoint mechanism as Anthropic's native API. Availability depends on the model and region; check Bedrock's current documentation for supported configurations.
Structuring prompts for cache hits
Cache hits require byte-identical prefixes. Small differences break the cache silently. Use these patterns to maximize stability:
- System prompt and tools first, user content last. Put your stable instructions, tool definitions, and few-shot examples before the variable user request. The longer the stable prefix, the greater the savings.
- Byte-identical across requests. Any whitespace, formatting, or punctuation change invalidates the cache. Avoid dynamic timestamps, random IDs, or computed values in the prefix.
- Avoid per-user system prompts. If you customize the system prompt per user (e.g., "You are an assistant for Alice"), the cache will never hit. Instead, keep the system prompt universal and pass user context in the variable part of the message.
- Use deterministic JSON or markdown for tools. If your tool definitions are code-generated, ensure the generation is deterministic. Reordering fields, adding newlines, or changing indentation breaks the cache.
- Separate cache breakpoint from request boundary. Some systems split messages at request time. Ensure the split happens after your stable prefix, not in the middle of it.
Mistakes that silently disable caching
- Reordering tools or examples. If your orchestration layer re-sorts function definitions or few-shot examples on each request, the cache never hits even though the content is identical.
- Injecting dynamic data early in the prefix. Conversation IDs, request timestamps, or user agent strings in the system prompt or early messages break cache consistency across requests.
- Per-user system prompts. "You are helping {username}" or "Your tone should match {brand_voice}" in the system prompt means the prefix changes every request.
- Nondeterministic serialization. If your message serialization uses unordered dicts, set iteration, or floating-point rendering, identical logical content produces different bytes and misses the cache.
- Intermixing cached and uncached messages. Some frameworks make it easy to add preamble text conditionally. If your cache breakpoint floats or messages before it vary, cache hits become unreliable.
FinOps: tracking cache tokens for attribution
Cache-read tokens cost far less than normal input tokens. If you don't track them separately in your cost model and attribution, your savings estimates become fiction. When a request uses 8k cache-read tokens and 500 new input tokens, conflating them corrupts your baseline. You need three token counts: input (normal), cache-read (cached prefix), and output. Your cost calculation then uses the provider's discount to compute actual spend per request, and your attribution model reports both gross tokens and net cost.
Without separate tracking, teams celebrate cost savings that don't exist (because they're measuring gross tokens, not net cost) or fail to see real savings because their baseline is built on incorrect token pricing.
The golden rule
Stable prefix, variable suffix, byte-identical every time. Then measure cache-read tokens separately or your numbers lie.
Related
- Prompt cache attribution — how to split savings across teams.
- Semantic cache economics — caching beyond byte-matching.
- How LLM providers charge — cost mechanics across platforms.
Want this applied to your own LLM spend? FinOps LLM runs a free audit of your AI costs and shows where the savings are. Book free audit →