Input, output, cache, and reasoning tokens
When you read an LLM invoice, every dollar resolves to one of a few token types — and they do not cost the same. Treating them as one undifferentiated pile of "tokens" is the most common reason a cost baseline is wrong and a savings estimate is off. This is a plain-English guide to the four token types that matter, what each costs relative to the others, and which lever reduces each.
1. Input tokens
What they are: everything you send the model — the user's message, your system prompt, retrieved documents, tool definitions, and the prior conversation history.
Relative cost: the cheaper of the two billed-at-full-price types, but rarely small. Input is often the largest token count in retrieval-heavy and long-conversation workloads, so low unit price does not mean low total cost.
What drives it up: long system prompts, deep retrieval, full chat history replayed on every turn, oversized tool schemas.
Levers: prompt compression, trimming history, tightening retrieval depth, and moving stable prefixes into a prompt cache (see below).
2. Output tokens
What they are: everything the model generates back to you — the visible answer.
Relative cost: typically the most expensive token type per unit, often several times the price of input tokens on the same model.
What drives it up: high max tokens limits, verbose or "detailed" response styles, formats that pad output (large JSON, repeated boilerplate), and chatty multi-turn designs.
Levers: cap output length to what the task needs, prompt for concision, use structured outputs that don't repeat context, and pick a smaller model where the quality bar allows.
3. Cache-read tokens
What they are: input tokens served from a provider's prompt cache instead of processed fresh, because the same prefix was seen recently.
Relative cost: heavily discounted. As of 2026, roughly 50% off on OpenAI and up to about 90% off on Anthropic — always check current provider pricing. This is the one token type that is cheaper, and the discount is large enough to change a baseline materially.
Why they're dangerous to ignore: cache-read tokens look like ordinary input tokens unless you separate them. Count them at full price and you overstate spend; lose them (because a deploy changed a once-stable prompt) and your bill spikes with no traffic change. Tracking cache reads as their own line is fundamental to an accurate number.
Levers: stabilise system prompts and shared context so they cache; keep volatile content (timestamps, per-request data) out of the cached prefix.
4. Reasoning tokens
What they are: tokens a reasoning model generates internally while "thinking," before it produces the visible answer. They are billed like output but never shown to the user.
Relative cost: priced as output, and on hard prompts they can exceed the visible answer by several times. They are the token type most likely to surprise a team that budgeted only for the output it can see.
What drives it up: adopting a reasoning model, raising the reasoning budget, or pointing reasoning models at tasks that don't need them.
Levers: reserve reasoning models for tasks that genuinely benefit, tune the reasoning budget, and track reasoning tokens separately so their cost is visible.
Putting it together
A single request's cost is the sum of these four, each at its own price, multiplied by the model that handled it:
input (full price) + cache reads (discounted) + output + reasoning — × the model's per-token rate.
Two requests with the same total token count can cost very differently depending on the mix. A request that is mostly cache-read input is cheap; a request of the same size that is mostly output and reasoning is expensive. This is exactly why a cost baseline built on a single blended token price is unreliable — and why attribution has to break spend out by token type, not just by feature.
The one-line takeaway
Output and reasoning tokens are where the money usually goes; input is where the volume usually is; cache reads are the discount you must not lose. Track all four separately, per feature and per model, and the bill stops being a mystery.
Related
- How do LLM providers charge? — the full pricing picture.
- LLM token tracking — instrumenting token data in practice.
- Prompt cache attribution — getting cache reads right.
- Reasoning token attribution — the invisible line item.
Want this applied to your own LLM spend? FinOps LLM runs a free audit of your AI costs and shows where the savings are. Book free audit →