Input, output, cache, and reasoning tokens

When you read an LLM invoice, every dollar resolves to one of a few token types — and they do not cost the same. Treating them as one undifferentiated pile of "tokens" is the most common reason a cost baseline is wrong and a savings estimate is off. This is a plain-English guide to the four token types that matter, what each costs relative to the others, and which lever reduces each.

1. Input tokens

What they are: everything you send the model — the user's message, your system prompt, retrieved documents, tool definitions, and the prior conversation history.

Relative cost: cheaper per token than output, but rarely small in total. Input is often the largest token count in retrieval-heavy and long-conversation workloads, so low unit price does not mean low total cost.

What drives it up: long system prompts, deep retrieval, full chat history replayed on every turn, oversized tool schemas.

Levers: prompt compression, trimming history, tightening retrieval depth, and moving stable prefixes into a prompt cache (see below).
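As one illustration of the "trimming history" lever, here is a minimal sketch that keeps only the most recent messages that fit a token budget. The function names, the message shape (a dict with a "content" key), and the four-characters-per-token estimate are all assumptions made for the example; a real implementation would count tokens with the provider's own tokenizer.

```python
from typing import Dict, List


def estimate_tokens(text: str) -> int:
    # Crude assumption: about 4 characters per token.
    # Replace with the provider's tokenizer for real counts.
    return max(1, len(text) // 4)


def trim_history(messages: List[Dict[str, str]], budget: int) -> List[Dict[str, str]]:
    """Keep the newest messages whose estimated tokens fit within `budget`."""
    kept: List[Dict[str, str]] = []
    used = 0
    # Walk from newest to oldest, stopping at the first message that does not fit.
    for msg in reversed(messages):
        cost = estimate_tokens(msg["content"])
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    kept.reverse()  # restore chronological order
    return kept
```

Dropping whole messages from the oldest end is the simplest policy; summarising older turns instead is a common alternative when early context still matters.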

2. Output tokens

What they are: everything the model generates back to you — the visible answer.

Relative cost: typically the most expensive token type per unit, often several times the price of input tokens on the same model.

What drives it up: high max-token limits, verbose or "detailed" response styles, formats that pad output (large JSON, repeated boilerplate), and chatty multi-turn designs.

Levers: cap output length to what the task needs, prompt for concision, use structured outputs that don't repeat context, and pick a smaller model where the quality bar allows.

3. Cache-read tokens

What they are: input tokens served from a provider's prompt cache instead of processed fresh, because the same prefix was seen recently.

Relative cost: heavily discounted. As of 2026, roughly 50% off on OpenAI and up to about 90% off on Anthropic; always check current provider pricing. This is the one token type priced below standard input, and the discount is large enough to change a baseline materially.
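To see how much the discount moves a baseline, the sketch below computes a blended input rate from a cache hit fraction. The function name, parameters, and the rate of 1.00 per million tokens are hypothetical, and the sketch ignores any cache-write charges a provider may apply.

```python
def effective_input_rate(base_rate: float, cache_hit_fraction: float, cache_discount: float) -> float:
    """Blended price per input token when part of the input is served from cache.

    base_rate: full input price (any unit, e.g. dollars per million tokens)
    cache_hit_fraction: share of input tokens read from cache, 0.0 to 1.0
    cache_discount: discount on cached tokens, 0.0 to 1.0 (0.9 means 90% off)
    """
    if not 0.0 <= cache_hit_fraction <= 1.0:
        raise ValueError("cache_hit_fraction must be between 0 and 1")
    if not 0.0 <= cache_discount <= 1.0:
        raise ValueError("cache_discount must be between 0 and 1")
    cached_rate = base_rate * (1.0 - cache_discount)
    return (1.0 - cache_hit_fraction) * base_rate + cache_hit_fraction * cached_rate


# Hypothetical numbers: 80% of input tokens hit the cache.
rate_at_90_off = effective_input_rate(1.00, 0.8, 0.9)  # about 0.28
rate_at_50_off = effective_input_rate(1.00, 0.8, 0.5)  # about 0.60
```

Under these assumed numbers, the same workload pays roughly 28% or 60% of the full input price depending on the discount, which is why a baseline that prices cached tokens at the full rate overstates spend.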

Why they're dangerous to ignore: cache-read tokens look like ordinary input tokens unless you separate them. Count them at full price and you overstate spend; lose them (because a deploy changed a once-stable prompt) and your bill spikes with no traffic change. Tracking cache reads as their own line is fundamental to an accurate number.

Levers: stabilise system prompts and shared context so they cache; keep volatile content (timestamps, per-request data) out of the cached prefix.
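One way to act on this is to assemble prompts so the stable part always comes first, and to fingerprint that stable part so a deploy that changes it is noticed. The sketch below is illustrative only: the function names and the separator are assumptions, and actual cache behaviour (minimum prefix length, expiry, how prefixes are matched) is provider-specific.

```python
import hashlib


def build_prompt(stable_prefix: str, volatile_suffix: str) -> str:
    """Put stable content first and per-request content last,
    so that the leading portion of every prompt is identical."""
    return stable_prefix + "\n\n" + volatile_suffix


def prefix_fingerprint(stable_prefix: str) -> str:
    """Short hash of the stable prefix. Logging it alongside each request makes
    an unintended prefix change (and a likely cache miss) easy to spot."""
    return hashlib.sha256(stable_prefix.encode("utf-8")).hexdigest()[:12]


SYSTEM = "You are a support assistant. Answer using the policy document."

# Good: the timestamp lives in the suffix, so the fingerprint stays the same.
good_a = prefix_fingerprint(SYSTEM)
good_b = prefix_fingerprint(SYSTEM)

# Bad: a timestamp embedded in the prefix changes it on every request.
bad_a = prefix_fingerprint(SYSTEM + " Current time: 10:00")
bad_b = prefix_fingerprint(SYSTEM + " Current time: 10:01")
```

Comparing the fingerprint before and after a deploy gives an early warning that a once-stable prompt has changed, before the change shows up on the bill.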

4. Reasoning tokens

What they are: tokens a reasoning model generates internally while "thinking," before it produces the visible answer. They are billed like output but never shown to the user.

Relative cost: priced as output, and on hard prompts they can exceed the visible answer by several times. They are the token type most likely to surprise a team that budgeted only for the output it can see.
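A small worked example shows the size of the surprise. The numbers are hypothetical and the function names are invented for illustration; the only assumption carried over from the text is that reasoning tokens are billed at the output rate.

```python
def billed_output_tokens(visible_tokens: int, reasoning_tokens: int) -> int:
    """Tokens charged at the output rate: the visible answer plus hidden reasoning."""
    return visible_tokens + reasoning_tokens


def underestimate_factor(visible_tokens: int, reasoning_tokens: int) -> float:
    """How many times larger the billed output is than the visible output."""
    if visible_tokens <= 0:
        raise ValueError("visible_tokens must be positive")
    return billed_output_tokens(visible_tokens, reasoning_tokens) / visible_tokens


# Hypothetical request: a 300-token answer preceded by 1,200 reasoning tokens.
billed = billed_output_tokens(300, 1200)   # 1500
factor = underestimate_factor(300, 1200)   # 5.0
```

A budget built from the 300 visible tokens alone would cover only a fifth of the output-priced tokens in this example.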

What drives it up: adopting a reasoning model, raising the reasoning budget, or pointing reasoning models at tasks that don't need them.

Levers: reserve reasoning models for tasks that genuinely benefit, tune the reasoning budget, and track reasoning tokens separately so their cost is visible.

Putting it together

A single request's cost is the sum of these four, each multiplied by its own per-token rate on the model that handled it:

cost = (input × input rate) + (cache reads × discounted cache-read rate) + (output × output rate) + (reasoning × output rate)

Two requests with the same total token count can cost very differently depending on the mix. A request that is mostly cache-read input is cheap; a request of the same size that is mostly output and reasoning is expensive. This is exactly why a cost baseline built on a single blended token price is unreliable — and why attribution has to break spend out by token type, not just by feature.
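The comparison can be sketched in a few lines. Everything here is an assumption for illustration: the class and field names, the rates (quoted per million tokens), and the token counts. "Input" means uncached input only, and reasoning is charged at the output rate as described earlier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Usage:
    input_tokens: int        # uncached input only
    cache_read_tokens: int
    output_tokens: int       # visible answer
    reasoning_tokens: int    # hidden, billed at the output rate


@dataclass(frozen=True)
class Rates:
    # Hypothetical prices per million tokens for one model.
    input: float
    cache_read: float
    output: float


def request_cost(usage: Usage, rates: Rates) -> float:
    """Sum each token type at its own rate, then convert from per-million pricing."""
    total = (
        usage.input_tokens * rates.input
        + usage.cache_read_tokens * rates.cache_read
        + (usage.output_tokens + usage.reasoning_tokens) * rates.output
    )
    return total / 1_000_000


rates = Rates(input=1.00, cache_read=0.10, output=4.00)

# Both requests total 10,000 tokens; only the mix differs.
mostly_cached = Usage(input_tokens=500, cache_read_tokens=9000, output_tokens=500, reasoning_tokens=0)
mostly_generated = Usage(input_tokens=1000, cache_read_tokens=0, output_tokens=3000, reasoning_tokens=6000)

cheap = request_cost(mostly_cached, rates)       # about 0.0034
costly = request_cost(mostly_generated, rates)   # about 0.037
```

With these assumed rates the second request costs roughly eleven times the first, despite the identical token total. A single blended price would assign both the same cost.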

The one-line takeaway

Output and reasoning tokens are where the money usually goes; input is where the volume usually is; cache reads are the discount you must not lose. Track all four separately, per feature and per model, and the bill stops being a mystery.
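A minimal sketch of that tracking, assuming each request is logged as a record with a feature name, a model name, and one count per token type. The record shape and field names are hypothetical and would need to be mapped from whatever the provider's usage data actually reports.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

TOKEN_TYPES = ("input", "cache_read", "output", "reasoning")


def aggregate_usage(records: Iterable[dict]) -> Dict[Tuple[str, str, str], int]:
    """Total tokens keyed by (feature, model, token_type)."""
    totals: Dict[Tuple[str, str, str], int] = defaultdict(int)
    for rec in records:
        for token_type in TOKEN_TYPES:
            # A missing count is treated as zero.
            totals[(rec["feature"], rec["model"], token_type)] += rec.get(token_type, 0)
    return dict(totals)


# Hypothetical log records.
records = [
    {"feature": "search", "model": "small", "input": 200, "cache_read": 1800, "output": 100},
    {"feature": "search", "model": "small", "input": 300, "cache_read": 1700, "output": 150},
    {"feature": "agent", "model": "large", "input": 1000, "output": 400, "reasoning": 2500},
]
totals = aggregate_usage(records)
```

Pricing each key at its own rate, rather than summing the counts first, is what keeps the four token types visible in the final number.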
