What are the two prices on every model?

Input tokens are everything you send the model (prompt, system instructions, documents, conversation history) and are typically the cheaper of the two. Output tokens are everything the model generates and are typically several times more expensive per token. This asymmetry matters because the two costs are driven by different factors.

What are cache reads, and why do teams forget the discount?

Major providers can serve repeated prompt prefixes from a cache rather than processing them fresh. They bill cache-read tokens at a steep discount - roughly 50% on OpenAI and up to 90% on Anthropic. Most teams fail to track cache reads separately and miss significant savings opportunities.

What is the per-request math?

Cost = (input tokens at full price) + (cache-read tokens at the discounted price) + (output tokens, including reasoning tokens), with each term priced at the rates of the specific model used. Understanding this formula lets you predict costs and identify optimization opportunities.

How do LLM providers charge?

What are model tiers and why do they matter?

Providers offer a ladder from small/cheap to large/capable models. The price gap is enormous - frequently more than an order of magnitude per token between the top and bottom. The same task can cost 10-50× more depending on which model tier handles it.

What are reasoning tokens and why are they hidden?

Reasoning models generate internal 'thinking' before producing visible output. Those reasoning tokens are billed like output tokens but never shown to the user. On hard prompts they can dominate cost - sometimes several times the size of the visible answer.

Updated 11 June 2026 · first published 31 May 2026

Almost every large-language-model provider bills the same way underneath the marketing: you pay per token, and the price per token depends on which model you call and whether the token was read in, generated out, or served from a cache. There is no seat license and no flat monthly fee on the usage-based APIs - the meter runs on every request. Understanding that meter is the difference between a predictable AI bill and a monthly surprise.

This is a plain-English explainer for anyone - finance, product, or engineering - who has looked at an OpenAI or Anthropic invoice and wanted to know what the numbers actually represent.

The unit: a token

A token is a chunk of text, usually a few characters long. As a rough rule of thumb, a token is about three-quarters of a word in English, so 1,000 tokens is roughly 750 words. Providers count tokens, not words or characters, and they publish prices per million tokens. Everything on your invoice traces back to how many tokens of each type your workloads consumed.
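To make the rule of thumb concrete, here is a minimal sketch of the arithmetic. The 0.75 words-per-token ratio comes from the paragraph above; the price is a made-up figure for illustration, not a real provider rate, and a real tokenizer will produce different counts.

```python
# Rule of thumb from the text: one token is about three-quarters of an English word.
WORDS_PER_TOKEN = 0.75

# Hypothetical price in USD per million tokens (illustrative only).
HYPOTHETICAL_PRICE_PER_MILLION = 2.00


def estimate_tokens(word_count: int) -> int:
    """Rough token estimate for English text; real tokenizers will differ."""
    return round(word_count / WORDS_PER_TOKEN)


def cost_usd(tokens: int, price_per_million: float) -> float:
    """Convert a token count into dollars at a per-million-token price."""
    return tokens / 1_000_000 * price_per_million


tokens = estimate_tokens(750)
print(tokens)  # 1000
print(cost_usd(tokens, HYPOTHETICAL_PRICE_PER_MILLION))
```

Because prices are quoted per million tokens, a single request usually costs a fraction of a cent. The bill comes from volume.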

The two prices on every model: input and output

Every model has at least two prices, and they are not the same number:

Input tokens - everything you send the model: your prompt, the system instructions, any retrieved documents, and the prior conversation history. These are usually the cheaper of the two.
Output tokens - everything the model generates back. Output is typically several times more expensive per token than input.

This asymmetry matters because the two costs are driven by different things. Input cost grows when you stuff more context into the prompt - long system prompts, deep retrieval, full conversation history. Output cost grows when the model writes more - verbose answers, high max-token limits that allow long responses, or formats that pad the response. A request that reads a lot and writes a little has a very different cost shape from one that reads a little and writes a lot, even on the same model.
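The difference in cost shape can be sketched with two requests on the same model. The prices below are hypothetical (output set to four times input purely for illustration), and the token counts are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPrice:
    """Hypothetical USD prices per million tokens for one model."""
    input_per_million: float
    output_per_million: float


# Illustrative only: output priced at 4x input.
PRICE = ModelPrice(input_per_million=1.00, output_per_million=4.00)


def cost_split(input_tokens: int, output_tokens: int, price: ModelPrice = PRICE):
    """Return (input_cost, output_cost) in USD for one request."""
    input_cost = input_tokens / 1_000_000 * price.input_per_million
    output_cost = output_tokens / 1_000_000 * price.output_per_million
    return input_cost, output_cost


def output_share(input_tokens: int, output_tokens: int) -> float:
    """Fraction of the request's cost that comes from output tokens."""
    input_cost, output_cost = cost_split(input_tokens, output_tokens)
    return output_cost / (input_cost + output_cost)


# Reads a lot, writes a little (e.g. retrieval over long documents).
print(output_share(20_000, 500))
# Reads a little, writes a lot (e.g. long-form generation).
print(output_share(500, 5_000))
```

In the first request almost all of the cost is input; in the second almost all of it is output. Trimming the prompt helps the first and does little for the second.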

Tiers: the same task can cost 10–50× more depending on the model

Providers offer a ladder of models, from small and cheap to large and capable, and the price gap between the top and bottom of that ladder is enormous - frequently more than an order of magnitude per token. The most common avoidable cost in production is sending every request to the most capable (most expensive) model when a smaller one would have answered just as well.

That is the entire premise of model routing: match each request to the cheapest model that still meets your quality bar, instead of defaulting everything to the flagship. Because the per-token gap between tiers is so large, getting routing right is often the single biggest lever on the bill.
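A minimal routing sketch, under stated assumptions: the tier names, prices, and quality scores are all hypothetical, and the quality score stands in for whatever evaluation you use to define your quality bar. Real routers usually classify each request first. This only shows the selection rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    price_per_million: float  # hypothetical blended USD price
    quality: float            # hypothetical eval score between 0 and 1


# Illustrative ladder; the 50x gap between top and bottom is an assumption.
TIERS = [
    Tier("small", 0.20, 0.70),
    Tier("medium", 1.00, 0.85),
    Tier("flagship", 10.00, 0.95),
]


def route(required_quality: float, tiers=TIERS) -> Tier:
    """Pick the cheapest tier that meets the quality bar.

    If no tier meets the bar, fall back to the highest-quality tier.
    """
    eligible = [t for t in tiers if t.quality >= required_quality]
    if not eligible:
        return max(tiers, key=lambda t: t.quality)
    return min(eligible, key=lambda t: t.price_per_million)


print(route(0.60).name)  # small
print(route(0.80).name)  # medium
print(route(0.90).name)  # flagship
```

Defaulting every request to the flagship is equivalent to calling `route` with the highest bar every time, whether or not the request needs it.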

Cache reads: the discount most teams forget to track

Major providers can serve a repeated prompt prefix from a cache rather than processing it fresh, and they bill those cache-read tokens at a steep discount. As of 2026 the discount is roughly 50% on OpenAI and up to about 90% on Anthropic, though you should always check current provider pricing.

The trap is that cache reads look like ordinary input tokens unless you separate them. If you build a cost baseline that treats every input token at full price, you will overstate your spend and misjudge your savings. Conversely, if a deploy quietly breaks cache hits - by changing a system prompt that used to be stable - your bill can climb sharply with no change in traffic. Tracking cache reads as their own line is fundamental to an accurate baseline.
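The gap between a naive baseline and a cache-aware one can be sketched as follows. The input price and the 50% discount are assumptions for illustration; check your provider's current pricing for the real figures.

```python
# Hypothetical values (illustrative only).
INPUT_PRICE_PER_MILLION = 1.00  # USD per million uncached input tokens
CACHE_READ_DISCOUNT = 0.50      # fraction taken off the input price for cache reads


def input_cost(fresh_tokens: int, cache_read_tokens: int) -> float:
    """USD cost of input, billing cache reads at the discounted rate."""
    fresh = fresh_tokens / 1_000_000 * INPUT_PRICE_PER_MILLION
    cached = (
        cache_read_tokens / 1_000_000
        * INPUT_PRICE_PER_MILLION
        * (1 - CACHE_READ_DISCOUNT)
    )
    return fresh + cached


total_input = 10_000_000
cache_reads = 8_000_000

# Naive baseline: every input token treated as full price.
naive = input_cost(total_input, 0)
# Cache-aware baseline: cache reads tracked as their own line.
actual = input_cost(total_input - cache_reads, cache_reads)
# Same traffic after a deploy that breaks cache hits.
after_cache_break = input_cost(total_input, 0)

print(naive, actual, after_cache_break)
```

With these numbers the naive baseline overstates spend (10.00 against 6.00), and a broken cache raises the real bill back to the naive figure with no change in traffic.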

Reasoning tokens: billed, but invisible

Reasoning models generate a chain of internal "thinking" before they produce a visible answer. Those reasoning tokens are billed like output tokens but never shown to the user. On hard prompts they can dominate the cost of a request - sometimes several times the size of the visible answer. Teams that adopt a reasoning model and only budget for the output they can see are routinely surprised by the invoice.
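As a rough illustration of the budgeting gap, the sketch below compares the cost of the visible answer alone with the cost once reasoning tokens are added. The output price and token counts are hypothetical.

```python
# Hypothetical output price in USD per million tokens (illustrative only).
OUTPUT_PRICE_PER_MILLION = 10.00


def output_cost(visible_tokens: int, reasoning_tokens: int = 0) -> float:
    """USD cost of output, with reasoning tokens billed at the output rate."""
    billed_tokens = visible_tokens + reasoning_tokens
    return billed_tokens / 1_000_000 * OUTPUT_PRICE_PER_MILLION


# Budget based only on the answer the user sees.
visible_only = output_cost(400)
# What is billed when the model also produced 1,600 reasoning tokens.
with_reasoning = output_cost(400, reasoning_tokens=1_600)

print(with_reasoning / visible_only)
```

Here the reasoning tokens are four times the size of the visible answer, so the billed output is about five times what a visible-only budget would predict.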

The other modifiers

Batch APIs. Most providers offer an asynchronous batch mode at a discount (often around half price) for work that can tolerate delay - evals, backfills, bulk summaries.
Context window. A larger context window is not more expensive by itself, but filling it is: every extra token of context you send is billed as input. Bigger is not free.
Multimodal. Images, audio, and video are converted into tokens too, and multimodal costs can climb faster than text.
Built-in tools and retrieval. Provider-hosted tools, web search, and file search often carry their own per-call charges on top of token cost.
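Two of these modifiers are easy to sketch as adjustments to a request's token cost. The 50% batch discount and the per-call tool fee are hypothetical values, and the sketch assumes the batch discount applies only to token cost, not to tool charges. Providers may handle this differently.

```python
# Hypothetical values (illustrative only).
BATCH_DISCOUNT = 0.50     # fraction taken off token cost in batch mode
TOOL_CALL_FEE_USD = 0.01  # flat fee per provider-hosted tool call


def adjusted_cost(token_cost_usd: float, batch: bool = False, tool_calls: int = 0) -> float:
    """Apply the batch discount to token cost, then add per-call tool fees."""
    discounted = token_cost_usd * (1 - BATCH_DISCOUNT) if batch else token_cost_usd
    return discounted + tool_calls * TOOL_CALL_FEE_USD


# A bulk summarization job that can wait: half price in batch mode.
print(adjusted_cost(100.0, batch=True))  # 50.0
# An interactive request that makes three hosted tool calls.
print(adjusted_cost(1.0, tool_calls=3))
```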

The per-request math

Put together, the cost of a single request is roughly:

(input tokens at full price) + (cache-read tokens at the discounted price) + (output tokens, including any reasoning tokens), with each term priced at the per-token rate of the specific model that handled the request.
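The per-request formula can be written as a small function. This is a sketch under stated assumptions: the model names and prices are invented for illustration, `input_tokens` means uncached input only, and reasoning tokens are billed at the output rate as described earlier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """Hypothetical USD prices per million tokens for one model."""
    input: float
    cache_read: float
    output: float


# Illustrative price table; names and numbers are assumptions.
PRICES = {
    "small-model": Rates(input=0.20, cache_read=0.10, output=0.80),
    "large-model": Rates(input=5.00, cache_read=2.50, output=20.00),
}


def request_cost(
    model: str,
    input_tokens: int,       # uncached input tokens
    cache_read_tokens: int,  # input tokens served from cache
    output_tokens: int,      # visible output tokens
    reasoning_tokens: int = 0,
) -> float:
    """USD cost of one request, pricing each token type at the model's rate."""
    rates = PRICES[model]
    billed_output = output_tokens + reasoning_tokens
    return (
        input_tokens * rates.input
        + cache_read_tokens * rates.cache_read
        + billed_output * rates.output
    ) / 1_000_000


shape = dict(input_tokens=2_000, cache_read_tokens=8_000,
             output_tokens=500, reasoning_tokens=1_500)
print(request_cost("large-model", **shape))
print(request_cost("small-model", **shape))
```

With this made-up price table the same token shape costs about 25 times more on the large model than on the small one.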

Your monthly invoice is that calculation summed across every request, every model, every day. The reason invoices feel unpredictable is that all five variables - volume, model mix, prompt size, output length, and cache-hit rate - move independently, and a single code change can shift any of them without touching the others.

Why this matters for budgeting

Because providers bill per token and per model, you cannot control AI cost from a dashboard that only shows total spend. You control it by knowing which features drive which token shapes on which models - that is cost attribution - and then pulling the levers that the pricing model exposes: route to cheaper models, cache stable prefixes, compress prompts, and batch the work that can wait. The invoice is just the sum of those choices.
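Cost attribution can be sketched as a group-by over a request log. The log fields, feature names, model names, and prices here are all hypothetical. Cache reads and reasoning tokens are left out to keep the example short.

```python
from collections import defaultdict

# Hypothetical USD prices per million tokens, keyed by model name.
PRICE_PER_MILLION = {
    "small-model": {"input": 0.20, "output": 0.80},
    "large-model": {"input": 5.00, "output": 20.00},
}

# Hypothetical request log: one record per API call, tagged with the feature.
REQUESTS = [
    {"feature": "search", "model": "small-model", "input": 3_000, "output": 200},
    {"feature": "search", "model": "small-model", "input": 2_000, "output": 300},
    {"feature": "report", "model": "large-model", "input": 10_000, "output": 2_000},
]


def attribute(requests):
    """Sum USD cost per (feature, model) pair."""
    totals = defaultdict(float)
    for r in requests:
        price = PRICE_PER_MILLION[r["model"]]
        cost = (r["input"] * price["input"] + r["output"] * price["output"]) / 1_000_000
        totals[(r["feature"], r["model"])] += cost
    return dict(totals)


for key, usd in attribute(REQUESTS).items():
    print(key, round(usd, 6))
```

A total-spend dashboard would show only the sum of these rows. The per-feature breakdown shows that one "report" request costs far more than both "search" requests combined, which tells you where routing or prompt compression would pay off first.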

