How do LLM providers charge?
Almost every large-language-model provider bills the same way underneath the marketing: you pay per token, and the price per token depends on which model you call and whether the token was read in, generated out, or served from a cache. There is no seat license and no flat monthly fee on the usage-based APIs — the meter runs on every request. Understanding that meter is the difference between a predictable AI bill and a monthly surprise.
This is a plain-English explainer for anyone — finance, product, or engineering — who has looked at an OpenAI or Anthropic invoice and wanted to know what the numbers actually represent.
The unit: a token
A token is a chunk of text, usually a few characters long. As a rough rule of thumb, a token is about three-quarters of a word in English, so 1,000 tokens is roughly 750 words. Providers count tokens, not words or characters, and they publish prices per million tokens. Everything on your invoice traces back to how many tokens of each type your workloads consumed.
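The rule of thumb above can be turned into a quick estimate. This is a minimal sketch: the function names are illustrative, the 0.75 words-per-token ratio is the approximation from the text (real tokenizers vary by language and content), and the price is a made-up placeholder rather than any provider's rate.

```python
def estimate_tokens(word_count: int, words_per_token: float = 0.75) -> int:
    """Approximate token count for English text from a word count."""
    return round(word_count / words_per_token)


def cost_usd(tokens: int, price_per_million_usd: float) -> float:
    """Providers quote prices per million tokens, so scale accordingly."""
    return tokens * price_per_million_usd / 1_000_000


HYPOTHETICAL_PRICE_PER_M = 3.00  # placeholder, not a real provider price

tokens = estimate_tokens(750)
print(tokens)                                              # 1000
print(f"${cost_usd(tokens, HYPOTHETICAL_PRICE_PER_M):.4f}")  # $0.0030
```

For anything beyond a rough estimate, use the token counts the provider reports in each API response rather than a word-based guess.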
The two prices on every model: input and output
Every model has at least two prices, and they are not the same number:
- Input tokens — everything you send the model: your prompt, the system instructions, any retrieved documents, and the prior conversation history. These are usually the cheaper of the two.
- Output tokens — everything the model generates back. Output is typically several times more expensive per token than input.
This asymmetry matters because the two costs are driven by different things. Input cost grows when you put more context into the prompt: long system prompts, deep retrieval, full conversation history. Output cost grows when the model writes more: verbose answers, generous output-token limits (the max tokens setting), or formats that pad the response. A request that reads a lot and writes a little has a very different cost shape from one that reads a little and writes a lot, even on the same model.
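To make the two cost shapes concrete, here is a small sketch that splits one request's cost into its input and output parts. The Prices class, the split_cost helper, and the 3.00 / 15.00 per-million prices are all assumptions chosen for illustration, not figures from any provider.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prices:
    input_per_m: float   # USD per million input tokens (hypothetical)
    output_per_m: float  # USD per million output tokens (hypothetical)


def split_cost(prices: Prices, input_tokens: int, output_tokens: int):
    """Return (input_cost, output_cost) in USD for one request."""
    return (
        input_tokens * prices.input_per_m / 1_000_000,
        output_tokens * prices.output_per_m / 1_000_000,
    )


EXAMPLE = Prices(input_per_m=3.00, output_per_m=15.00)

for label, in_tok, out_tok in [
    ("read-heavy", 20_000, 500),    # long context, short answer
    ("write-heavy", 500, 4_000),    # short prompt, long answer
]:
    in_cost, out_cost = split_cost(EXAMPLE, in_tok, out_tok)
    share = out_cost / (in_cost + out_cost)
    print(f"{label}: total ${in_cost + out_cost:.4f}, output share {share:.0%}")
```

With these placeholder prices both requests cost between six and seven cents, yet output accounts for about a tenth of the first and nearly all of the second, so the fix for each is different.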
Tiers: the same task can cost 10–50× more depending on the model
Providers offer a ladder of models, from small and cheap to large and capable, and the price gap between the top and bottom of that ladder is enormous — frequently more than an order of magnitude per token. The most common avoidable cost in production is sending every request to the most capable (most expensive) model when a smaller one would have answered just as well.
That is the entire premise of model routing: match each request to the cheapest model that still meets your quality bar, instead of defaulting everything to the flagship. Because the per-token gap between tiers is so large, getting routing right is often the single biggest lever on the bill.
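A routing rule can be sketched in a few lines. Everything here is hypothetical: the tier names, the single blended price per million tokens, and above all the integer difficulty score, which stands in for whatever classifier or heuristic a real system would use to judge a request.

```python
# Cheapest first. Names, prices, and difficulty ceilings are hypothetical.
TIERS = [
    {"name": "small", "price_per_m": 0.50, "max_difficulty": 1},
    {"name": "medium", "price_per_m": 3.00, "max_difficulty": 2},
    {"name": "flagship", "price_per_m": 15.00, "max_difficulty": 3},
]


def route(difficulty: int) -> dict:
    """Pick the cheapest tier whose ceiling covers the request's difficulty."""
    for tier in TIERS:
        if difficulty <= tier["max_difficulty"]:
            return tier
    return TIERS[-1]  # anything harder than expected goes to the top tier


def workload_cost(requests, pick_tier) -> float:
    """requests: iterable of (difficulty, total_tokens) pairs."""
    return sum(
        tokens * pick_tier(difficulty)["price_per_m"] / 1_000_000
        for difficulty, tokens in requests
    )


# 1,000 requests of 2,000 tokens each, mostly easy.
requests = [(1, 2_000)] * 900 + [(2, 2_000)] * 80 + [(3, 2_000)] * 20

always_flagship = workload_cost(requests, lambda difficulty: TIERS[-1])
routed = workload_cost(requests, route)
print(f"flagship only: ${always_flagship:.2f}, routed: ${routed:.2f}")
```

The hard part in practice is the difficulty estimate and the quality check behind it, not the lookup; the sketch only shows why the savings can be large once that estimate exists.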
Cache reads: the discount most teams forget to track
Major providers can serve a repeated prompt prefix from a cache rather than processing it fresh, and they bill those cache-read tokens at a steep discount. As of 2026 the discount is roughly 50% on OpenAI and up to about 90% on Anthropic, though you should always check current provider pricing.
The trap is that cache reads look like ordinary input tokens unless you separate them. If you build a cost baseline that treats every input token at full price, you will overstate your spend and misjudge your savings. Conversely, if a deploy quietly breaks cache hits — by changing a system prompt that used to be stable — your bill can climb sharply with no change in traffic. Tracking cache reads as their own line is fundamental to an accurate baseline.
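The effect on a baseline is easy to see with numbers. In this sketch the input price, the 90% discount, and the 80% cache-hit share are assumptions picked for illustration; input_cost is a hypothetical helper, and any surcharge a provider applies to cache writes is ignored.

```python
def input_cost(fresh_tokens, cache_read_tokens, input_price_per_m, cache_discount):
    """Input-side cost in USD, pricing cache reads at a discounted rate."""
    cache_price_per_m = input_price_per_m * (1 - cache_discount)
    return (
        fresh_tokens * input_price_per_m
        + cache_read_tokens * cache_price_per_m
    ) / 1_000_000


PRICE_PER_M = 3.00   # hypothetical input price
DISCOUNT = 0.90      # hypothetical cache-read discount
total_input = 10_000_000
cached = 8_000_000   # portion of the input served from cache

# Baseline that wrongly prices every input token at the full rate.
naive_baseline = input_cost(total_input, 0, PRICE_PER_M, DISCOUNT)
# What is actually billed while the cache is working.
actual = input_cost(total_input - cached, cached, PRICE_PER_M, DISCOUNT)
# Same traffic after a deploy breaks the cached prefix.
after_cache_break = input_cost(total_input, 0, PRICE_PER_M, DISCOUNT)

print(f"naive ${naive_baseline:.2f}, actual ${actual:.2f}, "
      f"after cache break ${after_cache_break:.2f}")
```

Under these assumptions the naive baseline is several times the real bill, and losing the cache moves the real bill up to that naive figure without a single extra request.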
Reasoning tokens: billed, but invisible
Reasoning models generate a chain of internal "thinking" before they produce a visible answer. Those reasoning tokens are billed like output tokens but never shown to the user. On hard prompts they can dominate the cost of a request — sometimes several times the size of the visible answer. Teams that adopt a reasoning model and only budget for the output they can see are routinely surprised by the invoice.
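A short sketch shows how far a visible-output budget can drift from the billed amount. The token counts and the output price are invented for illustration, and billed_output_cost is a hypothetical helper.

```python
def billed_output_cost(visible_tokens, reasoning_tokens, output_price_per_m):
    """Reasoning tokens are charged at the output rate alongside visible ones."""
    return (visible_tokens + reasoning_tokens) * output_price_per_m / 1_000_000


OUTPUT_PRICE_PER_M = 15.00  # hypothetical output price
visible, reasoning = 300, 2_400

budgeted = billed_output_cost(visible, 0, OUTPUT_PRICE_PER_M)  # what you can see
billed = billed_output_cost(visible, reasoning, OUTPUT_PRICE_PER_M)

print(f"budgeted ${budgeted:.4f}, billed ${billed:.4f}, "
      f"ratio {billed / budgeted:.1f}x")
```

The practical step is to read the reasoning-token count from the usage data the provider returns and budget on that, not on the length of the answer.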
The other modifiers
- Batch APIs. Most providers offer an asynchronous batch mode at a discount (often around half price) for work that can tolerate delay — evals, backfills, bulk summaries.
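- Context window. A larger context window is not billed for its own sake; the cost comes from filling it, because every token you send is an input token. Some providers also charge a higher per-token rate once a request passes a long-context threshold. Bigger is not free.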
- Multimodal. Images, audio, and video are converted into tokens too, and a single image or a minute of audio can amount to hundreds or thousands of tokens, so multimodal costs can climb faster than text.
- Built-in tools and retrieval. Provider-hosted tools, web search, and file search often carry their own per-call charges on top of token cost.
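Two of these modifiers, the batch discount and per-call tool fees, are simple adjustments on top of token cost. The sketch below assumes a 50% batch discount and a one-cent fee per tool call, both placeholders, and assumes the discount applies only to token cost and not to tool fees, which may not hold for every provider.

```python
def job_cost(token_cost_usd, batch_discount=0.0, tool_calls=0,
             fee_per_tool_call_usd=0.0):
    """Token cost after any batch discount, plus flat per-call tool fees."""
    return token_cost_usd * (1 - batch_discount) + tool_calls * fee_per_tool_call_usd


backfill_tokens_usd = 200.00  # hypothetical token cost of a bulk job

sync = job_cost(backfill_tokens_usd)
batched = job_cost(backfill_tokens_usd, batch_discount=0.50)
with_search = job_cost(backfill_tokens_usd, tool_calls=1_000,
                       fee_per_tool_call_usd=0.01)

print(f"sync ${sync:.2f}, batched ${batched:.2f}, "
      f"sync with search ${with_search:.2f}")
```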
The per-request math
Put together, the cost of a single request is roughly:
(input tokens × the model's input price) + (cache-read tokens × its discounted cache-read price) + (output tokens, including any reasoning tokens, × its output price), with all three prices taken from the specific model that handled the request.
Your monthly invoice is that calculation summed across every request, every model, every day. The reason invoices feel unpredictable is that all five variables — volume, model mix, prompt size, output length, and cache-hit rate — move independently, and a single code change can shift any of them without touching the others.
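The per-request calculation and the invoice sum can be written out as a short sketch. The model names and per-million prices are placeholders, the ModelPrices and Usage types are hypothetical, and tool fees, batch discounts, and cache-write charges are left out to keep the example small.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPrices:
    input_per_m: float       # USD per million fresh input tokens
    cache_read_per_m: float  # USD per million cache-read tokens
    output_per_m: float      # USD per million output tokens


@dataclass(frozen=True)
class Usage:
    model: str
    fresh_input_tokens: int
    cache_read_tokens: int
    output_tokens: int  # visible plus reasoning tokens


# Hypothetical price table; real numbers come from the provider's price list.
PRICES = {
    "small": ModelPrices(0.50, 0.05, 2.00),
    "flagship": ModelPrices(3.00, 0.30, 15.00),
}


def request_cost(usage: Usage, prices=PRICES) -> float:
    """Each token type is priced at the rate of the model that served it."""
    p = prices[usage.model]
    return (
        usage.fresh_input_tokens * p.input_per_m
        + usage.cache_read_tokens * p.cache_read_per_m
        + usage.output_tokens * p.output_per_m
    ) / 1_000_000


def invoice(usages) -> float:
    """The invoice is the per-request cost summed over every request."""
    return sum(request_cost(u) for u in usages)


log = [
    Usage("flagship", fresh_input_tokens=2_000, cache_read_tokens=8_000,
          output_tokens=1_000),
    Usage("small", fresh_input_tokens=1_000, cache_read_tokens=0,
          output_tokens=200),
]
for u in log:
    print(f"{u.model}: ${request_cost(u):.4f}")
print(f"total: ${invoice(log):.4f}")
```

Each of the five variables in the text maps to something in this sketch: volume is the length of the log, model mix is the model field, and prompt size, cache-hit rate, and output length are the three token counts.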
Why this matters for budgeting
Because providers bill per token and per model, you cannot control AI cost from a dashboard that only shows total spend. You control it by knowing which features drive which token shapes on which models — that is cost attribution — and then pulling the levers that the pricing model exposes: route to cheaper models, cache stable prefixes, compress prompts, and batch the work that can wait. The invoice is just the sum of those choices.
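Cost attribution, in its simplest form, is a group-by over per-request costs. This sketch assumes each request has already been tagged with a feature name and priced; the record layout, feature names, and dollar amounts are all hypothetical.

```python
from collections import defaultdict


def attribute(records):
    """Sum spend per (feature, model) pair, largest first."""
    totals = defaultdict(float)
    for record in records:
        totals[(record["feature"], record["model"])] += record["cost_usd"]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical per-request cost records, tagged at call time.
records = [
    {"feature": "support-chat", "model": "flagship", "cost_usd": 0.0234},
    {"feature": "support-chat", "model": "flagship", "cost_usd": 0.0310},
    {"feature": "search-summary", "model": "small", "cost_usd": 0.0009},
]

for (feature, model), spend in attribute(records):
    print(f"{feature:<16} {model:<10} ${spend:.4f}")
```

The tagging has to happen when the request is made, since a provider invoice typically does not know which product feature triggered a call.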
Related
- Input, output, cache, and reasoning tokens — what each token type costs and why.
- Why does my LLM bill spike? — the common causes of sudden cost growth.
- What is LLM cost attribution? — tracing spend back to an owner.
- Model routing — the biggest lever the pricing model exposes.