Cost per request as a product KPI

Product teams already track latency p95 and error rate per endpoint. They argue about them in sprint reviews, write them into PRD acceptance criteria, and page on them in production. Cost-per-request belongs in exactly the same conversation. It is the third leg of the product-quality tripod, and the one that almost every AI product team currently treats as a finance footnote rather than a feature-owner metric. Treating it as a KPI is what turns AI spend from a monthly surprise into something a product manager can plan against, allocate, and report on.

This page is about the mechanics of putting cost-per-request into the same documents and rituals product teams already use for performance. It is not a savings argument. The point is ownership, not minimization. A feature owner who knows what their endpoint costs per call can defend it, allocate it to the right cohort, and report it back to finance with a straight face. A feature owner who does not know is one provider price change away from a margin event nobody can localize.

Why cost-per-request needs to be a KPI, not a footnote

The reason latency and error rate became KPIs is that they are per-request, per-endpoint, and have an obvious owner. Anyone who can improve latency can also degrade it, so the metric naturally finds an accountable team. Cost-per-request has the same shape. Every model call has a fully-loaded dollar number attached to it, that number is determined by code that some team owns, and changes to that code move the number up or down in measurable ways. Everything required to treat it as a first-class product metric is already there. What is missing is the convention of writing it down next to the other two.

The cost of leaving it out is that AI spend stays attributed to a generic platform line nobody defends. Finance gets a monthly invoice and asks engineering what changed. Engineering points at recent launches but cannot say which one moved the number, by how much, or for which cohort. The conversation ends with a vague commitment to "watch it," which is what teams say when no metric exists. Adding cost-per-request to the product KPI set replaces that conversation with a variance review.

Defining it without ambiguity

A KPI that is not defined precisely will be measured inconsistently and argued about forever. Cost-per-request, fully loaded, is the sum of the provider-billed cost of every token attributable to a single user-visible request. That includes input tokens, output tokens, cache-read tokens at their discounted rate, cache-write tokens at whatever rate the provider bills for them (some providers charge a premium over the base input rate), reasoning tokens when the model emits them, and the output tokens of any tool calls the request triggered. It is priced at the rate the provider actually billed, not at the rate the team wishes they had been billed. It is reported per endpoint, per feature, and per cohort, not as a global average.

Three definitional traps are worth naming. Cache-read tokens are cheaper than input tokens but they are not free, and a cost total that excludes them will drift from the invoice. Reasoning tokens on models that emit them can dwarf the visible output and must be counted. Tool-call output tokens are billable work performed on behalf of the originating request and belong in that request's number, not in a separate tool-owner bucket. Get those three right and the KPI will reconcile against the provider invoice to within rounding.
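The definition can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the rates in RATES_PER_MTOK are made-up placeholders, the Usage field names are hypothetical, and it assumes the provider reports reasoning tokens separately from output tokens (if they are already folded into the output count, drop the reasoning entry to avoid double counting).

```python
from dataclasses import dataclass
from decimal import Decimal

# Hypothetical USD rates per one million tokens. Replace these with the
# rates that actually appear on the provider invoice.
RATES_PER_MTOK = {
    "input": Decimal("3.00"),
    "output": Decimal("15.00"),
    "cache_read": Decimal("0.30"),   # discounted, but not free
    "cache_write": Decimal("3.75"),
    "reasoning": Decimal("15.00"),
}


@dataclass(frozen=True)
class Usage:
    """Token counts for one model call (field names are illustrative)."""
    input: int = 0
    output: int = 0        # assumed to exclude reasoning tokens
    cache_read: int = 0
    cache_write: int = 0
    reasoning: int = 0


def call_cost(usage: Usage) -> Decimal:
    """Cost of a single model call across every billable token class."""
    total = Decimal(0)
    for kind, rate in RATES_PER_MTOK.items():
        total += Decimal(getattr(usage, kind)) * rate / Decimal(1_000_000)
    return total


def request_cost(calls: list[Usage]) -> Decimal:
    """Fully loaded cost of one user-visible request.

    `calls` holds the primary model call plus every call made by tools
    that the request triggered, so tool work lands in this request's number.
    """
    return sum((call_cost(c) for c in calls), Decimal(0))
```

For example, request_cost([Usage(input=1000, output=500, cache_read=10000, reasoning=2000), Usage(input=200, output=100)]) adds the tool call's cost to the originating request instead of reporting it under a separate owner.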

Where it goes in product docs

A KPI that lives only in a dashboard is not yet a KPI. It becomes one when it appears in the documents and rituals that gate product work.

In the PRD, cost-per-request belongs in the acceptance criteria next to latency and error rate. A launch-ready feature has a stated target cost-per-request at the cohorts the feature is shipping to, and a stated ceiling above which the launch is blocked pending review. The target is set by the feature owner in consultation with the team that owns the budget line, not handed down by finance.

In the release readiness checklist, cost-per-request gets a row. A release that increases the metric by more than an agreed threshold against the previous release is not automatically blocked, but it requires a written justification from the feature owner explaining what changed and which cohort is now carrying the increase. Quality improvements that legitimately cost more are fine; quality regressions that also cost more are not.
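The PRD ceiling and the release-checklist threshold can be expressed as one small gate. The sketch below is an assumption about how a team might encode them: CostCriteria, Gate, and evaluate_release are hypothetical names, and the 10% threshold in the usage note is a placeholder for whatever number the team agrees to enforce.

```python
from dataclasses import dataclass
from enum import Enum


class Gate(Enum):
    PASS = "pass"
    NEEDS_JUSTIFICATION = "needs_justification"
    BLOCKED_PENDING_REVIEW = "blocked_pending_review"


@dataclass(frozen=True)
class CostCriteria:
    """Cost acceptance criteria for one feature and cohort (illustrative)."""
    target_usd: float            # stated target; reported, not enforced here
    ceiling_usd: float           # above this the launch is blocked pending review
    max_release_increase: float  # e.g. 0.10 means +10% versus the previous release


def evaluate_release(criteria: CostCriteria,
                     previous_usd: float,
                     current_usd: float) -> Gate:
    """Apply the ceiling first, then the release-over-release variance check."""
    if current_usd > criteria.ceiling_usd:
        return Gate.BLOCKED_PENDING_REVIEW
    if previous_usd > 0:
        increase = (current_usd - previous_usd) / previous_usd
        if increase > criteria.max_release_increase:
            # Not blocked, but the feature owner owes a written justification.
            return Gate.NEEDS_JUSTIFICATION
    return Gate.PASS
```

With CostCriteria(target_usd=0.03, ceiling_usd=0.05, max_release_increase=0.10), a release that moves the metric from 0.030 to 0.040 is flagged for justification, and one that reaches 0.060 is blocked pending review.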

In the sprint review, cost-per-request sits next to the other endpoint metrics. The feature owner walks the team through the week's movement: which endpoints moved, which releases drove the movement, which cohorts were affected. The conversation is the same shape as the latency review the team already runs.

In the on-call runbook, a sustained spike in cost-per-request is an incident class. It does not always require a page, but it requires a triage path: who looks at it first, what data they pull, what the rollback criteria are. Cost incidents that get treated as finance problems get found weeks late; cost incidents that get treated as product incidents get found in hours.
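What counts as a "sustained spike" has to be written down before it can be triaged. One possible rule, sketched here under assumed parameters, is that the last few measurement windows all exceed the baseline by some factor. The function name, the 1.5x factor, and the three-window minimum are illustrative choices, not values taken from any particular runbook.

```python
def sustained_spike(window_costs: list[float],
                    baseline: float,
                    factor: float = 1.5,
                    min_windows: int = 3) -> bool:
    """True when the most recent `min_windows` windows all exceed baseline * factor.

    `window_costs` is the mean cost-per-request for consecutive windows
    (for example, hourly), oldest first. A single noisy window does not
    trip the rule; only a run of elevated windows does.
    """
    if len(window_costs) < min_windows:
        return False
    limit = baseline * factor
    return all(cost > limit for cost in window_costs[-min_windows:])
```

A series such as [0.04, 0.07, 0.07, 0.08] against a baseline of 0.04 trips the rule, while [0.04, 0.09, 0.04, 0.04] does not, because the elevated window was not sustained.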

The attribution layer that has to exist first

Cost-per-request as a KPI presupposes a working attribution layer. Every model call has to carry, at minimum, the endpoint it served, the feature tag the endpoint belongs to, the team that owns the feature, and the cohort or tenant the request was made on behalf of. Without those tags the metric is a global average, and a global average has no owner. With them, the metric rolls up per feature for the owning team and per cohort for the customer-success conversation.
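One way to make the four tags non-optional is to refuse to record a model call without them. The sketch below assumes a hypothetical Attribution record and a record_call helper that writes to an in-memory list; a real system would enforce the same check at the gateway or SDK and write to its own metrics store.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Attribution:
    """The minimum tags every model call carries (names are illustrative)."""
    endpoint: str
    feature: str
    team: str
    cohort: str

    def __post_init__(self) -> None:
        # Reject missing or blank tags at construction time, so an
        # untagged call can never reach the cost rollup.
        for name in ("endpoint", "feature", "team", "cohort"):
            value = getattr(self, name)
            if not isinstance(value, str) or not value.strip():
                raise ValueError(f"missing attribution tag: {name}")


def record_call(tags: Attribution, cost_usd: float, sink: list[dict]) -> None:
    """Append one tagged cost record to `sink` (a stand-in for a metrics store)."""
    sink.append({**asdict(tags), "cost_usd": cost_usd})
```

Because validation happens when the Attribution is built, a call with a blank cohort fails loudly in development instead of silently diluting a global average in production.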

Teams that try to add the KPI before the attribution layer end up with a number that nobody trusts and nobody defends. The order is fixed: tagging first, then the dashboard, then the document changes. Skipping the first step produces a metric that finance reports and product ignores, which is the failure mode this whole exercise is designed to prevent.

What a healthy review looks like

A weekly cost-per-request review has the same rhythm as a weekly latency review. The dashboard shows the metric per endpoint over the last seven days, broken out by cohort, with the previous week's value as the comparison. The feature owner identifies the endpoints that moved by more than the review threshold and walks the team through why. Releases that shipped during the window are cross-referenced against the movement. Cohort shifts — a large customer ramping up, a new tier launching, a free-tier promo driving low-value traffic — are called out by name. The review ends with either a clean bill of health or a written follow-up owned by a specific engineer.
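The "moved by more than the review threshold" step is a simple week-over-week comparison. The following is a sketch under assumptions: the input dictionaries keyed by (endpoint, cohort) and the 10% default threshold are hypothetical, and pairs with no prior-week baseline are skipped here rather than flagged.

```python
def weekly_movers(this_week: dict[tuple[str, str], float],
                  last_week: dict[tuple[str, str], float],
                  threshold: float = 0.10) -> list[tuple[tuple[str, str], float]]:
    """Return (endpoint, cohort) pairs whose cost-per-request moved past the threshold.

    Both inputs map (endpoint, cohort) to mean cost-per-request for the week.
    The result is sorted so the largest relative movement comes first.
    """
    movers = []
    for key, current in this_week.items():
        previous = last_week.get(key)
        if previous is None or previous <= 0:
            continue  # no baseline to compare against; review these separately
        change = (current - previous) / previous
        if abs(change) > threshold:
            movers.append((key, change))
    return sorted(movers, key=lambda item: abs(item[1]), reverse=True)
```

The output is the agenda for the review: each entry is an endpoint and cohort that the feature owner needs to explain, with the releases in the window cross-referenced by hand or by a later join.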

The review is not a finance ritual. It is run by the product team, attended by the feature owners, and minuted in the same place as the latency review. Finance gets the output, not the process. That separation is what keeps the KPI a product metric rather than a budget metric.

Common patterns we see go wrong

Averaging across cohorts is the most common failure. A handful of high-volume customers can pull the global average so far from the median that the metric stops describing the typical request. Reporting cost-per-request without a cohort breakdown is reporting a number that protects the team from the conversation the number is supposed to start.
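A small worked example shows how far the global average can sit from the typical request. The figures used in the usage note are invented for illustration, and cost_summary is a hypothetical helper name.

```python
from collections import defaultdict
from statistics import mean, median


def cost_summary(requests):
    """Summarize per-request costs globally and per cohort.

    `requests` is an iterable of (cohort, cost_usd) pairs, one per request.
    Returns (global_stats, per_cohort_stats), each with a mean and a median.
    """
    by_cohort = defaultdict(list)
    for cohort, cost in requests:
        by_cohort[cohort].append(cost)

    all_costs = [cost for costs in by_cohort.values() for cost in costs]
    global_stats = {"mean": mean(all_costs), "median": median(all_costs)}
    per_cohort = {
        cohort: {"mean": mean(costs), "median": median(costs)}
        for cohort, costs in by_cohort.items()
    }
    return global_stats, per_cohort
```

With eight free-tier requests at $0.01 and two enterprise requests at $0.50, the global mean is about $0.108 while the median is $0.01: the average describes neither cohort, and only the breakdown shows which one is carrying the cost.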

Leaving the cache-read tax out of the cost total is the second. Teams that exclude cache-read tokens from their cost calculation get a number that looks great in the dashboard and fails to reconcile against the invoice. The fix is straightforward: cache-read tokens count, at the discounted rate the provider actually billed.

Tracking cost-per-request only in finance, not in product, is the third. A metric that lives in a finance dashboard but not in a PRD is a reporting artifact, not a KPI. Product owners who do not see the number weekly will not move it, and finance reporting it back to them monthly is a lagging signal nobody can act on.

Chasing the metric without tracking quality is the fourth. Cost-per-request without a paired quality metric — task success, user thumbs-up, downstream conversion — invites the team to optimize the number by degrading the product. The KPI works only when it is reviewed alongside the quality signal it serves.

The chargeback hook

Once cost-per-request is owned per feature, monthly chargeback becomes arithmetic. Count the requests in the period, multiply by the per-feature cost-per-request, and the team that owns the feature gets a line on their budget that matches what the invoice says. Per-tenant chargeback works the same way: group the same requests by tenant and sum each tenant's cost across the features it used.
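The arithmetic can be sketched directly from tagged per-request records. This is an illustration under assumptions: the record keys (team, feature, tenant, cost_usd) are hypothetical, and costs are held as Decimal so the totals can be compared against an invoice without floating-point drift.

```python
from collections import defaultdict
from decimal import Decimal


def chargeback(records):
    """Roll tagged per-request costs up into per-feature and per-tenant lines.

    Each record is a dict with "team", "feature", "tenant", and a Decimal
    "cost_usd". Returns (by_feature, by_tenant).
    """
    by_feature = defaultdict(lambda: {"requests": 0, "total_usd": Decimal(0)})
    by_tenant = defaultdict(Decimal)  # Decimal() is zero

    for record in records:
        line = by_feature[(record["team"], record["feature"])]
        line["requests"] += 1
        line["total_usd"] += record["cost_usd"]
        by_tenant[record["tenant"]] += record["cost_usd"]

    # requests * cost_per_request equals total_usd (up to Decimal rounding),
    # which is the chargeback line for the owning team.
    for line in by_feature.values():
        line["cost_per_request"] = line["total_usd"] / line["requests"]

    return dict(by_feature), dict(by_tenant)
```

Summing per-request costs and multiplying the request count by the mean cost-per-request give the same total, so either form can back the budget line; the per-request form is easier to audit at the row level.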

Without the KPI, chargeback is a guess. Finance allocates AI spend across teams using whatever proxy is available — headcount, last quarter's split, the loudest team's protest — and engineering treats the resulting numbers as fiction. With the KPI, the allocation is defensible at the row level and the conversation moves from "is this number fair" to "is this number worth the feature it bought."

A short rollout plan

Week one is tagging. Every model call carries feature, team, endpoint, and cohort, enforced at the gateway or the SDK. No new dashboards yet; the goal is to get the data clean before anyone looks at it.

Week two is the dashboard. Cost-per-request per endpoint, per feature, per cohort, with weekly and monthly views. Reconcile the rollup against the provider invoice and resolve any gap before the dashboard is shown to anyone outside the team.
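The reconciliation step can be as simple as comparing the dashboard rollup with the invoice total and refusing to publish when the gap is too large. The sketch below assumes a 0.5% tolerance, which is a placeholder; the function names are hypothetical.

```python
from decimal import Decimal


def reconciliation_gap(rollup_usd: Decimal, invoice_usd: Decimal) -> Decimal:
    """Relative gap between the dashboard rollup and the provider invoice.

    Negative means the rollup is missing cost that the invoice contains,
    for example uncounted cache-read or reasoning tokens.
    """
    if invoice_usd <= 0:
        raise ValueError("invoice total must be positive")
    return (rollup_usd - invoice_usd) / invoice_usd


def reconciles(rollup_usd: Decimal,
               invoice_usd: Decimal,
               tolerance: Decimal = Decimal("0.005")) -> bool:
    """True when the rollup is within `tolerance` of the invoice (assumed 0.5%)."""
    return abs(reconciliation_gap(rollup_usd, invoice_usd)) <= tolerance
```

A rollup of $998.00 against a $1,000.00 invoice passes at this tolerance; a rollup of $900.00 does not, and the missing 10% is the first thing to explain before anyone outside the team sees the dashboard.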

Week three is the document work. Add the KPI to the PRD template, the release readiness checklist, and the on-call runbook. Pick a threshold for the release variance check that the team will actually enforce, not one that sounds rigorous but gets waived.

Week four is the review cycle. Cost-per-request is on the agenda of the weekly product review, owned by the feature owners, reported up to finance as a monthly rollup rather than a monthly surprise. By the end of the month the metric is part of how the team ships, and the chargeback conversation has somewhere to land.
