Multimodal cost allocation

Most LLM allocation models were built when AI spend meant text in, text out. A single token-price table, a tag per request, and a monthly chargeback report were enough. Once image inputs, image generation, audio transcription, and realtime voice sessions enter the same product, that model stops working. Modalities have different unit economics, different ownership patterns, and different abuse profiles, and a flat "AI cost" line on the engineering ledger hides all of it. This page is about how to allocate multimodal spend so each team sees what it actually consumes, and finance has a defensible chargeback story.

Why one unit doesn't fit every modality

Text spend is tidy: input tokens, output tokens, cached input tokens, all priced per million. Image and audio do not flatten into the same unit cleanly. An image input is billed by tile count derived from resolution and detail mode. Image generation is billed per image, but the price varies by quality tier and output size. Realtime audio is billed per minute of session, with separate input and output rates and sometimes a different rate for voice vs transcript. Trying to express all of that as a single "tokens" column on a chargeback report either rounds away real differences or invents a synthetic unit that nobody trusts.

The practical move is to keep modality-native units in the ledger and only convert to dollars at the allocation layer. Store image-tile counts on image rows, audio seconds on audio rows, and tokens on text rows. The unit price comes from a per-modality price book, not from a global token rate.
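As a minimal sketch of that split, the snippet below keeps quantities in modality-native units on each ledger row and converts to dollars only through a per-modality price book. The names `PRICE_BOOK`, `UsageRow`, and `cost_usd`, the unit labels, and every price are illustrative assumptions, not real provider rates.

```python
from dataclasses import dataclass
from decimal import Decimal

# Hypothetical price book: (modality, unit) -> USD per unit.
# All prices are placeholders, not real provider rates.
PRICE_BOOK = {
    ("text", "input_token"): Decimal("0.000003"),
    ("text", "output_token"): Decimal("0.000015"),
    ("image_input", "tile"): Decimal("0.0002"),
    ("audio", "second"): Decimal("0.0001"),
}


@dataclass(frozen=True)
class UsageRow:
    """One ledger row, stored in the modality's native unit."""
    team: str
    modality: str
    unit: str
    quantity: int


def cost_usd(row: UsageRow) -> Decimal:
    """Convert a ledger row to dollars at the allocation layer."""
    key = (row.modality, row.unit)
    if key not in PRICE_BOOK:
        # Fail loudly instead of falling back to a global token rate.
        raise KeyError(f"no price for {row.modality}/{row.unit}")
    return PRICE_BOOK[key] * row.quantity
```

Because the ledger never stores a synthetic unit, a price change only touches the price book, and historical rows can be repriced without rewriting usage data.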

The minimum tag set for multimodal attribution

Single-tenant text systems can often get by tagging only team and feature. Multimodal raises the dimensionality. The tag set that holds up under chargeback review usually includes:

If any of these tags are missing at the call site, the allocation engine has to guess, and finance ends up arguing about the guess every month.
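One way to stop the guessing is to reject or quarantine untagged calls before they reach the allocation engine. The sketch below assumes a hypothetical required set (`team`, `feature`, `modality`, `environment`); those names are placeholders chosen for illustration and should be replaced with whatever tag set finance has actually agreed to.

```python
# Hypothetical required tags; substitute the organization's agreed tag set.
REQUIRED_TAGS = ("team", "feature", "modality", "environment")


def missing_tags(tags: dict[str, str]) -> list[str]:
    """Return the required tag names that are absent or blank on a call."""
    return [name for name in REQUIRED_TAGS if not tags.get(name, "").strip()]
```

A call site could log the result and route the request's cost to a named unallocated bucket whenever the returned list is non-empty.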

Shared image service vs team-owned generation

Image generation is the most common shared service in a mature AI org. One platform team owns a "make an image" endpoint that wraps prompt safety, brand checks, content moderation, and a small bench of providers. Marketing, product, support, and the docs team all call it. The convenient thing about a shared service is also the dangerous thing: by default, the cost lives on the platform team's budget and no caller sees what they cost.

Two allocation patterns work here. The first is full passthrough: the shared service measures the underlying provider cost per call, adds a small operational margin if finance allows it, and books the line to the calling team's cost center. The platform team's own ledger only carries the infrastructure and a small unallocated pool for failed or moderated requests. The second is rate-card showback: the platform team publishes a fixed internal price per quality tier (e.g. "standard image: $0.02, HD image: $0.08") and bills callers against that card. Real provider cost lives on the platform team, and any spread between card and actual is the platform team's variance.

Passthrough is more accurate; rate cards are more predictable. Most orgs end up with a hybrid: a rate card for the common path so product teams can budget, and passthrough for anything outside the card (custom resolutions, experimental providers, on-demand large batches).
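The hybrid can be sketched as a single billing function: calls on a published tier are charged the card price, and everything else passes through at provider cost. The tier prices reuse the example figures above; the function name, the tier keys, and the omission of any operational margin are assumptions made for brevity.

```python
from decimal import Decimal
from typing import Optional, Tuple

# Internal rate card, using the example prices from the text.
RATE_CARD = {"standard": Decimal("0.02"), "hd": Decimal("0.08")}


def bill_image_call(tier: Optional[str], provider_cost: Decimal) -> Tuple[Decimal, Decimal]:
    """Return (amount charged to the caller, platform-team variance).

    Variance is card price minus provider cost: positive means the
    platform team keeps a surplus, negative means it absorbs a shortfall.
    """
    if tier in RATE_CARD:
        charged = RATE_CARD[tier]
        return charged, charged - provider_cost
    # Off-card request (custom resolution, experimental provider, and so on):
    # pass the measured provider cost straight through, with no variance.
    return provider_cost, Decimal("0")
```

For example, an HD image that cost the provider $0.07 is billed at $0.08 with a $0.01 variance on the platform team, while an off-card request that cost $0.30 is billed at exactly $0.30.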

Realtime audio sessions as a quota category

Realtime voice is the modality most likely to break a naive allocation model. A single unattended session left open in a kiosk or background tab can run for hours and produce a five-figure surprise. The cost driver is wall-clock session minutes, not request count, so per-request budgets are useless and per-team monthly caps are too coarse.

Treat realtime audio as its own quota category, separate from text and from non-realtime audio. Allocate it with three controls layered together. First, a per-session ceiling on duration, enforced server-side, with a hard cut and a clear user-facing message. Second, a per-user or per-tenant daily minute budget, with overflow charged to a designated cost center rather than silently absorbed. Third, a team-level monthly minute budget that finance reviews like any other capacity line.

This three-layer model is what makes realtime allocatable. Without it, realtime sits on a single shared invoice line and the team that runs the voice agent pays for everyone else's stuck sessions.
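A rough sketch of the three layers evaluated together might look like the following. The class and function names, the action labels, and the choice to return a list of actions are all assumptions for illustration; the limits themselves would come from whatever finance and the owning team agree on.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RealtimeLimits:
    session_max_minutes: float    # layer 1: hard per-session ceiling
    user_daily_minutes: float     # layer 2: per-user or per-tenant daily budget
    team_monthly_minutes: float   # layer 3: team-level monthly budget


def evaluate(limits: RealtimeLimits,
             session_minutes: float,
             user_minutes_today: float,
             team_minutes_month: float) -> List[str]:
    """Return the actions triggered by the current usage counters."""
    actions = []
    if session_minutes >= limits.session_max_minutes:
        # Enforced server-side: end the session with a user-facing message.
        actions.append("cut_session")
    if user_minutes_today > limits.user_daily_minutes:
        # Overflow is booked to a designated cost center, not absorbed.
        actions.append("charge_overflow_cost_center")
    if team_minutes_month > limits.team_monthly_minutes:
        # Monthly budget is a capacity line that finance reviews.
        actions.append("flag_for_finance_review")
    return actions
```

Only the first layer cuts traffic in this sketch; the other two change where the cost lands and who gets notified, which mirrors the distinction drawn above between a hard ceiling and budget lines.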

Mapping modalities to cost centers

Cost centers in finance systems usually predate AI by years. They are functional: Sales, Marketing, Product, Support, R&D, Trust & Safety, Internal Tools. Modalities map to those cost centers differently than text does, and the mapping is worth making explicit.

The discipline is to write this mapping down once, get finance to sign off on it, and then enforce it in the allocation engine. Surprise re-mappings mid-quarter are how chargeback programs lose internal credibility.

Building the chargeback report

A defensible multimodal chargeback report has a few properties. It shows each team's spend broken out by modality, not just a single AI total. It separates production from eval and internal-tool traffic so launch months do not punish product teams. It lists the top surfaces by spend within each modality, so engineering leads can see which features are driving their line. And it reconciles to the provider invoice within a tolerance the finance team has agreed to up front (typically one or two percent), with a named unallocated bucket for the gap.

The unallocated bucket is non-negotiable. Without it, every rounding error and untagged request becomes a fight. With it, the conversation moves to shrinking the bucket over time rather than disputing individual lines.
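The reconciliation step can be sketched as below: the gap between the provider invoice and the sum of allocated lines is written to an explicit `unallocated` row, and the report is flagged when that gap exceeds the agreed tolerance. The function name, the report shape, and the 2% default are assumptions chosen to match the range mentioned above.

```python
from decimal import Decimal
from typing import Dict, Tuple


def reconcile(invoice_total: Decimal,
              allocated: Dict[str, Decimal],
              tolerance: Decimal = Decimal("0.02")) -> Tuple[Dict[str, Decimal], bool]:
    """Return (report with an 'unallocated' line, whether the gap is within tolerance)."""
    allocated_total = sum(allocated.values(), Decimal("0"))
    unallocated = invoice_total - allocated_total

    # Express the gap as a fraction of the invoice; guard against a zero invoice.
    gap_ratio = abs(unallocated) / invoice_total if invoice_total else Decimal("0")

    report = dict(allocated)
    report["unallocated"] = unallocated
    return report, gap_ratio <= tolerance
```

With a $1,000 invoice and $990 allocated, the report carries a $10 unallocated line and passes a 2% tolerance; with only $900 allocated, the same call returns `False` and the gap becomes a tagging problem to chase rather than a line-item dispute.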

Common failure modes

The patterns that quietly destroy multimodal allocation are usually procedural, not technical. A platform team adds a new modality and forgets to extend the tag schema, so three months of spend land in an "other" bucket. A product team A/B tests a higher image-quality tier and the experiment runs to 100% before anyone updates the rate card. A realtime feature ships without per-session ceilings and a single tenant burns through the team's monthly budget in a weekend. A support tool starts calling vision inputs at high detail because the prompt template changed, and the cost center sees a 4x lift with no product change.

Each of these is preventable with a tagging review, a rate-card review, and a per-modality budget review on the same cadence finance uses for everything else. Multimodal is not unmanageable; it just refuses to be managed as if it were text with a different label.
