OpenTelemetry GenAI conventions for cost attribution

OpenTelemetry GenAI semantic conventions define a standard vocabulary of attributes for tracing and metering LLM calls. They enable observability backends, cost attribution engines, and billing systems to speak the same language about which model was invoked, how many tokens were consumed, and whether the tokens came from cache or live inference. Without a standard schema, each vendor's tracing layer invents its own attribute names, forcing teams to maintain translation maps for every new tool they adopt.

The GenAI conventions are published and versioned by the OpenTelemetry specification maintainers. Adopting them means your traces will interoperate across Langfuse, Helicone, LiteLLM, custom collectors, and future platforms without remodeling or migration.

Core attributes for cost tracking

Why a standard schema powers attribution

Cost attribution collapses without schema consistency. A standard set of attributes means:

Enriching spans with business metadata

Cost attribution requires more than LLM attributes. Instrument spans with:

Attach these in a middleware or instrumentation layer so they flow into every span without repeating in application code.

Common pitfalls

Sampling out the wrong spans. Sampling for observability performance is standard. But if your sampler drops spans from high-cost customers or excludes fallback attempts, your cost attribution will be off. Use a separate cost-aware sampler that preserves all LLM calls even if it samples other telemetry.

Losing exact model versions. If you record only "gpt-4" without the version (e.g., "gpt-4-turbo-2024-04-09"), pricing is ambiguous. OpenAI retired gpt-4-32k and replaced it with a cheaper variant under the same name. Always capture full model identifiers.

Conflating cache reads with normal input. A span with 10k cache-read tokens and 500 new input tokens has very different cost implications than 10.5k normal input. If your aggregation adds them together, you overestimate input cost and underestimate cache value.

Skipping fallback annotations. If a request tried GPT-4 and fell back to GPT-3.5, both should appear in the trace. A single span with only the fallback model hides the extra cost of the first attempt and skews per-model percentages.

Applying the rule

Cost data is only as trustworthy as the schema it flows through. Standardize on GenAI semantic conventions early; migration after the fact is expensive.

Related


Want this applied to your own LLM spend? FinOps LLM runs a free audit of your AI costs and shows where the savings are. Book free audit →

Back to research