OpenTelemetry GenAI conventions for cost attribution
OpenTelemetry GenAI semantic conventions define a standard vocabulary of attributes for tracing and metering LLM calls. They enable observability backends, cost attribution engines, and billing systems to speak the same language about which model was invoked, how many tokens were consumed, and whether the tokens came from cache or live inference. Without a standard schema, each vendor's tracing layer invents its own attribute names, forcing teams to maintain translation maps for every new tool they adopt.
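The GenAI conventions are published and versioned as part of the OpenTelemetry semantic conventions. They are still evolving, and attribute names have changed between releases, so pin the version you emit. To the extent that Langfuse, Helicone, LiteLLM, custom collectors, and future platforms support the same version, your traces can move between them with far less remapping than a bespoke schema would require.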
Core attributes for cost tracking
- gen_ai.system. The LLM provider (e.g., "openai", "anthropic", "aws.bedrock"). Joins spans to price tables keyed by vendor. Newer releases of the conventions rename this attribute to gen_ai.provider.name, so check the version you pin.
- gen_ai.request.model and gen_ai.response.model. Request model (what you asked for) and response model (what you got). For routed or fallback scenarios, these may differ. Essential for pricing because models within the same vendor carry different rates.
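- gen_ai.usage.input_tokens and gen_ai.usage.output_tokens. Integer counts of input (prompt) and output (completion) tokens for the call. Store the counts as integers and do the price arithmetic in decimal; binary floats accumulate rounding errors in cost math at scale.
- gen_ai.usage.cache_read_tokens and gen_ai.usage.cache_creation_tokens. Tokens read from cache (discounted by roughly 50% at OpenAI and roughly 90% at Anthropic, depending on model) and tokens written to cache, which some providers bill at a premium. The exact names of the cache attributes depend on the conventions version and the instrumentation library, so confirm them against what your SDK emits. Conflating these with normal input tokens underestimates the value of caching.
- gen_ai.operation.name. The kind of operation performed, taken from a small set of well-known values (e.g., "chat", "embeddings", "text_completion"). Helps group spans by function in cost reports.
- gen_ai.response.id and gen_ai.conversation.id. The provider's identifier for the completion and the conversation/session identifier. Required for chargeback and for tracking multi-turn cost per conversation.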
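As a rough illustration of how these attributes feed a price calculation, here is a minimal sketch. The price table, the model name, and the `span_cost` helper are hypothetical, the rates are placeholders rather than real vendor prices, and the sketch assumes that the input token count excludes cached tokens (providers differ on this).

```python
from decimal import Decimal

# Hypothetical USD prices per 1M tokens, keyed by (gen_ai.system, model).
# Placeholder numbers, not real vendor rates.
PRICES = {
    ("openai", "example-model-2024-01-01"): {
        "input": Decimal("10.00"),
        "cache_read": Decimal("5.00"),
        "cache_creation": Decimal("10.00"),
        "output": Decimal("30.00"),
    },
}

PER_MILLION = Decimal(1_000_000)


def span_cost(attrs: dict) -> Decimal:
    """Price one LLM span from its gen_ai.* attributes."""
    # Price by the model that served the call; fall back to the requested one.
    model = attrs.get("gen_ai.response.model") or attrs["gen_ai.request.model"]
    rates = PRICES[(attrs["gen_ai.system"], model)]
    # Assumption: input_tokens does NOT include cached tokens.
    tokens = {
        "input": attrs.get("gen_ai.usage.input_tokens", 0),
        "cache_read": attrs.get("gen_ai.usage.cache_read_tokens", 0),
        "cache_creation": attrs.get("gen_ai.usage.cache_creation_tokens", 0),
        "output": attrs.get("gen_ai.usage.output_tokens", 0),
    }
    # Integer counts times decimal rates; no binary floats anywhere.
    total = sum((Decimal(n) * rates[kind] for kind, n in tokens.items()), Decimal(0))
    return total / PER_MILLION
```

Keeping the token counts as integers and the rates as `Decimal` means the only rounding that happens is the rounding you choose to apply at invoice time.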
Why a standard schema powers attribution
Cost attribution collapses without schema consistency. A standard set of attributes means:
- Vendor-neutral joins. Price a span from OpenAI, Anthropic, or Bedrock through the same code path: only the gen_ai.system value and the matching price table row change. No re-parsing.
- Observability backend swaps. Move from Langfuse to Helicone without losing cost granularity, provided both ingest the same attribute names.
- Chargeback without bespoke math. Customer and feature tags propagate consistently across all downstream systems. A billing system that understands gen_ai.usage.input_tokens and gen_ai.system doesn't need special code per team.
- Cache and fallback handling. Separate token types and model mismatch detection prevent the most common cost-math errors.
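A small sketch of what the vendor-neutral join and the model mismatch check might look like once every span carries the same attribute names. The helper names `rollup_tokens` and `was_rerouted` are hypothetical, and spans are modeled as plain dictionaries of attributes rather than any particular SDK's span type.

```python
from collections import defaultdict


def rollup_tokens(spans: list[dict]) -> dict:
    """Sum token counts per (provider, served model) across spans from any vendor."""
    totals = defaultdict(lambda: {"input": 0, "output": 0, "calls": 0})
    for attrs in spans:
        key = (attrs["gen_ai.system"], attrs["gen_ai.response.model"])
        totals[key]["input"] += attrs.get("gen_ai.usage.input_tokens", 0)
        totals[key]["output"] += attrs.get("gen_ai.usage.output_tokens", 0)
        totals[key]["calls"] += 1
    return dict(totals)


def was_rerouted(attrs: dict) -> bool:
    """True when the model that answered differs from the one requested.

    Note: this also fires when an alias resolves to a dated snapshot,
    so treat it as a signal to inspect, not proof of a fallback.
    """
    return attrs["gen_ai.request.model"] != attrs["gen_ai.response.model"]
```

The same two functions work for every provider because nothing in them is vendor-specific; the provider is just part of the grouping key.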
Enriching spans with business metadata
Cost attribution requires more than LLM attributes. Instrument spans with:
- Feature name. Which product capability or workflow incurred this cost. Enables cost-per-feature and identifies expensive user paths.
- Account or tenant ID. Required for SaaS chargeback and per-customer profitability. Apply privacy-aware scrubbing; never log the full customer name.
- Plan or tier. Segment cost by customer segment (free, pro, enterprise) to detect tier-specific issues.
- Environment. Production vs. staging vs. eval. Eval and testing spend should never bill customers.
- User or request ID. Required for detailed per-user cost analysis and for debugging cost outliers.
Attach these in a middleware or instrumentation layer so they flow into every span without repeating in application code.
Common pitfalls
Sampling out the wrong spans. Sampling for observability performance is standard. But if your sampler drops spans from high-cost customers or excludes fallback attempts, your cost attribution will be off. Use a separate cost-aware sampler that preserves all LLM calls even if it samples other telemetry.
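The decision rule for such a sampler can be sketched as follows. This is a simplified illustration, not an OpenTelemetry sampler implementation: `keep_span` is a hypothetical function, and it assumes the span's attributes are already known when the decision is made, which in practice usually means tail sampling in a collector rather than head sampling in the SDK.

```python
import zlib


def keep_span(trace_id: str, attrs: dict, sample_rate: float = 0.1) -> bool:
    """Keep every LLM span; sample everything else deterministically by trace ID."""
    if any(key.startswith("gen_ai.") for key in attrs):
        return True  # cost-bearing span: never drop it
    # Stable bucket in [0, 10000) so the same trace always gets the same decision.
    bucket = zlib.crc32(trace_id.encode("utf-8")) % 10_000
    return bucket < sample_rate * 10_000
```

Hashing the trace ID instead of drawing a random number keeps the decision consistent across services, so a sampled trace is not left with holes.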
Losing exact model versions. If you record only an alias such as "gpt-4-turbo" instead of the dated snapshot (e.g., "gpt-4-turbo-2024-04-09"), pricing is ambiguous. Providers can repoint an alias to a newer snapshot or retire a model outright, so the same short name may map to different rates over time. Always capture full model identifiers.
Conflating cache reads with normal input. A span with 10k cache-read tokens and 500 new input tokens has very different cost implications than 10.5k normal input. If your aggregation adds them together, you overestimate input cost and underestimate cache value.
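To make the size of the error concrete, here is the arithmetic for that span under two hypothetical rates: $10 per 1M input tokens and a 50% discount on cache reads. Both numbers are assumptions chosen for round results, not quoted vendor prices.

```python
from decimal import Decimal

INPUT_RATE = Decimal("10") / Decimal(1_000_000)  # hypothetical: $10 per 1M input tokens
CACHE_READ_RATE = INPUT_RATE * Decimal("0.5")    # hypothetical: 50% cache-read discount

cache_read_tokens, new_input_tokens = 10_000, 500

# Correct: price each token type at its own rate.
separate = cache_read_tokens * CACHE_READ_RATE + new_input_tokens * INPUT_RATE

# Wrong: treat all 10.5k tokens as normal input.
conflated = (cache_read_tokens + new_input_tokens) * INPUT_RATE

overestimate = conflated - separate
```

Under these assumptions the correct input cost is $0.055 and the conflated figure is $0.105, so the aggregation nearly doubles the reported input cost for this span. With a 90% discount the gap would be wider still.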
Skipping fallback annotations. If a request tried GPT-4 and fell back to GPT-3.5, both should appear in the trace. A single span with only the fallback model hides the extra cost of the first attempt and skews per-model percentages.
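A minimal sketch of per-model accounting when each attempt is recorded as its own span. The `cost_usd` field is a hypothetical, precomputed per-span cost, and `cost_by_model` is an illustrative helper, not part of any library.

```python
from collections import defaultdict
from decimal import Decimal


def cost_by_model(attempts: list[dict]) -> dict:
    """Sum cost per model across every attempt in a trace, failed or not."""
    totals = defaultdict(Decimal)  # Decimal() is zero
    for attempt in attempts:
        # A failed attempt may have no response model; fall back to the requested one.
        model = attempt.get("gen_ai.response.model") or attempt["gen_ai.request.model"]
        totals[model] += attempt["cost_usd"]  # hypothetical precomputed field
    return dict(totals)
```

Summing the returned values gives the true cost of the request, while the per-model split shows how much of it the abandoned first attempt consumed.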
Applying the rule
Cost data is only as trustworthy as the schema it flows through. Standardize on GenAI semantic conventions early; migration after the fact is expensive.
Related
- LLM cost monitoring — operational metrics and alerting.
- LLM cost attribution — chargeback models and segmentation.
- LLM token tracking — usage analytics and forecasting.
- AI observability — tracing, metrics, and cost signals.