AI observability for production teams
AI observability is the discipline of seeing what your production LLM workloads actually do — in cost, traffic, and quality terms — on a single timeline. It is not just tracing prompts, and it is not just dashboards over the provider invoice. It is the join of the two, with enough metadata to act on what you see.
Most teams already have pieces of it: an LLM tracing tool like Phoenix, Langfuse, or Helicone for request inspection, plus a billing dashboard inside each provider console. The gap is between them. Observability becomes useful when one trace can be linked to one line on the invoice and one product feature.
What AI observability needs to cover
Cost signal
Per request: input tokens, output tokens, cache-read tokens, model, provider, estimated cost in cents. Cache-read tokens belong in their own column because their pricing is materially different from normal input tokens.
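As a rough illustration of why cache-read tokens need their own column, the sketch below estimates per-request cost with three separate rates. The provider name, model name, and every price in the table are hypothetical placeholders, not real list prices; `Usage` and `estimate_cost_cents` are names invented for this example.

```python
from dataclasses import dataclass

# Hypothetical prices in cents per million tokens. Real values come from
# each provider's published price list and change over time.
PRICES_CENTS_PER_MTOK = {
    ("example-provider", "example-model"): {
        "input": 300.0,
        "cache_read": 30.0,
        "output": 1500.0,
    },
}


@dataclass(frozen=True)
class Usage:
    provider: str
    model: str
    input_tokens: int       # uncached input tokens only
    cache_read_tokens: int  # input tokens served from the prompt cache
    output_tokens: int


def estimate_cost_cents(usage: Usage) -> float:
    """Estimate request cost in cents, pricing each token class separately."""
    price = PRICES_CENTS_PER_MTOK[(usage.provider, usage.model)]
    total = (
        usage.input_tokens * price["input"]
        + usage.cache_read_tokens * price["cache_read"]
        + usage.output_tokens * price["output"]
    )
    return total / 1_000_000
```

If cache-read tokens were folded into the input column, this estimate would overstate the cost of cache-heavy workloads by whatever the gap between the two rates happens to be.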
Traffic signal
Volume by feature, environment, customer or workspace, and team. Without these tags every spike looks the same.
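A minimal sketch of counting request volume by any combination of tags, assuming each request is available as a flat dictionary of string tags. The function name `volume_by` and the "untagged" bucket are illustrative choices, not part of any particular tool.

```python
from collections import Counter
from typing import Iterable, Mapping


def volume_by(requests: Iterable[Mapping[str, str]], *tag_names: str) -> Counter:
    """Count requests grouped by the given tag names.

    Requests that lack a tag fall into an explicit "untagged" bucket so
    that gaps in attribution stay visible instead of disappearing.
    """
    counts: Counter = Counter()
    for request in requests:
        key = tuple(request.get(name) or "untagged" for name in tag_names)
        counts[key] += 1
    return counts
```

Calling `volume_by(requests, "feature", "environment")` returns counts keyed by `(feature, environment)` tuples, so a spike can be traced to one feature in one environment rather than to the model as a whole.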
Quality signal
Latency, error rate, retries, refusal rate, and a quality proxy (a rubric, eval, or downstream success metric) per workload. A cost drop with a quality drop is not a win.
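The rule that a cost drop paired with a quality drop is not a win can be written as a small check. This is a sketch under assumptions: `WorkloadSnapshot`, `is_real_saving`, and the default `max_quality_drop` threshold are hypothetical, and the quality score stands in for whatever proxy a team actually uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkloadSnapshot:
    cost_cents: float
    quality_score: float  # higher is better, e.g. an eval pass rate in [0, 1]


def is_real_saving(
    before: WorkloadSnapshot,
    after: WorkloadSnapshot,
    max_quality_drop: float = 0.01,  # assumed tolerance, tune per workload
) -> bool:
    """Return True only if cost fell and quality held within the tolerance."""
    cost_dropped = after.cost_cents < before.cost_cents
    quality_held = after.quality_score >= before.quality_score - max_quality_drop
    return cost_dropped and quality_held
```

A check like this only works when cost and quality are recorded per workload over the same period, which is why both signals are listed together here.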
Reconciliation signal
A monthly view that compares internal estimates against the actual provider invoice, with the delta explained.
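One way to picture the monthly comparison is a per-provider table of estimate, invoice, and delta. The sketch below assumes both totals are already summed per provider in cents; `reconcile`, `ReconciliationRow`, and the 2% default tolerance are illustrative, and the code only computes the delta, it does not explain it.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ReconciliationRow:
    provider: str
    estimated_cents: float
    invoiced_cents: float
    delta_cents: float       # invoice minus estimate
    within_tolerance: bool


def reconcile(
    estimated: Dict[str, float],
    invoiced: Dict[str, float],
    tolerance_pct: float = 2.0,  # assumed tolerance, set by the team
) -> List[ReconciliationRow]:
    """Compare internal estimates with invoiced totals, provider by provider."""
    rows = []
    for provider in sorted(set(estimated) | set(invoiced)):
        est = estimated.get(provider, 0.0)
        inv = invoiced.get(provider, 0.0)
        delta = inv - est
        # Tolerance is measured relative to the invoice, the source of truth.
        allowed = abs(inv) * tolerance_pct / 100.0
        rows.append(ReconciliationRow(provider, est, inv, delta, abs(delta) <= allowed))
    return rows
```

Rows that fall outside the tolerance are the ones that need a written explanation, for example a price change, untracked traffic, or a token class that was priced at the wrong rate.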
How AI observability differs from LLM observability
LLM observability tools focus on the request and the model. AI observability adds the spend layer and the product layer. The same trace must answer questions for three audiences: an engineer debugging a regression, a product manager debugging a feature, and a finance partner debugging a line item.
Common observability anti-patterns
- Tagging requests only with the model name. The model is not the unit of accountability; the feature is.
- Storing prompts and outputs in the same store as cost data. Doing so creates a data-handling burden and rarely pays off.
- Estimating cost without reconciling to the invoice. Estimates drift; only reconciliation tells you by how much.
- One dashboard for engineering and a different one for finance. They will disagree, and the disagreement will not get resolved.
What good looks like
- Every request carries: feature, environment, workspace, model, provider, input tokens, output tokens, cache-read tokens, latency, status, estimated cost.
- Cost and quality live on the same timeline. Spikes are explainable in both dimensions.
- Anomaly alerts route to the owning team, not a generic channel.
- The monthly close reconciles estimate to invoice within a known tolerance.
- Engineering, product, and finance look at the same view.
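The per-request fields listed above can be written down as a single record type. This is a sketch, not a standard schema: the field names follow the list, while the units (`latency_ms`, `estimated_cost_cents`), the type choices, and the `missing_fields` helper are assumptions made for the example.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RequestRecord:
    # Attribution tags
    feature: str
    environment: str
    workspace: str
    # What was called
    model: str
    provider: str
    # Usage
    input_tokens: int
    output_tokens: int
    cache_read_tokens: int
    # Outcome
    latency_ms: float
    status: str
    estimated_cost_cents: float


def missing_fields(raw: dict) -> list[str]:
    """List record fields that are absent or empty in a raw event.

    A value of zero counts as present, so a request with no cache reads
    is not reported as incomplete.
    """
    return [f.name for f in fields(RequestRecord) if raw.get(f.name) in (None, "")]
```

Running `missing_fields` over incoming events gives a quick measure of how much traffic is still unattributed before any dashboard is built on top of it.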
Where to start
Start with attribution, not dashboards. Until requests are tagged with feature and workspace, no dashboard will tell you something you did not already know. Once tagging is in, the rest follows: anomalies, chargeback, forecasting, and routing decisions all build on the same labelled stream.
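One possible way to get tags onto every request without threading them through each function call is to set them once at the entry point and read them where the request is logged. The sketch below uses Python's standard `contextvars` module; the names `attribution` and `record_request` are hypothetical, and a real system would send the record to its telemetry pipeline rather than return it.

```python
import contextvars
from contextlib import contextmanager

# Holds the attribution tags for the current request context.
# The default dict is never mutated; each update builds a new dict.
_tags: contextvars.ContextVar = contextvars.ContextVar("attribution_tags", default={})


@contextmanager
def attribution(**tags: str):
    """Attach tags (e.g. feature, workspace) to everything inside the block."""
    token = _tags.set({**_tags.get(), **tags})
    try:
        yield
    finally:
        # Restore whatever tags were active before this block.
        _tags.reset(token)


def record_request(usage: dict) -> dict:
    """Merge the active attribution tags into a usage record."""
    return {**_tags.get(), **usage}
```

A handler would wrap its work in `with attribution(feature="search", workspace="acme"):` and every `record_request` call made inside that block would carry both tags. Nested blocks add to or override the outer tags and restore them on exit.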