AI observability for production teams

AI observability is the discipline of seeing what your production LLM workloads actually do, in cost, traffic, and quality terms, on a single timeline. It is not just tracing prompts, and it is not just dashboards over the provider invoice. It is the join of the two, with enough metadata to act on what you see.

Most teams already have pieces of it: an LLM tracing tool like Phoenix, Langfuse, or Helicone for request inspection, plus a billing dashboard inside each provider console. The gap is between them. Observability becomes useful when one trace can be linked to one line on the invoice and one product feature.

What AI observability needs to cover

Cost signal

Per request: input tokens, output tokens, cache-read tokens, model, provider, estimated cost in cents. Cache-read tokens belong in their own column because their pricing is materially different from normal input tokens.
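
As a minimal sketch of why cache-read tokens get their own column, the estimate below prices them separately from normal input tokens. The provider name, model name, and per-million-token prices are placeholders chosen for illustration, not real rates; substitute your provider's published price sheet.

```python
from dataclasses import dataclass

# Hypothetical prices in cents per 1M tokens. These are NOT real rates.
PRICES_CENTS_PER_MTOK = {
    ("example-provider", "example-model"): {
        "input": 300.0,
        "cache_read": 30.0,
        "output": 1500.0,
    },
}


@dataclass(frozen=True)
class Usage:
    provider: str
    model: str
    input_tokens: int       # uncached input tokens only
    cache_read_tokens: int  # input tokens served from the prompt cache
    output_tokens: int


def estimate_cost_cents(usage: Usage) -> float:
    """Estimate one request's cost in cents, pricing each token class separately."""
    price = PRICES_CENTS_PER_MTOK[(usage.provider, usage.model)]
    total = (
        usage.input_tokens * price["input"]
        + usage.cache_read_tokens * price["cache_read"]
        + usage.output_tokens * price["output"]
    )
    return total / 1_000_000
```

If cache-read tokens were folded into the input column, the same request would be priced at the full input rate and the estimate would drift away from the invoice.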

Traffic signal

Volume by feature, environment, customer or workspace, and team. Without these tags every spike looks the same.
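
One way to picture the traffic signal is a count of requests grouped by any combination of tags. The function name, tag keys, and sample values below are illustrative assumptions, and a real pipeline would read from a log or metrics store rather than an in-memory list.

```python
from collections import Counter
from typing import Iterable, Mapping


def volume_by(requests: Iterable[Mapping[str, str]], *tags: str) -> Counter:
    """Count requests per combination of tag values.

    A request that lacks a tag is counted under 'untagged' so the gap stays visible.
    """
    counts: Counter = Counter()
    for req in requests:
        key = tuple(req.get(tag, "untagged") for tag in tags)
        counts[key] += 1
    return counts


# Hypothetical sample traffic.
requests = [
    {"feature": "search", "environment": "prod", "workspace": "acme", "team": "discovery"},
    {"feature": "search", "environment": "prod", "workspace": "acme", "team": "discovery"},
    {"feature": "summarize", "environment": "staging", "workspace": "globex", "team": "docs"},
    {"environment": "prod"},  # no feature tag: this is the spike nobody can explain
]

by_feature_env = volume_by(requests, "feature", "environment")
```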

Quality signal

Latency, error rate, retries, refusal rate, and a quality proxy (a rubric, eval, or downstream success metric) per workload. A cost drop with a quality drop is not a win.
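
The rule that a cost drop with a quality drop is not a win can be sketched as a small classifier over two snapshots of the same workload. The snapshot fields, the labels, and the 0.01 quality tolerance are assumptions for illustration; the quality score stands in for whatever rubric, eval, or downstream metric the workload uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkloadSnapshot:
    cost_cents: float
    quality_score: float  # e.g. eval pass rate in [0, 1]; higher is better


def classify_change(
    before: WorkloadSnapshot,
    after: WorkloadSnapshot,
    quality_tolerance: float = 0.01,
) -> str:
    """Label a change by looking at cost and quality together, never cost alone."""
    cost_down = after.cost_cents < before.cost_cents
    quality_down = after.quality_score < before.quality_score - quality_tolerance
    if cost_down and quality_down:
        return "cheaper_but_worse"
    if quality_down:
        return "regression"
    if cost_down:
        return "win"
    return "neutral"
```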

Reconciliation signal

A monthly view that compares internal estimates against the actual provider invoice, with the delta explained.
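
A monthly reconciliation might look roughly like the sketch below, which compares internal estimates with invoiced totals and reports the delta. Grouping by provider and the 2% tolerance are assumptions, and explaining each delta still takes a human or a further breakdown by model and token class.

```python
from typing import Dict, List, Optional


def reconcile(
    estimated_cents: Dict[str, float],
    invoiced_cents: Dict[str, float],
    tolerance_pct: float = 2.0,
) -> List[dict]:
    """Compare internal estimates to invoiced totals, one row per provider."""
    rows = []
    for provider in sorted(set(estimated_cents) | set(invoiced_cents)):
        est = estimated_cents.get(provider, 0.0)
        inv = invoiced_cents.get(provider, 0.0)
        delta = inv - est
        # Percentage is relative to the invoice; undefined when nothing was invoiced.
        delta_pct: Optional[float] = (delta / inv * 100.0) if inv else None
        rows.append({
            "provider": provider,
            "estimated_cents": est,
            "invoiced_cents": inv,
            "delta_cents": delta,
            "delta_pct": delta_pct,
            "within_tolerance": delta_pct is not None and abs(delta_pct) <= tolerance_pct,
        })
    return rows
```

A provider that appears on the invoice but not in the estimates shows up with an estimate of zero, which is usually the first delta worth explaining.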

How AI observability differs from LLM observability

LLM observability tools focus on the request and the model. AI observability adds the spend layer and the product layer. The same trace must answer questions for three audiences: an engineer debugging a regression, a product manager debugging a feature, and a finance partner debugging a line item.

Common observability anti-patterns

What good looks like

  1. Every request carries: feature, environment, workspace, model, provider, input tokens, output tokens, cache-read tokens, latency, status, estimated cost.
  2. Cost and quality live on the same timeline. Spikes are explainable in both dimensions.
  3. Anomaly alerts route to the owning team, not a generic channel.
  4. The monthly close reconciles estimate to invoice within a known tolerance.
  5. Engineering, product, and finance look at the same view.
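
The field list in item 1 can be written down as a record type, so that a request with a missing field is rejected instead of silently logged. This is a sketch only: the class name, the field names, and the units (milliseconds, cents) are assumptions, not a standard schema.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class RequestRecord:
    feature: str
    environment: str
    workspace: str
    model: str
    provider: str
    input_tokens: int
    output_tokens: int
    cache_read_tokens: int
    latency_ms: float
    status: str
    estimated_cost_cents: float


def missing_fields(raw: dict) -> List[str]:
    """Return the names of required fields that are absent or None in a raw event."""
    return [f.name for f in fields(RequestRecord) if raw.get(f.name) is None]


def to_record(raw: dict) -> RequestRecord:
    """Build a RequestRecord, refusing events that lack any required field."""
    missing = missing_fields(raw)
    if missing:
        raise ValueError(f"request record is missing: {', '.join(missing)}")
    return RequestRecord(**{f.name: raw[f.name] for f in fields(RequestRecord)})
```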

Where to start

Start with attribution, not dashboards. Until requests are tagged with feature and workspace, no dashboard will tell you something you did not already know. Once tagging is in, the rest follows: anomalies, chargeback, forecasting, and routing decisions all build on the same labelled stream.
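
One way to make tagging hard to skip is to put it in the call path: a thin wrapper that will not call the model without a feature and a workspace. Everything here is hypothetical, including the wrapper name, the `call_model` callable (a stand-in for whatever client function you already use), and the in-memory `EVENTS` list, which stands in for a real log or metrics sink.

```python
import time
from typing import Callable, Dict, List

EVENTS: List[Dict[str, object]] = []  # stand-in for a real log or metrics sink


def call_with_attribution(
    call_model: Callable[[str], str],
    prompt: str,
    *,
    feature: str,
    workspace: str,
) -> str:
    """Call the model only when attribution tags are present, and record the outcome."""
    if not feature or not workspace:
        raise ValueError("feature and workspace tags are required")
    started = time.perf_counter()
    status = "error"
    try:
        response = call_model(prompt)
        status = "ok"
        return response
    finally:
        # Runs on success and on failure, so errors are attributed too.
        EVENTS.append({
            "feature": feature,
            "workspace": workspace,
            "status": status,
            "latency_ms": (time.perf_counter() - started) * 1000.0,
        })
```

Keyword-only tags mean a caller cannot pass them positionally by accident, and the resulting events form the labelled stream that later views are built on.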
