Cost per inference is the metric that ages projects

A team ships a customer support agent in March. The pilot runs on 50 customers a day, the prompts are 600 tokens, and the bill is $40 a month. Everyone is delighted. The product takes off. By August it's 5,000 customers a day, the prompts have grown to 1,800 tokens (longer system prompts, more retrieved context), the average conversation is eight turns instead of three, and the monthly bill is $32,000.
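The jump looks less mysterious as a product of three multipliers, none of which is alarming on its own. A back-of-the-envelope check, assuming cost scales linearly with customers, prompt tokens, and turns per conversation (a simplification: it ignores output tokens and the way multi-turn context accumulates):

```python
# Back-of-the-envelope check of the March -> August bill.
# Assumption: cost scales linearly with each factor below.
march_bill = 40.0  # USD per month during the pilot

growth = {
    "customers_per_day": 5_000 / 50,   # 100x
    "prompt_tokens": 1_800 / 600,      # 3x
    "turns_per_conversation": 8 / 3,   # roughly 2.67x
}

multiplier = 1.0
for factor in growth.values():
    multiplier *= factor

august_bill = march_bill * multiplier
print(f"{multiplier:.0f}x -> ${august_bill:,.0f} per month")
# prints: 800x -> $32,000 per month
```

Traffic alone explains 100x. The remaining 8x comes from prompt and conversation growth, which is the part no capacity plan anticipated.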

Nobody noticed the ramp because cost per inference was never a tracked metric. The accuracy dashboard updated daily. The latency dashboard updated daily. The cost dashboard didn't exist.

This is the most common silent failure of GenAI projects, and the reason is structural. In most other ML work, cost is dominated by infrastructure choices made once: GPU hours, server count, batch size. You set the bill at deploy time and it doesn't drift much. With hosted-API GenAI, cost is dominated by usage patterns that drift continuously. Every prompt change, every memory mechanism, every retrieval expansion, every safety filter that adds another LLM call changes the per-call cost, and changes it gradually enough that no single PR review catches the trend.

Cost has to be a first-class metric on day one of the project, not on day one of the budget review. The shape of this is dull but specific.

What to track

Track cost per call. Every API call gets logged with input tokens, output tokens, model, and computed cost. This is one tag on every span. The data sits next to the latency and the response, and it rolls up to a dashboard nobody has to ask for.

Track cost per user-outcome. Cost per call is interesting; cost per resolved-ticket, cost per generated-document, cost per qualified-lead is the unit that matters. Calculate this from the call costs and the outcome events. Set a target. Alert when it drifts.
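A minimal sketch of that calculation, assuming call costs and outcome events are already logged somewhere you can read them back. The `CallRecord` and `OutcomeEvent` types, the `"ticket_resolved"` label, and the 25% tolerance are illustrative names and values, not part of any particular library:

```python
from dataclasses import dataclass


@dataclass
class CallRecord:
    conversation_id: str
    cost_usd: float


@dataclass
class OutcomeEvent:
    conversation_id: str
    kind: str  # hypothetical label, e.g. "ticket_resolved"


def cost_per_outcome(calls, outcomes, kind):
    """Total spend across all calls divided by the number of outcomes of `kind`.

    Calls from conversations that never reached the outcome still count:
    failed attempts are part of what a success costs.
    """
    total_cost = sum(c.cost_usd for c in calls)
    n_outcomes = sum(1 for o in outcomes if o.kind == kind)
    if n_outcomes == 0:
        return None  # no outcomes yet, so the ratio is undefined
    return total_cost / n_outcomes


def drifted(current, target, tolerance=0.25):
    """True when the metric exceeds the target by more than `tolerance` (a fraction)."""
    return current > target * (1 + tolerance)


calls = [CallRecord("a", 0.02), CallRecord("a", 0.03), CallRecord("b", 0.05)]
outcomes = [OutcomeEvent("a", "ticket_resolved")]
per_ticket = cost_per_outcome(calls, outcomes, "ticket_resolved")
if drifted(per_ticket, target=0.05):
    print(f"cost per resolved ticket is ${per_ticket:.2f}, above target")
```

Dividing total spend by successes, rather than averaging only the successful conversations, is a deliberate choice here: conversation "b" resolved nothing but still cost money.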

Set hard budgets per environment. Production has one. Staging has another. Development has a third. Each one has an automated cutoff: when daily spend exceeds the threshold, requests are rate-limited instead of silently allowed. This catches the runaway eval loop, the looping agent, the prompt injection that made the model retry forever. Without it, your incident is a Slack notification at 3am from your finance team.
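One way such a cutoff might look, as an in-process sketch. `DailyBudget` and `BudgetExceeded` are hypothetical names, and the limits are placeholders. A real deployment would keep the counter in shared storage so every worker sees the same total, and the caller decides whether a tripped budget means queueing, throttling, or rejecting:

```python
import datetime as dt


class BudgetExceeded(Exception):
    """Raised when an environment has spent its daily allowance."""


class DailyBudget:
    def __init__(self, limits_usd, today=dt.date.today):
        self.limits_usd = limits_usd  # e.g. {"production": 500.0, "staging": 50.0}
        self._today = today           # injectable so tests can control the date
        self._day = today()
        self._spent = {env: 0.0 for env in limits_usd}

    def _roll_over(self):
        # Reset the counters when the calendar day changes.
        if self._today() != self._day:
            self._day = self._today()
            self._spent = {env: 0.0 for env in self.limits_usd}

    def check(self, env):
        """Call before each request; raises once the day's limit is reached."""
        self._roll_over()
        spent, limit = self._spent[env], self.limits_usd[env]
        if spent >= limit:
            raise BudgetExceeded(f"{env}: spent ${spent:.2f} today, limit ${limit:.2f}")

    def record(self, env, cost_usd):
        """Call after each request with its computed cost."""
        self._roll_over()
        self._spent[env] += cost_usd


budget = DailyBudget({"production": 500.0, "staging": 50.0, "development": 5.0})
budget.check("development")         # passes: nothing spent yet
budget.record("development", 6.20)  # a runaway eval loop burns through the limit
try:
    budget.check("development")
except BudgetExceeded as exc:
    print(f"blocked: {exc}")
```

Because spend is only known after a call returns, the check trips on the first request after the limit is crossed, so the overshoot is bounded by one call's cost per concurrent worker.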

Ship the cheapest model that meets the eval gate. There's a strong cultural pull toward 'just use the most capable model'. It's almost always wrong. Once the eval set exists, run every candidate model against it and ship the cheapest one that clears the bar. The capability ceiling is rarely the binding constraint; the eval set is. Most production work that uses Claude Sonnet would be just as good on Haiku at a fraction of the cost. Most production work on the largest GPT model would be just as good on a smaller variant. The two-line change in the API call is the most leveraged refactor of the year.
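The selection rule itself is a filter followed by a minimum. A sketch, assuming each candidate has already been run against the eval set and its cost measured on that same run. The model names and numbers below are placeholders for illustration, not measurements:

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    model: str
    eval_score: float         # fraction of eval cases passed, 0..1
    cost_per_1k_calls: float  # USD, measured on the same eval run


def cheapest_passing(candidates, gate):
    """Return the lowest-cost candidate whose eval score clears `gate`, or None."""
    passing = [c for c in candidates if c.eval_score >= gate]
    if not passing:
        return None  # nothing clears the bar: fix the system, not the budget
    return min(passing, key=lambda c: c.cost_per_1k_calls)


# Placeholder values, purely illustrative.
candidates = [
    Candidate("large", eval_score=0.96, cost_per_1k_calls=30.0),
    Candidate("medium", eval_score=0.94, cost_per_1k_calls=6.0),
    Candidate("small", eval_score=0.88, cost_per_1k_calls=1.5),
]
choice = cheapest_passing(candidates, gate=0.92)
print(choice.model if choice else "no candidate passes")
# prints: medium
```

The gate is a threshold, not a ranking: "large" scores higher than "medium" here, but once both clear the bar the extra score buys nothing the eval set can see.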

Audit periodically. Once a quarter, sample 100 actual production calls and compute their costs. Compare to the eval set's costs. The two should match within a small factor. When they don't, you've discovered something interesting: the production traffic shape has drifted, or a feature shipped that increased average prompt length, or someone added a tool call you weren't aware of. Cost drift is data drift's underrated cousin.
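The comparison can be as simple as a ratio of means. A sketch, assuming per-call costs from production logs and from the eval run are available as non-empty lists of floats. The function names and the 2x default for "a small factor" are assumptions to tune, not a recommendation:

```python
import random
import statistics


def audit_cost_ratio(production_costs, eval_costs, sample_size=100, seed=0):
    """Mean cost of a random production sample divided by mean eval-set cost."""
    rng = random.Random(seed)  # seeded so the audit is reproducible
    k = min(sample_size, len(production_costs))
    sample = rng.sample(production_costs, k)
    return statistics.mean(sample) / statistics.mean(eval_costs)


def needs_investigation(ratio, max_factor=2.0):
    """True when production and eval costs differ by more than `max_factor` either way."""
    return ratio > max_factor or ratio < 1 / max_factor


# Illustrative inputs: production calls cost about 3x what the eval set predicts.
production = [0.03] * 500
eval_set = [0.01] * 50
ratio = audit_cost_ratio(production, eval_set)
if needs_investigation(ratio):
    print(f"production calls cost {ratio:.1f}x the eval set; find out why")
```

A ratio well below 1 deserves the same attention as one well above it: either way the eval set no longer resembles production traffic.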

Cost is a measured property of the system, not a planning assumption. The team that treats it as planning learns the actual number from a spreadsheet someone in finance has been working on quietly. The team that treats it as measured catches the drift in week four when the trend line first turns upward. The difference between those two teams is whether they're in control of the bill or whether the bill is in control of them.

// The artefact
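# observability/cost.py: every call carries its own cost tag
# Assumes the Anthropic Python SDK. price_in / price_out (USD per million tokens, keyed
# by model) and a metrics client exposing emit(name, value, tags=...) are defined elsewhere.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def call_with_cost(model: str, messages: list[dict], max_tokens: int = 1024) -> tuple[str, float]:
    # max_tokens is required by the Messages API; the default here is an arbitrary choice.
    response = client.messages.create(model=model, max_tokens=max_tokens, messages=messages)
    in_toks, out_toks = response.usage.input_tokens, response.usage.output_tokens
    # Prices are quoted per million tokens, hence the 1e6 divisor.
    cost = (price_in[model] * in_toks + price_out[model] * out_toks) / 1e6
    metrics.emit("inference.cost", cost, tags={"model": model})
    metrics.emit("inference.tokens", in_toks + out_toks, tags={"model": model})
    return response.content[0].text, cost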

Cost-per-call is one extra metric per span. It rolls up to a dashboard, alerts when it drifts, and removes the surprise.