Latency budgets you don't see until production
Latency in AI systems is end-to-end. The user's stopwatch starts on click and stops when they see a response, and the model call is rarely the slowest thing in between. A chat agent that takes 4.8 seconds in production may have a 1.2-second model call buried inside it, with the rest of the time spent elsewhere.
A trace makes the breakdown visible. The model takes 1.2s, but around it there's a vector search (200ms), a cold-cache miss on the user profile lookup (300ms), a tool call to a deprecated REST API (1.6s), a JSON parse-and-validate step that runs synchronously (100ms), and 1.4s lost to a load balancer that wasn't configured for streaming (it buffers the full response before sending). The model is only the third-slowest thing in the stack.
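The arithmetic behind that trace can be checked in a few lines. This is only a sketch: the stage names are labels invented here for illustration, and the durations are the ones quoted in the example above.

```python
# Stage durations in milliseconds, taken from the example trace above.
# The keys are illustrative labels, not names from any tracing tool.
TRACE_MS = {
    "vector_search": 200,
    "profile_lookup": 300,
    "tool_call": 1600,
    "json_validate": 100,
    "model": 1200,
    "lb_buffering": 1400,
}


def summarise(trace_ms: dict[str, int]) -> list[tuple[str, int, float]]:
    """Return (stage, ms, share_of_total) tuples, slowest stage first."""
    total = sum(trace_ms.values())
    ranked = sorted(trace_ms.items(), key=lambda kv: kv[1], reverse=True)
    return [(stage, ms, ms / total) for stage, ms in ranked]


if __name__ == "__main__":
    print(f"total: {sum(TRACE_MS.values())} ms")
    for stage, ms, share in summarise(TRACE_MS):
        print(f"{stage:<15} {ms:>5} ms  {share:6.1%}")
```

The stages sum to 4,800 ms, and the model accounts for a quarter of that.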
This is the standard production-latency pattern. The visible component (the model) gets all the optimisation attention. The invisible components (everything around the model) accumulate quietly and dominate the actual user experience.
Sources of latency
Production systems accumulate latency in three categories.
First, retrieval and lookups. RAG systems do a vector search before every LLM call. Conversational systems do user lookups, conversation history fetches, and permission checks. Each of these is a database round-trip, and round-trips are not free even on warm caches and are far more expensive on cold ones. The first conversation a new user has will be slower than every subsequent one, often by a factor of two or more, because every cache is cold. Production-quality systems pre-warm caches on session start and instrument every database call.
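Pre-warming on session start might look like the sketch below. Everything in it is hypothetical: `WarmableCache`, `slow_lookup`, and the cache keys are stand-ins, and a real system would more likely use Redis or a similar store than an in-process dict. The point is only that the lookups the first turn needs are issued concurrently before the user sends anything.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable


class WarmableCache:
    """In-memory cache with a helper to pre-load entries (illustrative only)."""

    def __init__(self) -> None:
        self._data: dict[str, Any] = {}
        self.misses = 0

    def get(self, key: str, loader: Callable[[], Any]) -> Any:
        if key not in self._data:
            self.misses += 1  # a miss pays the full round-trip
            self._data[key] = loader()
        return self._data[key]

    def warm(self, loaders: dict[str, Callable[[], Any]]) -> None:
        """Run all loaders concurrently and store their results."""
        with ThreadPoolExecutor(max_workers=max(1, len(loaders))) as pool:
            futures = {key: pool.submit(fn) for key, fn in loaders.items()}
            for key, fut in futures.items():
                self._data[key] = fut.result()


def slow_lookup(value: str, delay_s: float = 0.05) -> str:
    """Stand-in for a database round-trip."""
    time.sleep(delay_s)
    return value


def session_loaders(user_id: str) -> dict[str, Callable[[], Any]]:
    # Hypothetical keys; list whatever the first turn actually reads.
    return {
        f"profile:{user_id}": lambda: slow_lookup("profile"),
        f"history:{user_id}": lambda: slow_lookup("history"),
        f"perms:{user_id}": lambda: slow_lookup("perms"),
    }


cache = WarmableCache()
cache.warm(session_loaders("u1"))  # called on session start
# First user turn: this read is served from the warmed cache.
profile = cache.get("profile:u1", lambda: slow_lookup("profile"))
```

Because the loaders run in parallel, warming costs roughly one round-trip of wall-clock time rather than three.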
Second, tool calls. Agent systems chain calls to other services. Each downstream API has its own SLA, and many of them (especially internal ones built without SLA discipline) have unbounded p99s. A single 5-second tail on a single tool call can double the user's perceived latency in a flow like the 4.8-second one above. The mitigation is timeouts: every tool call has a hard timeout, every timeout is logged, and every timeout is treated as a measured event, not a one-off.
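A hard timeout that is also a measured event can be sketched with `asyncio.wait_for` from the standard library. The names `call_tool`, `ToolTimeout`, and `TIMEOUT_COUNTS` are made up for this example, and the counter stands in for whatever metrics client the system already uses.

```python
import asyncio
import logging
import time
from collections import Counter
from typing import Any, Awaitable, Callable

log = logging.getLogger("tools")
TIMEOUT_COUNTS: Counter = Counter()  # stand-in for a real metrics counter


class ToolTimeout(Exception):
    """Raised when a tool call exceeds its hard timeout."""


async def call_tool(
    name: str,
    tool: Callable[..., Awaitable[Any]],
    *args: Any,
    timeout_s: float,
) -> Any:
    """Await tool(*args), cancelling it if it runs longer than timeout_s."""
    t0 = time.perf_counter()
    try:
        return await asyncio.wait_for(tool(*args), timeout=timeout_s)
    except asyncio.TimeoutError:
        elapsed_ms = (time.perf_counter() - t0) * 1000
        TIMEOUT_COUNTS[name] += 1  # every timeout is counted
        log.warning(
            "tool_timeout tool=%s timeout_s=%.2f elapsed_ms=%.0f",
            name, timeout_s, elapsed_ms,
        )
        raise ToolTimeout(name) from None


async def slow_tool(query: str) -> str:
    await asyncio.sleep(1.0)  # stand-in for a slow downstream API
    return f"result for {query}"


async def fast_tool(query: str) -> str:
    await asyncio.sleep(0)
    return f"result for {query}"
```

A caller would catch `ToolTimeout` and fall back (skip the tool, return a partial answer, or retry once), so the tail of one dependency cannot become the tail of the whole response.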
Third, serialisation, validation, and streaming overhead. Response streaming is the most underrated optimisation in modern AI systems. A 4-second response that streams from the first token feels far shorter than a 4-second response that arrives all at once, because the user starts reading almost immediately. Every user-facing LLM call should stream, and every hop between the LLM and the user (load balancer, CDN, edge function) should be configured to pass HTTP streaming through without buffering. Many production systems break this somewhere in the middle, so the response only flushes once the LLM has finished generating.
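A buffering hop shows up clearly if you record time to first chunk separately from total time at the client edge. The sketch below simulates both cases with generators. `measure_stream`, `streamed`, and `buffered` are illustrative names, and the delays are arbitrary.

```python
import time
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class StreamTiming:
    first_chunk_ms: float
    total_ms: float
    chunks: int


def measure_stream(chunks: Iterable[str]) -> StreamTiming:
    """Consume a chunk stream; report time to first chunk and total time."""
    t0 = time.perf_counter()
    first_ms = None
    count = 0
    for _ in chunks:
        if first_ms is None:
            first_ms = (time.perf_counter() - t0) * 1000
        count += 1
    total_ms = (time.perf_counter() - t0) * 1000
    # An empty stream has no first chunk; fall back to the total.
    return StreamTiming(first_ms if first_ms is not None else total_ms,
                        total_ms, count)


def streamed(n: int = 10, delay_s: float = 0.02) -> Iterator[str]:
    """Simulated model output: one token every delay_s seconds."""
    for i in range(n):
        time.sleep(delay_s)
        yield f"tok{i} "


def buffered(n: int = 10, delay_s: float = 0.02) -> Iterator[str]:
    """Simulated intermediary that holds everything until generation ends."""
    yield "".join(streamed(n, delay_s))


if __name__ == "__main__":
    print("streamed:", measure_stream(streamed()))
    print("buffered:", measure_stream(buffered()))
```

Both paths take about the same total time, but the buffered one delivers its first (and only) chunk at the very end. A first-chunk time close to the total time, measured at the client, is a sign that something in the path is buffering.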
The discipline that helps most is the latency budget. Decide, for each user-facing flow, the total time budget (e.g., 3 seconds to first token, 8 seconds to complete response). Allocate the time-to-first-token budget across the components so that the parts sum to the total: model call gets 1.5s, retrieval gets 0.4s, tool calls get 0.8s, overhead gets 0.3s, for 3.0s in all. Instrument each one, and alert when any component exceeds its allocation. This sounds basic, and it almost never gets done.
A second useful discipline: profile p99, not p50. Median latency hides the long tail that defines user experience. If your p50 is 2 seconds and your p99 is 12 seconds, one request in a hundred takes 12 seconds or longer, and the users behind those requests are the ones writing the support tickets that mention slowness.
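To make the gap concrete, here is a small sketch using the nearest-rank definition of a percentile. The sample data is synthetic and chosen to match the numbers above. Real monitoring systems usually estimate percentiles from histograms rather than sorting raw samples.

```python
import math
from statistics import mean


def percentile(samples: list[float], p: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of
    all samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Synthetic latencies in ms: 98 requests at 2s and 2 requests at 12s.
LATENCIES_MS = [2000.0] * 98 + [12000.0] * 2

if __name__ == "__main__":
    print("mean:", mean(LATENCIES_MS))
    print("p50: ", percentile(LATENCIES_MS, 50))
    print("p99: ", percentile(LATENCIES_MS, 99))
```

On this data the mean and the median both sit near 2 seconds, and only the p99 reveals the 12-second requests.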
Latency engineering for AI systems is mostly about the non-AI parts. The model is fast and getting faster. The infrastructure around it is the variable. Teams that ship fast systems treat the model as one budget item among many. Teams that don't treat the model as the whole system, and they discover their actual latency story the day a customer complains.
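    # observability/budget.py: every span is measured against a declared budget
    import functools
    import logging
    import time

    log = logging.getLogger(__name__)

    # `metrics` is assumed to be the application's metrics client, exposing
    # histogram(name, value); import it from wherever your codebase defines it.

    BUDGET_MS = {"retrieval": 400, "model": 1500, "tools": 800, "overhead": 300}

    def timed(stage: str):
        """Record the wrapped call's latency and warn when it exceeds the stage budget."""
        def decorator(fn):
            @functools.wraps(fn)
            def wrapper(*args, **kwargs):
                t0 = time.perf_counter()
                try:
                    return fn(*args, **kwargs)
                finally:
                    # Runs on success and on exceptions, so failed calls are timed too.
                    ms = (time.perf_counter() - t0) * 1000
                    metrics.histogram(f"latency.{stage}", ms)
                    if ms > BUDGET_MS[stage]:
                        log.warning(f"{stage} over budget: {ms:.0f}ms > {BUDGET_MS[stage]}ms")
            return wrapper
        return decorator

Every latency-bearing stage carries its declared budget. Over-budget calls produce a warning, not a 3am incident weeks later.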