Skip to content
7 min

The hardest part of agentic AI is the failure modes

An agent sequences four tool calls to answer a customer question. Each step (deciding which tool to call, and the call succeeding) is reliable about 95 percent of the time end to end. Multiply: 0.95⁴ ≈ 81 percent. Roughly one in five interactions fails somewhere.

Then the harder question: when it fails, what does it do?

Most discussion of agent quality is about the success rate. The agent that picks the right tool 92 percent of the time vs the one that picks it 95 percent of the time. The agent that retrieves the right document 80 percent of the time vs 88 percent of the time. These numbers are the visible part of the iceberg. The waterline is below them, in the design of what the agent does in the 5 to 20 percent of cases where something goes wrong.

A 95 percent agent that fails gracefully is a usable system. A 95 percent agent that fails confidently is unusable, regardless of how good the 95 percent looks on a slide. The agent that books a meeting with the wrong person, or refunds the wrong invoice, or escalates a high-priority case as low-priority, is not '5 percent worse than the perfect version'. It is actively harmful, and the cost of cleaning up after it can erase the value of the 95 percent that worked.

The engineering that matters for production agents is the engineering of the failure modes. There are four shapes that recur.

The four shapes of failure

The confidently wrong tool call. The agent has decided to call get_customer_info but it has the wrong customer ID. The tool returns data, the agent uses the data, the answer is fluent, the answer is wrong. Mitigation: type-check the inputs to tools before execution, validate against known patterns (does the customer ID match a real customer?), and refuse to proceed if validation fails. Refusal is far cheaper than confident hallucination.

The infinite retry. The tool returns an error. The agent decides to call the tool again, possibly with the same input, possibly with a slight variation. Five seconds and 40,000 tokens later, the agent is still trying. Mitigation: hard caps on tool call counts, hard caps on conversation turn counts, hard caps on total token spend per session. Every agent should have a circuit breaker.

The wrong-domain answer. The user asks about something the agent shouldn't be answering (legal advice, medical advice, a question outside the agent's defined scope). The agent obliges fluently. Mitigation: scope the agent narrowly in the system prompt, build refusal into the eval set, and treat scope adherence as a measured metric, not an aspirational goal.

The silent escalation failure. The agent decides this case is too hard and escalates to a human. The escalation message goes to a queue nobody monitors. The customer waits four days. Mitigation: every escalation has a named owner, a tracked SLA, and an alert if it sits unread.

The discipline that catches all of these is observability. Every agent decision (which tool, with what inputs, returning what output) is logged with a structured schema, queryable, sampled into a review queue. The first hundred production interactions get manually reviewed. Patterns emerge: a particular input shape that consistently confuses the tool router, a particular customer segment that the agent struggles with, a particular tool whose error messages the agent does not handle. Each pattern becomes a fix, and each fix becomes a regression test in the eval set.

Agentic AI is not chat-with-extra-steps. It is a software system with non-deterministic components, deployed to act on the world. The right question to ask of any production agent is not 'is it 90 percent accurate?' but 'what does it do at the 10 percent that it doesn't get right?'. The teams that ship reliable agents have answered this question concretely, in writing, before the agent goes live. The teams that haven't, ship something demoable that breaks two weeks later in ways nobody anticipated.

Put simply: the agent's ability to refuse, retry, and escalate is more important than its ability to answer. Build the failure modes first. The success path mostly takes care of itself.

// The artefact
# agents/runtime.py: every agent run sits inside a budget and a validator
def run_with_budget(agent, query: str, budget: Budget) -> AgentResult:
    for step in range(budget.max_steps):
        action = agent.decide(query)
        if not validate_tool_call(action):     # type-check before any side-effect
            return AgentResult.refuse("invalid tool args")
        if budget.tokens_used > budget.max_tokens:
            return AgentResult.escalate("budget exceeded")
        observation = run_tool(action)
        if agent.is_done(observation):
            return AgentResult.answer(observation)
    return AgentResult.escalate("step limit reached")  # never silently loops

A circuit breaker around every agent. Validate before acting, cap the spend, escalate cleanly when stuck. The 5 percent failure path matters more than the 95 percent success path.