A 20-step AI agent fails 64% of the time

Summary

Reliability multiplies, it doesn’t average - an agent that’s 95 percent accurate per step succeeds on a full 20-step run only about 36 percent of the time, because 0.95 to the 20th power is 0.358. The per-step number looks fine right up to the point the chain eats it.
The failure is invisible until late - Latitude’s production write-up shows how a misread at step 3 silently corrupts the steps that reason from it. You don’t see a crash, you see a wrong answer with no obvious cause.
Capability was never the gap - the popular Hacker News plea “less capability, more reliability” names it. Smarter models don’t fix compounding math.
A defined process contains what it can’t prevent - it pauses, escalates, retries, or routes a bad step to a human instead of letting it run.

Here’s the number that should stall an autonomous-agent project before anyone writes code. Take an AI agent that gets each step right 95 percent of the time. That sounds excellent. Hand it a job that runs twenty steps and the whole thing succeeds about 36 percent of the time. Not 95. Thirty-six.

The arithmetic is dull and merciless. Reliability across a chain doesn’t average, it multiplies, and 0.95 to the twentieth power is 0.358. So a 20-step agent fails almost two runs in three, and it usually fails quietly, somewhere in the middle, where nobody’s watching the dashboard. That single fact reframes what work you can safely hand to an agent: the limit isn’t how smart the model is, it’s how long a leash you give it.

This is the part the “point an agent at your whole process” pitch skips. A demo runs eight steps on a clean input and lands every one. Then the real deployment runs twenty steps on messy data, a hundred times a day, and the product of all those near-misses is a coin flip with bad odds. Nothing made the model dumber between the demo and the rollout - the job just got longer.

Length is the enemy.

Run the multiplication before you build

Start with the calculator, not the vendor pitch. Pick a per-step success rate you actually believe, set the number of steps to match your real process, and watch the end-to-end number fall off a cliff. The drop is steeper than intuition says, every single time.

The shape of the curve is the whole lesson. At 99 percent per step - which is wildly optimistic for a model reasoning over open-ended inputs - a 20-step run still only clears about 82 percent. Drop to a realistic 90 percent per step and twenty steps land you near 12 percent. Production agents on genuinely hard tasks often sit below 95 percent per step, not above it, so the figures below are the generous version, not the pessimistic one.
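The multiplication is short enough to run yourself. Here is a minimal Python sketch; the function name `end_to_end_success` is made up for illustration, and it assumes every step succeeds or fails independently at the same rate, which real agents only approximate.

```python
def end_to_end_success(per_step: float, steps: int) -> float:
    """Chance that every step in a chain succeeds.

    Assumes steps are independent and share one success rate.
    """
    if not 0.0 <= per_step <= 1.0:
        raise ValueError("per_step must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step ** steps


if __name__ == "__main__":
    for rate in (0.99, 0.95, 0.90):
        result = end_to_end_success(rate, 20)
        print(f"{rate:.0%} per step, 20 steps: {result:.1%}")
    # 99% per step, 20 steps: 81.8%
    # 95% per step, 20 steps: 35.8%
    # 90% per step, 20 steps: 12.2%
```

Swap in your own step count and a per-step rate you can defend before trusting any of these figures for a real process.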

Why AI needs one defined task

A job is just a bunch of tasks in a row. The interactive calculator on this page has three sliders: tasks in the job, how often AI nails one task, and tries per task with Tallyfy. With the sliders at 10 tasks, 90% per task, and 3 tries per task, it shows:

AI does the whole job alone: 35%. One slip-up anywhere and the whole job fails.

With Tallyfy, one task at a time: 99%. Each task is checked, and tried again if it slips.

90% per task, 10 tasks in a row, is about 35%. A 10-step job done blind is worse than a coin flip.

Play with the retry slider and something useful happens. Add a check that catches a failed step and re-runs it, and the end-to-end number climbs back toward the top. That’s the entire argument in one widget: you don’t win by chasing a perfect model, you win by putting a net under each step. The math that looked hopeless at 36 percent turns survivable the moment something is allowed to notice a miss and act on it.
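The retry effect can be reproduced with one extra line of arithmetic. The sketch below is an assumption about how such a calculator could work, not a description of the widget's actual code: it supposes the check always catches a failed step and that each retry is an independent attempt. The function names are hypothetical.

```python
def step_success_with_retries(per_step: float, tries: int) -> float:
    """Chance a step passes within `tries` attempts.

    Assumes a perfect check after each attempt and independent attempts.
    """
    return 1.0 - (1.0 - per_step) ** tries


def job_success(per_step: float, steps: int, tries: int = 1) -> float:
    """Chance the whole job finishes when every step gets `tries` attempts."""
    return step_success_with_retries(per_step, tries) ** steps


if __name__ == "__main__":
    print(f"10 tasks at 90%, no retries: {job_success(0.90, 10):.0%}")
    print(f"10 tasks at 90%, 3 tries:    {job_success(0.90, 10, tries=3):.0%}")
    # 10 tasks at 90%, no retries: 35%
    # 10 tasks at 90%, 3 tries:    99%
```

Both assumptions are generous. A check that misses some failures, or retries that repeat the same mistake, will pull the second number down.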

So the real question was never how accurate the model is.

What happens on the one step in twenty where it’s wrong?

Why nobody notices until step 7

A single bad step rarely announces itself. It corrupts the input to the next step, which produces a plausible-but-wrong output, which feeds the step after that. By the time the run finishes, the answer is confidently incorrect and the trail back to the root cause is cold. Latitude’s analysis of why agents break in production puts it cleanly: a misinterpretation at step 3 silently corrupts the context that steps 4 through 8 reason from. The damage was done early and stayed hidden.

Picture a procurement agent handed a contract renewal. At step three it reads the terms and quietly misreads a net-60 payment window as net-30. Nothing throws an error. Steps four through nine all behave correctly given that wrong input - they schedule the payment for the wrong date, route it to the wrong approver, and draft a confirmation email that reads perfectly. Nine clean checkmarks, one wrong outcome, and the whole mistake traces back to a single early read that everything downstream simply trusted. Run that pattern across a few hundred renewals a month and you’ve built a quiet error factory that passes every status check it sees.
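A toy version of that renewal makes the point concrete. Everything in this sketch is hypothetical (the function names, the invoice date, the two-step pipeline); it only illustrates that downstream code can be correct and still produce the wrong outcome when it trusts a bad input.

```python
from datetime import date, timedelta


def schedule_payment(invoice_date: date, net_days: int) -> date:
    """Downstream step: the logic is correct, but it trusts net_days."""
    return invoice_date + timedelta(days=net_days)


def draft_confirmation(due: date) -> str:
    """Downstream step: formats whatever date it is handed."""
    return f"Payment scheduled for {due.isoformat()}."


invoice_date = date(2025, 1, 1)
true_terms = 60        # what the contract actually says
extracted_terms = 30   # what the hypothetical agent read at step three

due = schedule_payment(invoice_date, extracted_terms)
message = draft_confirmation(due)
expected_due = schedule_payment(invoice_date, true_terms)

print(message)                                   # reads perfectly
print("days early:", (expected_due - due).days)  # days early: 30
```

No exception is raised anywhere, so a status check that only asks whether each step completed would mark this run green.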

This is what makes compounding failure so nasty compared to a normal bug. A crash is honest - it stops, it leaves a stack trace, you know where to look. A drifted agent run looks like success. It returns something. Somebody has to read the output carefully enough to realize the vendor record it updated was the wrong one, or the refund it calculated used last quarter’s policy. The error rate per step is small and the cost per missed error is large, which is the worst pairing an operations lead can inherit.

So what does that 36 percent actually cost you? The two-in-three runs that fail aren’t loud failures you can alert on. They’re quiet ones you find in an audit, weeks later, after the bad output already moved downstream.

Capability was never the bottleneck

The agent crowd keeps reaching for a bigger model when the problem is structural. Why throw more horsepower at a chain that breaks on coordination, not capability? A widely-read Hacker News thread asked for exactly the opposite - “AI agents: Less capability, more reliability, please” - and the discussion under it is full of engineers who learned the multiplication the hard way. One commenter, photonthug, argued the durable fix is to “build concrete interfaces with specific predefined vocabularies” rather than letting a model improvise its way through an unbounded task. Constrain the surface, and you constrain the ways it can go wrong.
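One way to read "predefined vocabularies" is a closed set of allowed answers that the model's output must match before anything else runs. The sketch below is an interpretation of that idea, not the commenter's code; the `Approver` values and `parse_approver` are invented for illustration.

```python
from enum import Enum


class Approver(Enum):
    """Hypothetical closed vocabulary of people a step may route to."""
    FINANCE = "finance"
    LEGAL = "legal"
    PROCUREMENT = "procurement"


def parse_approver(model_output: str) -> Approver:
    """Accept the model's answer only if it matches the vocabulary."""
    cleaned = model_output.strip().lower()
    try:
        return Approver(cleaned)
    except ValueError:
        raise ValueError(
            f"unrecognized approver {model_output!r}; escalate instead of guessing"
        ) from None


if __name__ == "__main__":
    print(parse_approver(" Finance "))      # Approver.FINANCE
    try:
        parse_approver("whoever is free")
    except ValueError as err:
        print(err)
```

The model can still pick the wrong member of the set, but it can no longer invent a new one, and an off-vocabulary answer becomes a loud failure rather than a quiet one.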

The numbers hold at every chain length, not just twenty. MindStudio’s write-up on multi-agent reliability runs the same product rule at five: “Chain five agents at 95 percent reliability each and your end-to-end success rate collapses to 77 percent.” Five steps already bleed nearly a quarter of your runs. Twenty is a different universe. The lesson generalizes - the more independent steps you string together with no check between them, the closer the whole thing creeps to a gamble.

Five steps already leaks, and twenty hemorrhages.

The counterintuitive part is that the fix has almost nothing to do with the model you pick. You can swap in next year’s smarter model and move from 95 to 97 percent per step, and your 20-step run goes from 36 percent to 54 percent. Better, sure. Still a coin flip you wouldn’t bet payroll on. Autonomous, unsupervised, many-step agents are a dead end for real operations, and no amount of model progress in the near term changes the arithmetic underneath them.
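Turning the formula around shows how far a model alone would have to go. This is a rough sketch under the same independence assumption as the rest of the math here, and `required_per_step` is a name made up for the example.

```python
def required_per_step(target: float, steps: int) -> float:
    """Per-step success rate needed for an unchecked chain to hit `target`.

    Inverts target = per_step ** steps, assuming independent steps.
    """
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be in (0, 1]")
    if steps < 1:
        raise ValueError("steps must be at least 1")
    return target ** (1.0 / steps)


if __name__ == "__main__":
    needed = required_per_step(0.90, 20)
    print(f"Per-step rate for 90% over 20 steps: {needed:.2%}")
    # Per-step rate for 90% over 20 steps: 99.47%
```

Under that assumption, an unchecked 20-step chain needs roughly 99.5 percent per step just to reach 90 percent end to end, a long way past the jump from 95 to 97.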

A process can’t raise 95, but it can contain it

So if a smarter model won’t save you, what does? A defined process, sitting between the agent and your systems, doing the one job the model can’t do for itself: noticing.

A workflow engine doesn’t touch the 95 percent. It can’t make the model more accurate, and it doesn’t try. What it does is wrap each step in a container that decides what happens when the step is wrong: pause and wait for a human, escalate to an owner, retry with the same input, or route around to a fallback. Any of those beats the default behavior of a free-running agent, which is to take its wrong answer and confidently feed it to step eight.
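In code, such a container can be very small. The sketch below is a generic illustration, not Tallyfy's implementation or API: `run_contained`, `StepResult`, and the status strings are all hypothetical, and the `check` callable stands in for whatever validation or human review a real process would use.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class StepResult:
    status: str          # "ok", "fallback", or "escalated"
    value: Any = None
    attempts: int = 0


def run_contained(
    step: Callable[[], Any],
    check: Callable[[Any], bool],
    max_tries: int = 2,
    fallback: Optional[Callable[[], Any]] = None,
) -> StepResult:
    """Run one step and decide what happens when its output fails the check."""
    for attempt in range(1, max_tries + 1):
        try:
            value = step()
        except Exception:
            continue  # a crash counts as a failed attempt; try again
        if check(value):
            return StepResult("ok", value, attempt)
    if fallback is not None:
        value = fallback()
        if check(value):
            return StepResult("fallback", value, max_tries)
    # Nothing passed the check: stop here and hand the step to a person.
    return StepResult("escalated", None, max_tries)


if __name__ == "__main__":
    outputs = iter([-1, 30])  # first attempt is wrong, second is fine
    result = run_contained(lambda: next(outputs), check=lambda v: v > 0)
    print(result.status, result.value, result.attempts)  # ok 30 2
```

The important branch is the last one: when nothing passes, the bad value is dropped and the run stops for a human, so the next step never sees it.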

Containment is the move, not correction. You stop the spread instead of chasing a perfection the model can’t deliver. We built Tallyfy’s automation rules and approval steps around exactly this, because a process that already names every step and owner is a process you can drop a checkpoint into.

Go back to that procurement agent and add one checkpoint. After the agent reads the contract terms, the process pauses for a human to confirm the payment window before anything irreversible moves. The net-60 misread now dies at that gate instead of flowing into six downstream steps, because a person glances at one field and catches what the model fumbled. That’s the whole mechanism. You don’t need the human to do the work - you need them positioned at the one step where a wrong answer turns expensive, doing the cheap thing the model can’t reliably do for itself, which is notice. Add a retry on the validation step and a fallback for the genuinely ambiguous contracts, and the run that was a coin flip becomes something you’d actually let near a payment.

The model still misreads at the same rate, but now the mistake has nowhere to go.

We see the same principle on our own machines, oddly enough. The thing that most reliably catches an AI tool drifting or inventing a fact isn’t the tool re-checking itself - it’s an external gate that runs after it and refuses to pass a bad result through. We’ve also watched what happens with no gate at all: a single confirmation expanding into more than a hundred deletions because nothing in the loop capped how far one action could reach. Same model, wildly different outcome, and the only variable was whether something external was allowed to say no. That’s the blast radius a process contains and a bare agent doesn’t.
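A cap on how far one approval can reach is about the simplest external gate there is. The sketch below is hypothetical throughout (`guarded_delete`, `BlastRadiusExceeded`, and the default limit of five are invented for illustration) and is not a reconstruction of the tooling described above.

```python
from typing import Callable, Sequence


class BlastRadiusExceeded(Exception):
    """Raised when one confirmation would touch more items than allowed."""


def guarded_delete(
    targets: Sequence[str],
    delete_one: Callable[[str], None],
    max_per_confirmation: int = 5,
) -> int:
    """Delete targets only if the batch fits under the cap.

    The size check runs before any deletion, so an oversized batch
    changes nothing at all.
    """
    if len(targets) > max_per_confirmation:
        raise BlastRadiusExceeded(
            f"{len(targets)} deletions requested, limit is {max_per_confirmation}"
        )
    for target in targets:
        delete_one(target)
    return len(targets)


if __name__ == "__main__":
    deleted = []
    guarded_delete(["a.tmp", "b.tmp"], deleted.append)
    try:
        guarded_delete([f"file{i}" for i in range(100)], deleted.append)
    except BlastRadiusExceeded as err:
        print("blocked:", err)
    print(deleted)  # ['a.tmp', 'b.tmp']
```

The gate knows nothing about why the agent wanted a hundred deletions. It only refuses to let a single yes stretch that far.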

Figure: A 20-step agent compounds 95 percent per step to 36 percent overall, versus a workflow that pauses and retries a failed step to hold above 90 percent.

The retry slider you dragged earlier is this idea made literal. A check after a step turns one shaky run into a reliable one, not by making the model better, but by refusing to let a quiet miss become a finished result. This is the practical version of why an AI agent needs a workflow engine: the engine doesn’t supply intelligence, it supplies the structure that keeps the intelligence honest.

Where to point the agent instead

Never hand an agent a twenty-step job and hope. Hand it one step. The reading, the classifying, the drafting - the bounded tasks a model is genuinely good at - and put the high-stakes moves behind a human gate. That’s the difference between binding an agent to a workflow and letting it free-roam, and it’s also why the workflow patterns Anthropic and others converged on all externalize state instead of trusting the model to hold it.

Run the math on whatever you’re about to build before you build it. Count the steps. Be honest about the per-step accuracy. If the end-to-end number scares you, that’s not a reason to find a better model - it’s a reason to chop the job into steps a process can check. The full case for why AI is built for single tasks, not whole jobs follows directly from this one curve.
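The same arithmetic can tell you roughly where to cut. This sketch (with a made-up function name, and again assuming independent steps at one shared rate) counts how many steps can run back to back before the end-to-end number drops below a target you pick.

```python
def max_unchecked_steps(per_step: float, target: float) -> int:
    """Longest run of unchecked steps that stays at or above `target`.

    Assumes independent steps with the same success rate.
    """
    if not 0.0 < per_step < 1.0:
        raise ValueError("per_step must be strictly between 0 and 1")
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be in (0, 1]")
    steps = 0
    success = 1.0
    # Keep adding steps while the chain would still meet the target.
    while success * per_step >= target:
        success *= per_step
        steps += 1
    return steps


if __name__ == "__main__":
    print(max_unchecked_steps(0.95, 0.90))  # 2
    print(max_unchecked_steps(0.99, 0.90))  # 10
```

At 95 percent per step and a 90 percent target, that is two steps between checks, which is one way to put a number on how short the leash needs to be.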

A 95-percent model in a chain of twenty is a 36-percent agent. A 95-percent model running one bounded step inside a process you can see is just a fast, useful step. The intelligence is the same in both. Only one of them survives contact with twenty steps of real work, and it’s the one you wrapped a process around.
