Most multi-agent systems fail - except the two-agent pattern

Summary

Swarms fail at coordination, not intelligence - engineers running multi-agent setups at work describe agents overwriting each other’s context and errors cascading through the herd. The fix they converged on wasn’t a smarter model. It was a smaller team.
Why do pairs survive when swarms don’t? Ownership. With one agent writing and one reviewing, the verdict has exactly one home - petesergeant’s “Claude writes, Codex reviews” harness is the working example - while three-plus setups leave nobody owning the final call.
Anthropic’s engineering guidance points the same way - agents that can check their own output “are fundamentally more reliable,” including having “another language model ‘judge’” the result.
Run the pair as a two-step workflow - step one generates, step two approves or kicks back. Tallyfy ships that second step as an approval today: see the pattern inside a defined workflow.

In February 2026, a Hacker News user called gusmally asked working engineers whether they actually use agent orchestrators to write code, quoting Steve Yegge’s claim - from a Pragmatic Engineer interview - that the top of the AI-coding ladder is “Level 8: you build your own orchestrator to coordinate more agents.” Yegge feels “sorry for people” who merely “use Cursor, ask it questions sometimes, review its code really carefully, and then check it in.”

The replies told a different story. The people running agent fleets day to day weren’t describing level-8 enlightenment. They were describing coordination overhead, context collisions, and a retreat - again and again - to one specific shape: a pair. One agent writes. A second agent reviews, then approves the work or kicks it back.

That pair is the only multi-agent pattern we see reliably making it through production intact, and the reason says more about how operations teams are absorbing AI than about any model’s IQ. It lasts because it’s secretly something much older than agents: a two-step workflow with an approval gate.

Why agent swarms keep dying in production

Read the thread’s war stories back to back and the pattern of failure is strikingly consistent - nobody complains the agents are dumb. They complain the agents can’t share a workspace. A commenter called Aurornis put the economics plainly: as you scale up sub-agents, “you spend so much time managing the herd and trying to backtrack when things go wrong that you would have been better off handling it serially with yourself in the loop.”

Another engineer, _sinelaw_, reported that parallel agents worked on a greenfield project right up until the codebase matured, at which point every feature became cross-cutting and stability mattered - hard to protect with “parallel agents running amok.” A third, jovanaccount, got specific about where the time actually went: “80% of my debugging time wasn’t fixing bad code, but fixing race conditions where agents were overwriting each other’s context.” His fix was a traffic-light protocol he describes as “essentially a semaphore for swarms,” built “to force serialization on critical tasks” - concurrency control, rebuilt by hand, because the swarm shipped without any.
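To make the "semaphore for swarms" idea concrete, here is a minimal sketch of serializing a critical read-modify-write across parallel workers. It is a generic illustration, not a reconstruction of jovanaccount's traffic-light protocol; the names `shared_context`, `green_light`, `critical_update`, and `run_agents` are hypothetical, and threads stand in for agents.

```python
import threading

# Hypothetical shared workspace that several "agents" (threads here) write into.
shared_context = {"revision": 0, "log": []}

# A one-slot semaphore: only one agent holds the green light at a time.
green_light = threading.Semaphore(1)


def critical_update(agent_name: str, rounds: int) -> None:
    """Read-modify-write on shared state, serialized by the semaphore."""
    for _ in range(rounds):
        with green_light:
            current = shared_context["revision"]
            shared_context["log"].append((agent_name, current))
            shared_context["revision"] = current + 1


def run_agents(n_agents: int = 4, rounds: int = 500) -> int:
    """Start n_agents workers and return the final revision number."""
    shared_context["revision"] = 0
    shared_context["log"].clear()
    threads = [
        threading.Thread(target=critical_update, args=(f"agent-{i}", rounds))
        for i in range(n_agents)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared_context["revision"]
```

Because every update happens while holding the single slot, no agent ever builds on a revision another agent has already replaced. The cost is the point of the anecdote: the critical work now runs one agent at a time.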

Notice what those three reports have in common. Coordination is the bottleneck, and coordination overhead grows with every member you add. A swarm of five agents isn’t five times the output of one agent; it’s one output plus four new ways for state to go stale, plus a merge problem nobody assigned to anyone. freakynit, in the same thread, gave the category its bluntest review: “agentic swarms: that’s marketing bs” - softened only by “at least for now.”

Even the cost case collapses on inspection. avaer ran the comparison that swarm vendors prefer you didn’t: people running 15 agents to write software “could probably use 1 or 2 and a better multi-page prompt and have the same results for a fraction of the cost.” And wasmainiac asked the question that hangs over every fleet demo - “How does one even review the code from multiple agents.” Nobody in the thread had a satisfying answer, which is the answer.

We sometimes hear from teams partway through this exact retreat - a swarm pilot that demoed brilliantly, then spent its production budget on untangling what its own members did to each other. The pattern they land on afterward is rarely “no agents.” It’s fewer agents with clearer jobs.

The thing is, none of this is a surprise if you’ve ever managed people.

Five smart contributors with no agreed handoffs and no decision rights produce exactly the same mess, just slower. The swarm didn’t invent a new failure - it inherited the oldest one in management and just got to it faster.

What makes two agents different?

Strip a multi-agent system down to two members with fixed roles and the coordination problem almost disappears.

There’s one handoff. The state moves in one direction, then either ships or comes back. Nothing runs in parallel, so nothing collides. The whole class of failure the swarm crowd spends its debugging budget on - stale context, overwritten state, merge fights - can’t occur in a topology this small, which is a kind of reliability you get free, before anyone tunes a prompt.

The thread’s working example is petesergeant, who found that “‘Claude writes, Codex reviews’ has shown huge promise as a pattern” and packaged it into a small open-source harness called moarcode, whose tagline is honest about the division of labor: “You design, Claude writes, Codex reviews, and Gemini doesn’t get installed.” He spends most of his day inside that loop, admits it still has rough edges, and the payoff he names isn’t speed. It’s trust: he believes the code coming out far more than anything a single model produced alone, which is the entire economic argument for the second agent in one sentence. Another commenter, joshuaisaact, runs the same shape with one model - wipe the context, have a fresh instance review the pull request before a human looks at it. Two hats, one brain, same gate.

Anthropic’s engineering team lands on the identical principle in its guide to building agents with the Claude Agent SDK. The agent loop it recommends is “gather context -> take action -> verify work -> repeat,” and the verification advice is direct: agents that can check and improve their own output “are fundamentally more reliable,” including the option to “have another language model ‘judge’ the output of your agent based on fuzzy rules.”

Why does the second model change so much?

Because generation and judgment are different jobs with different failure modes. A writer in mid-stream is committed to its draft - it autocompleted its way there, and everything in its context says keep going. A reviewer starts cold, holding only the standard and the work. Fresh eyes are cheap to manufacture when the eyes are a model, and the pair has a property no swarm offers: the verdict has exactly one owner.

Nobody owns the verdict in a three-agent system

Add a third agent and watch what happens to accountability. If two reviewers disagree, who decides? If the planner, the writer, and the critic each touched the output, which one answers for the bug that shipped? Every agent you add past two splits the responsibility for the final call into smaller and less useful pieces - until the system produces work that everyone contributed to and nobody approved.

People who run these setups feel the bottleneck precisely. hrishikesh-s, who built his own orchestration tool inside emacs, caps his fleet deliberately: “The sweet-spot is 2-3 agents co-ordinating at a time and me overseeing everything,” because past that, “I quickly become the bottleneck when I review the diffs/plans.” The constraint isn’t compute. It’s that final judgment doesn’t parallelize - somebody, human or model, has to own the yes.

An early mistake we made building Tallyfy: assuming teams would add review steps to their processes on their own. They mostly didn’t - generation feels like progress and checking feels like overhead - so we learned to treat the approval as a first-class step rather than an optional decoration. The HN crowd reached the same place from the other direction. As 0xecro1 put it after routing more tokens into checking than producing: “The review pipeline should be heavier than the generation pipeline.”

Read that twice, because it inverts how most teams budget their AI effort.

Almost everyone spends on the writer - better prompts, bigger context, a stronger model - and treats review as a formality. The practitioners who ship are doing the opposite. They assume the draft is wrong somewhere and fund the step that finds out where. d4rkp4ttern, in the same thread, framed the industry’s hype posts through exactly this lens: when someone brags that AI writes almost all of their code, “the top question I’m curious about is, how much of the AI-written code are they reviewing”? The generation number is marketing. The review number is the system.

Run the pair as a workflow, not a clever prompt

Here’s the part that gets missed when this pattern stays trapped in engineering blogs: writer-plus-reviewer isn’t really an agent architecture. It’s a two-step business process - step one produces, step two approves or returns - and your company already runs dozens of processes with exactly that shape. We walked through the three workflow patterns agents rely on in a separate post; this is the deeper cut on the one multi-agent shape from that family that keeps holding up in production.

[Figure: the two-agent reviewer pattern as a workflow loop. A writer agent drafts; a reviewer approves the work or kicks it back.]

Put a real document through it.

A contract draft needs to go out. Step one: an AI step reads the deal terms and produces the draft. Step two: a reviewer - legal counsel today, maybe a second model checking clause presence and risk language tomorrow - either approves it or kicks it back with reasons. The kick-back isn’t failure; it’s the pattern working, and it’s the part most teams forget to design. Each rejection carries feedback, the writer revises against something specific, and the loop runs until the work clears the bar or hits a retry ceiling and escalates to a person. That ceiling matters more than it looks - we covered what happens when nobody caps the loop - because a reviewer without an escalation path is just a more polite way to burn tokens forever.

One distinction does the heavy lifting here, and it’s worth being precise about it. A reviewer-as-step is not the same as a reviewer-as-prompt. Telling one model “write the draft, then check it carefully” isn’t a gate - it’s the same context grading its own homework, with all the writer’s commitments intact. The pattern only works when the boundary is real: separate context, separate standard, and a handoff the runtime enforces rather than a sentence the model is free to skim. That’s why this belongs in a workflow engine rather than in prompt engineering.
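The loop and the boundary described above can be sketched in a few lines. This is an illustration under stated assumptions, not Tallyfy's implementation or any vendor's API: `run_pair`, `Verdict`, `LoopResult`, and the stub writer and reviewer are hypothetical names, and the stubs stand in for real model calls. The two properties that matter are that the reviewer receives only the standard and the draft, never the writer's context, and that the loop stops at a retry ceiling and escalates.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Verdict:
    approved: bool
    reasons: List[str] = field(default_factory=list)


@dataclass
class LoopResult:
    status: str            # "approved" or "escalated"
    draft: str
    attempts: int
    history: List[Verdict]


# The writer sees the task plus the latest rejection reasons.
Writer = Callable[[str, List[str]], str]
# The reviewer sees only the standard and the draft (separate context).
Reviewer = Callable[[List[str], str], Verdict]


def run_pair(task: str, standard: List[str], write: Writer,
             review: Reviewer, max_attempts: int = 3) -> LoopResult:
    """Generate, review, and loop on rejection until approval or the ceiling."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    feedback: List[str] = []
    history: List[Verdict] = []
    draft = ""
    for attempt in range(1, max_attempts + 1):
        draft = write(task, feedback)
        verdict = review(standard, draft)
        history.append(verdict)
        if verdict.approved:
            return LoopResult("approved", draft, attempt, history)
        feedback = verdict.reasons      # the kick-back carries specifics
    # Ceiling reached: hand the last draft and the full history to a person.
    return LoopResult("escalated", draft, max_attempts, history)


def stub_writer(task: str, feedback: List[str]) -> str:
    """Stand-in for a model call; revises only when told what is missing."""
    draft = f"Contract for {task}."
    if any("liability" in reason for reason in feedback):
        draft += " Liability is capped at fees paid."
    return draft


def stub_reviewer(standard: List[str], draft: str) -> Verdict:
    """Stand-in for a reviewer; checks that required terms are present."""
    missing = [term for term in standard if term not in draft.lower()]
    return Verdict(not missing, [f"missing clause: {t}" for t in missing])
```

With `standard=["liability"]` the first draft is rejected, the second is approved, and the history records both verdicts. With a standard the stub writer can never satisfy, the loop ends as `"escalated"` after `max_attempts` rather than running forever.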

In Tallyfy this shape isn’t an integration project. It’s a template: a task step assigned to an AI for the draft, followed by an approval step that routes approve-or-reject, with the rejection path looping back and every cycle stamped in the audit trail - who rejected, when, and what changed before the next attempt. A deadline on the review step keeps the gate from becoming the queue where work goes to age. And process owners read the whole loop at a glance, which means the review standard - the thing the reviewer checks against - lives in the open where the team can tighten it, instead of inside a prompt only one engineer has seen.

There’s also a quieter mathematical reason the gate beats a longer leash. Chained generation compounds error multiplicatively - each unchecked step multiplies the chance that the output is still right by that step’s success rate, so the odds of a clean final result shrink with every link - while a review gate resets the chain by catching defects before they propagate. The worked numbers below show the difference.

Why AI needs one defined task

With 10 tasks in the job, and AI nailing any one task 90% of the time:

AI does the whole job alone: 35%

With Tallyfy, one task at a time: 99%

90% per task, 10 tasks in a row, is about 35%. A 10-step job done blind is worse than a coin flip.

We got the emphasis wrong at first ourselves - more horsepower for the writer, kind of an afterthought for the check - and the numbers above are why we stopped. A modest reviewer in front of an imperfect writer outperforms a brilliant writer running unchecked, on any chain long enough to matter.
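The compounding arithmetic is easy to check. The sketch below assumes that errors at each step are independent, and it uses a deliberately simple, hypothetical gate model: a defect ships only when the writer is wrong and the reviewer misses it, and every caught defect is fixed (by retry or escalation) before the next step. The function names and the 90% catch rate are illustrative assumptions, not measured figures.

```python
def chain_success(per_step: float, steps: int) -> float:
    """Chance that every step in an unchecked chain comes out right."""
    return per_step ** steps


def gated_step_success(per_step: float, catch_rate: float) -> float:
    """Chance one step ends right when a reviewer gates it.

    Assumed model: a defect ships only if the writer is wrong AND the
    reviewer misses it; caught defects are fixed before moving on.
    """
    return 1 - (1 - per_step) * (1 - catch_rate)


# Ten unchecked steps at 90% each.
blind = chain_success(0.90, 10)

# Same writer, with an assumed reviewer that catches 90% of defects per step.
gated = chain_success(gated_step_success(0.90, 0.90), 10)

print(f"blind chain: {blind:.1%}")
print(f"gated chain: {gated:.1%}")
```

Under these assumptions the blind chain lands at roughly 35% and the gated chain at roughly 90%. The writer did not improve at all; only the gate was added.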

Start with the reviewer, not the writer

If you take one action from this post, make it this: before you wire up any generating agent, write down what the reviewer would check. Five bullet points. What does done look like, what’s an automatic reject, who gets the escalation when the loop stalls? That document is worth more than a model upgrade, because it’s the standard both agents - and every human around them - will be held to.

Then run the pattern at whatever level of automation you can defend. Human writer, human reviewer: that’s a tough sell to call AI, but it’s already the pattern, and it’s how most approval processes run today. AI writer, human reviewer: the highest-value version for most operations work right now, and the one we see teams adopt first. AI writer, AI reviewer, human owning the escalation path: where the engineering crowd is converging, one bounded step at a time.

Once it’s running, watch one number: how often the reviewer kicks work back. A rate near zero means your reviewer is a rubber stamp and you’ve rebuilt the single-agent problem with extra steps. A rate near half means the writer’s instructions are starving it of context, and the fix belongs upstream, not in a sterner reviewer. The healthy middle - real rejections, specific reasons, falling over time - is what a working quality loop looks like on a dashboard, and it’s a number a process owner can read without understanding a single prompt.
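As a sketch of how that number might be tracked, the snippet below computes a kick-back rate from recorded verdicts and sorts it into the three zones described above. The cut-offs `low` and `high` are hypothetical defaults chosen only for illustration; the post gives "near zero" and "near half" as rough guides, not exact thresholds, so tune them to your own process.

```python
from typing import Iterable


def kickback_rate(verdicts: Iterable[bool]) -> float:
    """Share of reviews that were rejections (True = approved, False = kicked back)."""
    recorded = list(verdicts)
    if not recorded:
        raise ValueError("no reviews recorded")
    rejected = sum(1 for approved in recorded if not approved)
    return rejected / len(recorded)


def diagnose(rate: float, low: float = 0.02, high: float = 0.40) -> str:
    """Map a kick-back rate to a rough reading; thresholds are assumptions."""
    if rate <= low:
        return "rubber stamp"
    if rate >= high:
        return "writer starved of context"
    return "healthy middle"
```

For example, nine approvals and one rejection give a rate of 0.1, which this sketch reads as the healthy middle. A trend line over time says more than any single reading, since the healthy pattern is a rate that falls as the writer's instructions improve.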

What you shouldn’t do is skip to the swarm. The evidence from the people who tried is blunt and recent and fair enough to both sides: coordination eats the gains, the verdict loses its owner, and the fix is the oldest move in operations - fewer hands, clearer roles, one gate that work must pass before it counts.

Two agents. One verdict. That’s the whole pattern, and it’s enough.
