Summary
- Agents loop and stall when they run their own control flow - asked to iterate over a list, a model loses track of what it finished, repeats work, and stretches a three-minute job into ten.
- A real war-story shows the pattern - a developer in the Hacker News thread “Agents need control flow, not more prompts” watched a QA agent break after about 30 files, sometimes missing one, sometimes re-testing a bundle for no reason.
- The fix is determinism around the model, not a better prompt - that same developer wrapped the model in a basic harness that owns the loop and stores results, and the system got, in their words, “a billion times more reliable.”
- Let the model judge one item, let the process run the list - externalize the loop counter, the state, and the per-item retry.
Ask an AI agent to work through a list on its own and here’s how it tends to fail. Not with a crash. With a loop that won’t end, a step it silently skips, or the same file checked three times while another never gets touched. The model isn’t too dumb to do the work. It’s bad at keeping track of where it is in the work, and a long loop is mostly bookkeeping.
That’s the gap between judging one item and orchestrating a hundred. A model can read a file and decide if it passes. What it can’t do reliably is remember which files it already read, count how many are left, and know when to stop. Hand it both jobs at once and the second one rots. This is a big part of what it takes to trust AI with a long job, and it’s why the cleanest agent deployments give the model the judgment and give something else the loop.
It’s a clean division of labor once you see it. The model is a judgment engine, sharp on the local call and indifferent to the global plan. The loop is bookkeeping, and bookkeeping wants a ledger, a counter, and a hard rule for when to stop, none of which a model keeps reliably. Mix the two and the judgment stays good while the bookkeeping falls apart, which is why the failure looks so odd. The agent clearly understood every file, and still couldn’t get through the list.
Why does a model that can test any single file perfectly fall apart running the list?
Why a model can’t reliably run its own loop
A loop has state, and state is exactly what a language model is worst at holding across many steps. To run “for each file, test it,” the agent has to track which files are done, which failed, which still need a retry, and how many remain, all inside the same context it’s using to do the actual testing. That bookkeeping competes with the work, and it lives in the place most likely to drift: the running context. Miss one update and the agent loses count. Now it retests files it already cleared, skips one it never started, or talks itself into believing a finished item needs another pass.
None of this is a reasoning failure. The model can judge any single file perfectly. It’s a state-tracking failure, and state-tracking is a job for code, not for a probability distribution over the next token.
The failure shows up in a few recognizable shapes. It loses count, so it believes it’s on item twelve when it’s really on item nine. Off by one, it skips a file in silence. Then an error on one item convinces it that four earlier items need re-checking, so it redoes work that already passed. Worst of all, there’s no hard exit, so a job that should end at item thirty either wanders past the end or circles back, because nothing outside the model is enforcing “done.” Each shape is the same root cause in a different costume: the count, the cursor, and the stop rule all live in a context that’s busy doing something else.
A loop is mostly bookkeeping, and bookkeeping is not what models are for.
Hand that bookkeeping to plain code and every one of those shapes vanishes at once. The model never has to count, so it never miscounts.
What the 30-file loop actually looked like
You don’t have to take this as theory. A developer posting as 827a in the Hacker News thread “Agents need control flow, not more prompts” described it from a real system. They’d built a QA agent to run through a couple hundred requirements files in a browser session, and tried to let the model manage the high-level control flow: look in the directory, and for each requirement file, decide whether the app meets it. Reasonable ask. It worked, until it didn’t.
In their words: “This started breaking down after ~30 files. Sometimes it would miss a file. Sometimes it would triple-test a bundle of files and take 10 minutes instead of 3. An error in one file would convince it it needs to re-test four previous files, for no reason.”
That last detail is the tell. A single failure didn’t just fail, it scrambled the loop’s sense of progress, and the agent went back and re-did work that was already done. Their model’s ability to orchestrate the run, they noted, had no consistency across versions. Sometimes it worked. Sometimes it didn’t.
It’s not only QA loops. The same thing happens to a data-migration agent told to “clean and import each of these records,” which re-imports a batch it already processed after one malformed row throws it off, or to a reconciliation agent that re-opens accounts it already balanced because a later mismatch made it second-guess the earlier ones. Any time the job is “do the same thing to every item in a list,” the model is being asked to be the loop, and the loop is the one place it’s reliably weak.
One thing that surprised us watching AI tools run on our own systems is how fast an uncapped loop turns into a messy runaway. A single request once fanned out into dozens of parallel sub-runs because nothing in the loop decided when enough was enough. Another left orphaned runs alive that an external process had to step in and kill, since the agent never registered they were finished. The model wasn’t malfunctioning in any of these. It just had no reliable sense of “stop,” because the thing that should own “stop” is the loop, and the loop was sitting inside the model.
And the cost compounds. Every needless extra pass is another roll of the dice on a step that doesn’t always succeed, so a loop that re-does work isn’t only slow, it gets less reliable with each redundant lap. Drag the step count up and watch the end-to-end success rate sink, which is the math behind why a long, loose loop is a bad trade.
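The compounding is easy to check by hand. Here is a minimal sketch, assuming every step succeeds independently with the same probability (a simplification; real steps are rarely that uniform). The function name `end_to_end_success` is ours, made up for illustration.

```python
def end_to_end_success(per_step: float, steps: int) -> float:
    """Chance that every step succeeds, assuming independent, identical steps."""
    return per_step ** steps


# Each redundant lap adds steps, and each added step multiplies in another
# chance of failure. Compare a short run with progressively longer ones.
for steps in (1, 3, 10, 30):
    rate = end_to_end_success(0.90, steps)
    print(f"{steps:>2} steps at 90% each: {rate:.1%} end to end")
```

The ten-step row is the "about 35%" figure. A loop that triple-tests a bundle is not just slower; it is further down this curve.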
Why AI needs one defined task
90% per task, 10 tasks in a row, is about 35%. A 10-step job done blind is worse than a coin flip.
Move the loop out of the model and into a harness
The developer in that thread didn’t fix it with a cleverer prompt. They wrapped the model in code: “We ended up creating a super basic deterministic harness around the model. For each test case, trigger the model to test that test case, store results in an array, write results to file.” That, they said, made the system “a billion times more reliable.” The model still did the judging. The harness did the looping.
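The thread doesn't show the developer's code, so what follows is only a sketch of the shape they describe: loop in code, one model call per test case, results in an array, results written to a file. The names `run_harness` and `judge` are hypothetical, and `judge` stands in for whatever call actually asks the model about one requirement.

```python
import json
from pathlib import Path
from typing import Callable


def run_harness(
    test_cases: list[str],
    judge: Callable[[str], bool],  # hypothetical: one model call, one verdict
    out_path: Path,
) -> list[dict]:
    """Run every test case exactly once, in order, and persist the results."""
    results: list[dict] = []
    # The for-loop owns the cursor and the exit; the model never sees the list.
    for index, case in enumerate(test_cases):
        try:
            passed = judge(case)
            results.append({"index": index, "case": case, "passed": passed, "error": None})
        except Exception as exc:
            # A failure is recorded for this item only; earlier items are untouched.
            results.append({"index": index, "case": case, "passed": None, "error": str(exc)})
    out_path.write_text(json.dumps(results, indent=2))
    return results


if __name__ == "__main__":
    # Stand-in judge for demonstration; a real one would call the model.
    demo = run_harness(["req_1.md", "req_2.md"], lambda case: True, Path("results.json"))
    print(len(demo), "cases judged")
```

The point of the design is what the model is never asked to do: it gets one case at a time and has no way to skip a file, revisit one, or run past the end.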
Moving the loop into code is also the argument of the article the thread was discussing, written by a developer named Brian. Reliability, he writes, “requires moving logic out of prose and into runtime,” with “explicit state transitions and validation checkpoints that treat the LLM as a component, not the system.” A component, not the system. The model is one part you call when you need judgment, not the thing running the show.
A workflow engine is basically that harness, just one you don’t have to cobble together yourself for every project. It owns the for-loop, the counter, the state, and the per-item retry. It knows which items are done because it recorded each one, not because a model is trying to remember. When an item fails, the process decides what happens next: retry it, skip it, or route it to a person, while the other ninety-nine items keep moving. We built Tallyfy’s automation rules and task and approval steps to externalize that orchestration, because “for each X, do Y, and here’s what happens when Y fails” is a process, not a prompt.
Spell out what that harness actually owns and the reliability stops being mysterious. It keeps a ledger of which items are done, so a restart picks up where it left off instead of re-running the lot. Retry logic lives there too: a failed item gets a fixed number of attempts and then escalates, instead of looping forever. It caps how many items run at once, so one request can’t quietly fan out into hundreds. And it holds the exit condition outside the model, so “stop at the end of the list” is a fact rather than a hope. Those are mundane properties for a piece of software and nearly impossible for a model improvising inside its own context, which is the whole reason you move them out of the prompt. Restartability alone tends to pay for the harness: when something dies at item 180, you rerun and it resumes, instead of starting the whole list over and paying for all 180 again.
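Those properties fit in a few dozen lines. The sketch below is one hypothetical way to do it, not Tallyfy's implementation or the developer's: a JSON file serves as the ledger, each item gets a fixed number of attempts before it is marked for escalation, and a rerun skips anything already recorded. It runs items one at a time, so the concurrency cap is simply one. `run_with_ledger`, `judge`, and the ledger layout are all assumptions for illustration, and items are assumed to be unique strings.

```python
import json
from pathlib import Path
from typing import Callable

MAX_ATTEMPTS = 3  # assumed retry budget per item


def run_with_ledger(
    items: list[str],
    judge: Callable[[str], bool],  # hypothetical: one model call per item
    ledger_path: Path,
    max_attempts: int = MAX_ATTEMPTS,
) -> dict:
    """Process each item at most once across runs, with bounded retries."""
    ledger = json.loads(ledger_path.read_text()) if ledger_path.exists() else {}
    for item in items:  # the exit condition is the end of this list
        if item in ledger:
            continue  # already done or escalated in an earlier run
        last_error = None
        for attempt in range(1, max_attempts + 1):
            try:
                verdict = judge(item)
            except Exception as exc:
                last_error = str(exc)
                continue  # retry this item only
            ledger[item] = {"status": "done", "passed": verdict, "attempts": attempt}
            break
        else:
            # Retry budget spent: hand the item to a person instead of looping.
            ledger[item] = {"status": "escalated", "error": last_error, "attempts": max_attempts}
        # Save after every item so a crash at item 180 loses at most one item.
        ledger_path.write_text(json.dumps(ledger, indent=2))
    return ledger
```

Rerunning with the same ledger file after a crash resumes at the first unrecorded item, which is the restartability described above. A production harness would also want atomic writes and a way to reopen escalated items, both left out here.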
The same developer made one more point worth sitting with. Wrapping the model in a harness, they said, also made their system “impossible to run on any managed agent platform,” because the popular platforms assume “the agent has to run everything.” That assumption is the bug. The market keeps shipping “let the agent drive” when the reliable pattern is “let the agent decide, and let the runtime drive.” A workflow engine is that idea made into a product: the runtime drives, the agent decides, and the two stop fighting over who holds the loop.
Let the model do the judging, and let the process keep the count.
Let the model judge, let the process count
None of this means AI agents are useless on big jobs. It means you point them at the right slice of a big job. The judgment per item, the reading, the classifying, the deciding-whether-this-passes, that’s real and the model is good at it. The orchestration, the looping, the counting, the knowing-when-to-stop, goes to something built to be deterministic. Spin up the agent for the parts that need a brain, and let plain code run the parts that need a memory.
We have seen this split hold up every time: the loops that ran clean were the ones where a person, not the model, could have said how many items were left at any moment.
Before you let an agent run a loop, three questions sort out whether you’re safe or sorry. Does the job repeat the same operation over a list of items? Does it need to know reliably when it’s finished? And does a missed or doubled item actually cost something real? Three yeses, and the loop belongs to a harness, with the model called once per item to do the judging. One no, and a quick agent prompt is probably fine. The trap is reaching for a fully autonomous agent on a job that’s three clear yeses and hoping the model holds the count this time.
So before you ask an agent to manage a long loop, ask who owns the exit. If the answer is “the model, somewhere in its context,” you’ve already met the agent that loops forever. Give the loop to a harness, give the items to the model, and the thing that was burning ten minutes on a three-minute job goes back to taking three. This is the same lesson that the reliability math tells from the cost side and that “why an AI agent needs a workflow engine” tells from the structure side: AI is built for a single task, not a whole job, and the job is what the process is for.