AI-built workflows need humans in the loop - every time

Summary

The draft is the easy part - when we measured Tallyfy’s own AI step generation, roughly 80% of generated form fields were usable and about 70% of automations matched intent, which means roughly 20% to 30% of the machine’s output still needed a human editor before launch.
Where do AI-built workflows actually break? Not on the step sequence. They break on the exception path, the regulatory variant, and the approval quirk nobody ever wrote down for the model to read.
The tooling already concedes the point - AgentGate ships policy-gated approvals for agents, Anthropic’s Agent SDK routes unmatched tool calls to a human callback, and EU AI Act Article 14 writes a literal stop button into law.
Review at design time AND at run time - one gate catches structural mistakes before launch, the other catches per-instance judgment calls. See how Tallyfy builds both gates into one process: book a demo.

The demo is real. You type “build me a client onboarding workflow with a document request, a compliance check, and a kickoff call,” and a workflow appears - steps named, ordered, plausible. I build workflow software for a living, and the first time I watched a model do this it was sort of thrilling.

Then an actual client goes through the thing.

My position, up front: a workflow an AI drafted and nobody reviewed will encode wrong assumptions that surface weeks later, on the instances that matter most. Nobody needs to ditch the AI draft. What’s missing is two human gates - one when the workflow is designed, one while it runs - and they catch different mistakes, so you need both. That’s the whole argument. The rest of this post is what each gate is for, what the agent-tooling world quietly admits about this, and where human judgment sits in AI-heavy operations once the generation part becomes cheap.

Why does the AI-drafted workflow break a month in?

Because the model drafted the average process, and you don’t run the average process.

A question we keep fielding from teams trying natural-language builders: “the generated workflow looked right - why did it fall over?” The sequence was fine. Employee onboarding looks like employee onboarding pretty much everywhere: collect documents, provision accounts, schedule orientation, assign a buddy. Any decent model has seen ten thousand versions of it.

What the model has never seen is your version - the rule that purchase approvals over $25,000 go to the CFO only after legal has initialed the vendor terms, or that German employee data can’t touch the US payroll system, or that the intake form needs a fourth field because one regulator in one state says so. Those constraints live in people’s heads and in old email threads. They are exactly what an AI cannot pattern-match its way into, and exactly what makes a process yours rather than a template.

We have hard numbers on this from our own product, and I’d rather quote ours than someone’s marketing. When Tallyfy measured its own AI step generation in production, around 80% of generated form fields were relevant and usable, and about 70% of generated automations matched what the workflow actually needed. Read those numbers the other way: one in five fields was wrong, three in ten rules were wrong, and the steps themselves were only a fraction of what made the template useful - form fields, automations, assignments, and deadlines still needed a person who knew the business. The drafts were worth having. Shipping them unreviewed would have been malpractice.

The same engineering write-up records the less glamorous stuff too: generation taking 25 seconds while a user stares at a spinner, a continuation pass taking closer to 40, and the AI returning markdown where the renderer wanted HTML, so users saw raw asterisks in their step descriptions. None of those bugs were the model being dumb. They were the gap between a demo and a production system, and that gap is precisely where a reviewing human stops being optional.

The dangerous part is that none of this announces itself. A made-up vignette to show the shape: the AI-drafted procurement flow runs clean for three weeks, then an invoice from a new vendor in a sanctioned country sails through because the screening step the model never knew about isn’t there. The workflow didn’t crash. It worked as written, and as written was wrong. A month in, that gap stops being hypothetical.

Split the human gate in two

Design time and run time are different jobs, and conflating them is the most common mistake in this whole conversation.

The design-time gate happens before the workflow ever goes live. A person who knows the process reads the AI’s draft the way a manager reads a new hire’s first SOP: is the legal review actually in there, and is the approver the right role? Did the model invent a deadline that contradicts the contract? This review is fast precisely because the draft exists - editing beats authoring, which is the entire reason to let AI draft at all. The right reviewer is the person who will answer for the process - the operations lead who owns onboarding, the controller who owns procurement - not whoever happens to administer the software.

What we got wrong at first, frankly, was treating this as optional polish for power users. It isn’t polish. It’s where structural errors get caught, and a structural error ships to every future run of the process.

The run-time gate lives inside the workflow itself: an approval step before money moves, a review task before the email leaves, a sign-off before the exception path proceeds. I covered the general case in human in the loop is not optional, so I won’t re-run the confidence-scoring mechanics here. The short version is that run-time review catches what no design review can: the individual instance that’s weird.

Figure: a plain-English description becomes an AI workflow draft, passes through a design gate, then runs live behind a run-time approval gate.

Two gates, two failure classes. The design gate catches errors of structure: a missing compliance step, a wrong approver, a generic 3-day deadline where your SLA says 24 hours. The run gate catches errors of instance: this refund is 40 times the usual size, this applicant’s paperwork contradicts itself, this contract has a clause nobody’s seen before. That asymmetry deserves naming. A design miss costs you a little on every single run until someone finally notices, while a run miss costs you once, possibly enormously.

Skip the design gate and you ship a structurally wrong process that runs wrong every time. Skip the run gate and your structurally correct process handles the weird case at full speed with nobody watching. Teams keep buying tools to solve one gate and assuming they got the other for free, and the assumption fails quietly in both directions.
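
To make the run gate concrete, here is a minimal Python sketch of one instance-level check: the refund that is many times the usual size. The names (`Refund`, `needs_human_review`, `ratio_limit`) and the 10x default threshold are assumptions for illustration, not anything Tallyfy ships. A real gate would use whatever thresholds the process owner signs off on.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Refund:
    refund_id: str
    amount: float


def needs_human_review(refund: Refund, recent_amounts: list[float],
                       ratio_limit: float = 10.0) -> bool:
    """Return True when this refund should wait for a person.

    Hypothetical rule: hold anything at least `ratio_limit` times the
    median of recent refunds, and hold everything when there is no
    usable history to compare against.
    """
    if not recent_amounts:
        return True  # no baseline, so a human decides
    typical = median(recent_amounts)
    if typical <= 0:
        return True  # a nonsensical baseline is itself a reason to look
    return refund.amount / typical >= ratio_limit


history = [40.0, 45.0, 50.0, 55.0, 60.0]
print(needs_human_review(Refund("r-1", 2000.0), history))  # 40x the median
print(needs_human_review(Refund("r-2", 60.0), history))    # ordinary size
```

With a typical refund of 50, the 2,000 refund is 40 times the usual size and is held for a person, while the 60 refund flows through. The check is deliberately dumb: it does not decide anything, it only decides who decides.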

Even the agent-tooling crowd ships a human gate

Look at what the people building agent infrastructure actually ship, as opposed to what the launch videos imply.

In February 2026, a developer with the handle amit_paz - a different Amit, for the record - posted a Show HN for AgentGate, an MIT-licensed approval layer for AI agents. The pitch opens by admitting the gap: AI agents are getting good at doing things autonomously, but “should this agent actually send that email / delete that file / deploy to prod?” is still an open problem. Its design routes by policy - auto-approve the safe stuff, auto-deny the dangerous stuff, and send everything in between to a person on Slack, Discord, email, or a dashboard, with a full audit trail for every request and decision. The project’s README compresses it into six words: agents request, policies decide, humans approve. It shipped as 8 npm packages with 497 passing tests, which tells you someone treated the human gate as production infrastructure rather than a checkbox.
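
The three-way routing described above can be sketched in a few lines of Python. This is not AgentGate's API. The action names, the `route` function, and the in-memory `audit_log` are hypothetical and only illustrate the shape: safe actions pass, dangerous ones are refused, everything else waits for a person, and every decision is recorded.

```python
from enum import Enum


class Decision(Enum):
    APPROVE = "approve"
    DENY = "deny"
    ASK_HUMAN = "ask_human"


# Hypothetical policy tables; a real deployment would load these from config.
AUTO_APPROVE = {"read_file", "search_docs"}
AUTO_DENY = {"drop_database"}

audit_log: list[dict] = []


def route(agent: str, action: str) -> Decision:
    """Decide by policy, defaulting to a human for anything unlisted."""
    if action in AUTO_DENY:
        decision = Decision.DENY
    elif action in AUTO_APPROVE:
        decision = Decision.APPROVE
    else:
        decision = Decision.ASK_HUMAN  # the in-between goes to a person
    # Record every request and its outcome, whatever the outcome was.
    audit_log.append({"agent": agent, "action": action,
                      "decision": decision.value})
    return decision


print(route("billing-bot", "read_file"))
print(route("billing-bot", "send_email"))
print(route("billing-bot", "drop_database"))
```

The design choice worth noticing is the default branch: an action nobody thought to classify lands with a human rather than being silently allowed.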

Anthropic made the same call in its own developer tooling. The Claude Agent SDK’s permission system evaluates every tool call through hooks, deny rules, ask rules, permission modes, and allow rules - and anything still unresolved lands in a callback whose documented job is to “prompt users for approval at runtime.” Hooks run before everything else, and deny rules stay live even in the most permissive bypass mode. Its planning mode goes further still: file edits are never auto-approved while planning, no matter what the allow rules say. That’s a vendor whose commercial interest runs toward more autonomy, building a mandatory pause into the loop anyway. The layering isn’t an accident of API design. It’s a statement about where judgment belongs.
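
A layered check like that can be pictured as an ordered chain where the first layer with an opinion wins and anything unresolved falls through to a person. The sketch below is a generic Python illustration under my own assumptions. It does not use or reproduce the Claude Agent SDK's interfaces, it leaves out hooks and ask rules for brevity, and the layer functions, mode strings, and tool names are invented.

```python
from typing import Callable, Optional

# A layer returns "allow", "deny", "ask" (force a human), or None (no opinion).
Layer = Callable[[str, str], Optional[str]]

DENY_RULES = {"wipe_disk"}                # hypothetical tool names
ALLOW_RULES = {"read_file", "edit_file"}


def deny_layer(tool: str, mode: str) -> Optional[str]:
    return "deny" if tool in DENY_RULES else None


def mode_layer(tool: str, mode: str) -> Optional[str]:
    if mode == "plan" and tool == "edit_file":
        return "ask"    # edits are never auto-approved while planning
    if mode == "bypass":
        return "allow"  # permissive, but only reached after deny rules
    return None


def allow_layer(tool: str, mode: str) -> Optional[str]:
    return "allow" if tool in ALLOW_RULES else None


# Order matters: deny rules run before the mode can wave anything through.
LAYERS: list[Layer] = [deny_layer, mode_layer, allow_layer]


def evaluate(tool: str, mode: str, ask_human: Callable[[str], str]) -> str:
    """Walk the layers in order; unresolved calls go to the human callback."""
    for layer in LAYERS:
        verdict = layer(tool, mode)
        if verdict in ("allow", "deny"):
            return verdict
        if verdict == "ask":
            break  # skip the remaining layers and go straight to a person
    return ask_human(tool)
```

Because the deny layer sits first, `evaluate("wipe_disk", "bypass", ...)` is refused even in the permissive mode, and because the mode layer sits ahead of the allow layer, an allow rule for `edit_file` cannot skip the human while planning.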

Regulators reached the same conclusion from the other direction. Article 14 of the EU AI Act requires high-risk AI systems to be designed so the humans overseeing them can “disregard, override or reverse the output” and can “interrupt the system through a ‘stop’ button or a similar procedure.”

A stop button. In legislation.

I wrote about what the Act’s moved deadlines mean for process owners separately, but the design principle stands on its own: when toolmakers, platform vendors, and lawmakers independently install the same gate, the gate is probably doing real work.

None of this is anti-AI. Every one of these systems exists to run more automation, not less - the gate is what makes the automation deployable. That’s also the stance behind Tallyfy AI: AI does the work inside a step; a person owns the consequential transitions.

Where Tallyfy actually is on this

I’ll be specific about our own state here, because vague claims are this category’s default setting.

Tallyfy already ships AI assistance for drafting: it can generate a template’s step structure and suggest descriptions, and our published engineering numbers above tell you exactly how far that goes - useful skeleton, mandatory human edit. The fuller version, where you describe your process conversationally and get a near-complete draft with fields and rules, is something we’re actively building right now. It is not shipped, and I’m not going to pretend otherwise. The honest pitch for it is the one this post has been making: the model produces the rough draft and matches common patterns; a person supplies the regulatory steps, the thresholds, and the org quirks, then signs off before launch.

At run time, what teams actually switch on inside Tallyfy today is deliberately modest: a step that pulls the fields out of a contract, a step that sorts an incoming request to the right queue, a step that writes a first draft for a person to send. Every one of those is young, every one is scoped to a single job, and the connection increasingly runs through our MCP server. The pattern that holds up is plain: the AI does its step, and the consequential next move waits on a task or approval with a named owner and a deadline. Turns out the same two-gate split covers run-time AI too - scoped step, human transition - which is what disciplined workflow automation has always looked like, with or without a model in the loop.
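
As a rough Python sketch of that pattern, assuming nothing about Tallyfy's actual API: a scoped step produces data, and the consequential move refuses to run until a task with a named owner and a deadline is approved. `extract_fields` is a stand-in for the AI step, and `ApprovalTask`, `open_approval`, and `release_payment` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class ApprovalTask:
    owner: str          # a named person, not a shared inbox
    due: datetime
    payload: dict
    approved: bool = False


def extract_fields(contract_text: str) -> dict:
    """Stand-in for the AI step: pull 'Key: value' pairs out of text."""
    fields = {}
    for line in contract_text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip().lower()] = value.strip()
    return fields


def open_approval(payload: dict, owner: str, hours: int,
                  now: datetime) -> ApprovalTask:
    """Create the human transition: owner plus deadline."""
    return ApprovalTask(owner=owner, due=now + timedelta(hours=hours),
                        payload=payload)


def release_payment(task: ApprovalTask) -> str:
    """The consequential move; blocked until the owner approves."""
    if not task.approved:
        raise PermissionError(
            f"waiting on {task.owner}, due {task.due:%Y-%m-%d %H:%M}")
    return f"released payment to {task.payload['vendor']}"
```

The AI step never calls `release_payment` itself. It hands its output to `open_approval`, and only a person flipping `approved` lets the next move proceed.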

Could we skip the design gate once the models get better?

I doubt it, and the reason isn’t model quality. The constraints that break AI-drafted workflows aren’t in any training set - they’re local, unwritten, and invisible until the day someone puts them into a process. Better models will draft better averages. Your weird rules will still be yours.

The likelier evolution is that builders get better at asking. A workflow generator that interviews you - “who approves purchases above what amount?”, “any region-specific steps?” - would close a real share of the gap, and it’s the direction we find most promising in our own work.

Notice what that does, though: it moves the human’s judgment earlier, into the interview. Someone still has to know the answers and stand behind them. The gate relocates. It doesn’t disappear.

Make the review cheap, not heroic

A design gate that demands heroism will get skipped, and then you don’t have a gate - you have a clunky ritual everyone resents. The way to make review stick is to shrink what’s being reviewed.

Keep the unit small. A workflow of 10 to 20 steps fits on one screen and gets a real read. A 90-step monolith gets a scroll and a shrug. Nobody reviews what they can’t hold in their head, and no checklist fixes that. Split the giant process into linked smaller ones and review each on its own terms. This is the same logic that makes vibe-coded integrations maintainable when they stay small - the economics of review track the size of the unit, not the talent of the reviewer. Smaller units also shrink what a missed review can hurt.

Give the reviewer a checklist instead of a hunch. Four questions cover most of the failure surface:

Which steps touch money, customers, or regulated data, and does each one have the right approver?
What did the model leave out - the compliance check, the regional variant, the notification someone legally must receive?
Are the deadlines real ones from your SLAs, or plausible-sounding inventions?
Where does the exception path go, and is there a human at the end of it?
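
The four questions above can be partly mechanized so the reviewer starts from a list of findings instead of a blank page. This is a minimal Python sketch with a hypothetical `Step` structure and `review_findings` function. It can only check that an approver exists, not that it is the right one, and the list of must-have steps still has to come from a person who knows the process.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Step:
    name: str
    sensitive: bool = False              # touches money, customers, or regulated data
    approver: Optional[str] = None
    deadline_hours: Optional[int] = None
    deadline_from_sla: bool = False      # True once someone traced it to an SLA


def review_findings(steps: list[Step], must_have: set[str],
                    exception_owner: Optional[str]) -> list[str]:
    """Return structural findings for a human reviewer to resolve."""
    findings = []
    for step in steps:
        # Question 1: sensitive steps need an approver.
        if step.sensitive and not step.approver:
            findings.append(f"{step.name}: sensitive step has no approver")
        # Question 3: deadlines should come from an SLA, not the model.
        if step.deadline_hours is not None and not step.deadline_from_sla:
            findings.append(f"{step.name}: deadline not traced to an SLA")
    # Question 2: what did the model leave out?
    present = {step.name for step in steps}
    for name in sorted(must_have - present):
        findings.append(f"missing step: {name}")
    # Question 4: is there a human at the end of the exception path?
    if not exception_owner:
        findings.append("exception path has no human owner")
    return findings
```

An empty list does not mean the draft is right. It means the cheap checks passed and the reviewer can spend their minutes on the judgment calls.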

And know what not to review, because over-reviewing kills the gate as surely as skipping it. Don’t line-edit the AI’s step descriptions or argue with its phrasing - that’s wordsmithing, and the draft’s prose is the part the model does fine. Review coverage and consequences: the steps that exist, the steps that don’t, and who holds the pen at each decision. A reviewer who knows they’re checking structure rather than copyediting moves faster and misses less.

Then put one named person on the sign-off, with the approval recorded - the same accountability you’d demand for a substrate that runs generated integration code. Review without an owner decays into a formality within a quarter. An owner with a one-screen draft and four questions, by contrast, can clear a real review in minutes. That’s the trade: a few minutes of named human attention at design time, a designed approval at run time, and in exchange the thousand future runs of that workflow inherit judgment instead of inheriting a guess.

The workflow you described in one sentence still deserves one reader before it runs a thousand times. Those few minutes of reading are the no-brainer. Skipping them just orders the incident in advance.
