I let AI build tools for my business. Here is what broke.

Summary

The logic was never what broke - I had AI build real internal tools, and the code it wrote was fine. What failed was everything around the code: when to stop, what data counted as “all of it,” and how much it was allowed to spend.
Three failures, one missing thing - background work that never cleaned up and quietly starved a machine until an outside watcher caught it; a confident, wrong count from a tool that only saw a slice of the data; an uncapped run that turned a small request into a real bill.
The fix was an outside check, not a better prompt - the thing that made AI-built tooling safe was an external verifier the AI could not talk its way past. It lands all three pillars at once: it checks the work against reality, it never lets the model skip the check, and it is the off-switch.
Vibe-code the logic, let the platform own the rest - context, credentials, and edges are the workflow platform’s job.

I’ve let AI build a fair amount of my own internal tooling over the past year. Small tools, the kind you describe in a sentence and get working code back for. Most of them helped. A few broke in ways that taught me more than the wins did, and the failures all rhymed.

Here’s the short version before the stories. Nothing that broke was the AI writing bad logic. The logic was the easy part, every time. What broke was everything I’d quietly assumed the tool would handle but it never did: it didn’t know when to stop, it answered from data it couldn’t fully see, and it ran up cost because nothing capped it. Three different failures, one root cause underneath all of them. The tool had no scaffolding around it.

This isn’t the “vibe coding kills the integration middleman” argument, and it isn’t the arithmetic of why long AI chains collapse. This is the first-person version: what actually broke when I let AI build tools I depend on, and the one thing that made the next round safe. It’s the messy, lived-in corner of how AI is changing work itself, the part that doesn’t fit in a demo.

What actually broke

Three failures, told as shapes, because the specifics don’t matter and the patterns do.

The first was a tool that spun up background work and never cleaned up after itself. It would kick off a job, the job would finish or fail, and the leftover process would just sit there, holding resources. One at a time, no problem. Over days, the leftovers piled up until a machine was quietly starving, and I only caught it because something outside the tool noticed the machine was running out of room to breathe. The tool itself had no idea anything was wrong. It was happily spawning more work right up to the edge. Nothing in it ever asked “did the last thing I started actually die?”
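A minimal sketch of the question that tool never asked, assuming the background work is a direct child process. The `run_job` helper and its timeout values are hypothetical, not what the original tool looked like.

```python
import subprocess
import sys


def run_job(cmd, timeout_s):
    """Run a child process and make sure it is gone before returning.

    Returns the exit code, or None if the job had to be killed.
    """
    proc = subprocess.Popen(cmd)
    try:
        return proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        proc.kill()  # the job overran its time limit: stop it
        proc.wait()  # reap it so nothing is left holding resources
        return None
    finally:
        # "Did the last thing I started actually die?"
        if proc.poll() is None:
            proc.kill()
            proc.wait()


# One job that finishes on its own, and one that would otherwise linger.
quick = run_job([sys.executable, "-c", "pass"], timeout_s=10)
stuck = run_job([sys.executable, "-c", "import time; time.sleep(60)"], timeout_s=0.5)
```

This only covers the direct child. Anything that child spawns itself would need its own cleanup, which is one more reason the watcher outside the tool still matters.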

The second was a confident, wrong answer. I had a tool that answered a counting question, and it gave me a small, tidy number with total confidence. The number was wrong, and not by a little. Turns out it had answered from the slice of data it could reach in one pass, not the whole set, and nothing in the setup ever defined that “all of it” meant all of it. No crash, no error, no flag. Just a clean, wrong answer that I almost acted on before something felt off.
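Here is a sketch of what defining “all of it” could look like, assuming the data arrives in pages and the source can report its own total. `fetch_page` is a hypothetical stand-in for whatever the real source was, and the figures are made up for illustration.

```python
def fetch_page(records, offset, limit):
    """Hypothetical stand-in for a paged data source."""
    return records[offset:offset + limit]


def count_all(records, page_size, expected_total):
    """Count every record by paging until the source runs dry.

    Raises if the result disagrees with the total the source reports,
    so a partial read cannot pass as a full answer.
    """
    seen = 0
    while True:
        page = fetch_page(records, seen, page_size)
        if not page:
            break
        seen += len(page)
    if seen != expected_total:
        raise ValueError(f"counted {seen}, source reports {expected_total}")
    return seen


data = list(range(1050))  # illustrative data, not real figures

first_page_only = len(fetch_page(data, 0, 100))  # the tidy, wrong number
full_count = count_all(data, page_size=100, expected_total=len(data))
```

The cross-check against `expected_total` is the part that matters: a count that cannot be reconciled with the source fails loudly instead of looking clean.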

The third was cost. An uncapped run, a loop that called a paid service far more times than I’d pictured when I described it, and a bill that was bigger than the task was worth. Not catastrophic. Just a small request that quietly became an expensive one because nothing said “this much, then stop.” How many people find out their AI tool had no spending limit the same way, from the invoice?
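“This much, then stop” can be very little code. A minimal sketch, assuming each paid call has a known cost in cents; the `SpendCap` class, the limits, and the prices are all hypothetical.

```python
class BudgetExceeded(RuntimeError):
    """Raised when the next paid call would cross the cap."""


class SpendCap:
    def __init__(self, max_calls, max_cost_cents):
        self.max_calls = max_calls
        self.max_cost_cents = max_cost_cents
        self.calls = 0
        self.spent_cents = 0

    def charge(self, cost_cents):
        """Record one paid call, or refuse it before any money is spent."""
        over_calls = self.calls + 1 > self.max_calls
        over_cost = self.spent_cents + cost_cents > self.max_cost_cents
        if over_calls or over_cost:
            raise BudgetExceeded(
                f"cap reached after {self.calls} calls, {self.spent_cents} cents"
            )
        self.calls += 1
        self.spent_cents += cost_cents


def process(items, cap, cost_cents):
    results = []
    for item in items:
        cap.charge(cost_cents)  # check the cap before the paid call, not after
        results.append(item)    # stand-in for the paid service call
    return results


cap = SpendCap(max_calls=50, max_cost_cents=100)
try:
    process(range(1000), cap, cost_cents=2)
    outcome = "finished"
except BudgetExceeded as err:
    outcome = f"stopped: {err}"
```

The loop wanted 1,000 calls and the cap stops it at 50. The check runs before each call, so the cap is a ceiling rather than a report of what already happened.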

Here’s the thread running through all three. The tool always thought it was doing fine. It wasn’t lying to me, it just had no way to know any better, because knowing better was a job somebody had to define and nobody did. That’s not a problem you fix with a smarter model. It’s a missing-scaffolding problem, and you fix it with structure around the model, not more model.

Three failures, one root cause

Line those up and the shared cause is obvious. The first tool had no edges. The second had no context. The third had no edges and no real handle on cost. None of them failed because the model was bad at writing code. They failed because the code was all anyone built, and a tool in a real business needs three things the generator doesn’t hand you.

It needs a process around it so it knows when to run, who owns the output, and what “done” actually means. It needs a managed place for credentials so it acts with scoped, revocable access instead of a key in a text file. And it needs bounded edges, a cap and a gate and an off-switch, so it can’t do too much when it goes sideways. Context, credentials, edges. Every one of my failures was one of those three, missing.

Bram Cohen, who built BitTorrent and has been blunt about the limits of the hype, put the underlying point well: “Bad software is a decision you make.” That landed for me, because none of my failures were the AI’s fault in any useful sense. They were my decisions, or my non-decisions: I let the tool run with no edges, so it ran with no edges. The model handed me logic in two minutes. Whether that logic was safe to depend on was on me, and the part I’d skipped was the scaffolding.

An AI-built tool becomes safe to depend on only when it has a process for context, a vault for credentials, and gated edges plus an outside auditor

The logic was the easy part every single time. What broke was never the code the AI wrote; it was the absence of everything a workflow platform is supposed to hold around the code. That’s the whole of mega trend two for me, learned the slow way: generation got cheap, and the substrate that makes generation safe to lean on did not.

So what finally made it safe?

One thing did more than the rest, and it’s the least glamorous idea in this post: an outside check the AI can’t argue with.

On my own systems I run a verifier that sits outside the AI’s work. When an agent claims it finished a task, the check reads what it actually did against what it said it did, and blocks the result when those two don’t match. It catches the confident-wrong answer, because it doesn’t take the tool’s word for the count. It catches the half-finished job, because “I’m done” has to survive an inspection the tool doesn’t control. It is, in the plainest sense, the off-switch: an independent thing that can say no.
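The shape of that check, reduced to a sketch. This is not the verifier described above, only an illustration of the idea: the claim comes from the agent, the observation comes from somewhere the agent cannot touch, and a mismatch blocks the result. `Claim`, `gate`, and the row counts are hypothetical.

```python
from dataclasses import dataclass


class Rejected(Exception):
    """Raised when a claim does not match what was independently observed."""


@dataclass(frozen=True)
class Claim:
    task: str
    rows_written: int


def gate(claim, observe_rows):
    """Accept a claim only if an independent observation agrees with it.

    observe_rows is a callable the agent does not control, for example a
    direct query against the destination (hypothetical here).
    """
    actual = observe_rows()
    if actual != claim.rows_written:
        raise Rejected(
            f"{claim.task}: claimed {claim.rows_written} rows, found {actual}"
        )
    return claim


destination = ["row"] * 100                                # what really landed
claim = Claim(task="import-customers", rows_written=1050)  # what the agent said

try:
    gate(claim, observe_rows=lambda: len(destination))
    verdict = "accepted"
except Rejected as err:
    verdict = f"blocked: {err}"
```

The design choice is that `gate` never asks the agent anything. It only compares the claim with a reading it took itself.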

Why does outside matter so much? Because a tool that’s gone wrong is the worst possible judge of whether it’s gone wrong. It reports success in the same flat, confident voice whether it nailed the task or wrecked it, because from where it sits there’s no difference between the two. A verifier that lives inside the tool’s own logic inherits that exact blind spot. The one that works sits outside, where the AI can’t reach it, can’t talk it down, and can’t mark its own homework. My own longer case for this is in “why the best AI guardrails are invisible”: the protection that holds is a layer the work runs inside, not a rule you politely ask the model to follow.

This isn’t a me-on-my-laptop trick, either. Any team letting AI touch real work has the same hole: the AI reports success, and nobody has an independent way to know whether that success is real. Most teams fill the gap with hope, or with one diligent person who happens to double-check, right up until that person is on vacation. The durable version is structural, not heroic. An outside check that runs on every AI step, every time, whether or not anyone’s paying attention. That’s the line between a tool that dazzles in a demo and one you can leave running on a Friday afternoon without a knot in your stomach.

That one pattern lands all three pillars at once. It checks the work against reality, which is context. It never lets the model skip the check or hold the keys, which is credentials. And it can stop the run, which is the bounded edge. An external auditor is what turns “it worked on the example” into “it’s safe to let this near real work.”

One weak step drags the whole run down

Example: a job with 10 tasks, where AI nails any one task 90% of the time.

AI does the whole job alone: about 35%.

With Tallyfy, one task at a time: 99%.

90% per task, 10 tasks in a row, is about 35%. A 10-step job done blind is worse than a coin flip.

The math behind the collapse

The reliability math shows why a check beats hope. A run with no verification compounds every miss down the chain; a step that gets inspected and can be retried holds near the top. Run the numbers above and the gap is stark. This is also why I keep AI inside a workflow that tracks each step in real time rather than in a loose script: the workflow is the outside watcher. It holds the status, the approval, and the record, so a tool that goes quiet or goes wrong is visible to something other than itself.
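The arithmetic is short enough to run. The sketch below makes two simplifying assumptions that real systems only approximate: each attempt succeeds independently with the same probability, and the check catches every failed attempt so the step can be retried. The retry limit of three is an illustrative choice.

```python
def chain_success(p_step, n_steps, attempts=1):
    """Probability that every step in a chain ends up correct.

    With attempts > 1, a step fails only if all of its attempts fail,
    which assumes a check that reliably spots each bad attempt.
    """
    step_ok = 1 - (1 - p_step) ** attempts
    return step_ok ** n_steps


blind = chain_success(0.9, 10)                # no check, no retry
checked = chain_success(0.9, 10, attempts=3)  # each step inspected, up to 3 tries

print(f"blind:   {blind:.1%}")
print(f"checked: {checked:.1%}")
```

Under those assumptions the blind chain comes out at roughly 35% and the checked chain at roughly 99%. A weaker check or correlated failures would pull the second number down, but the gap between compounding misses and catching them per step stays large.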

When I skip all of this on purpose

I’d be lying if I said I scaffold everything, because I don’t, and you shouldn’t either.

Half the tools I vibe-code are throwaways. A bit of glue I cobble together to munge a CSV I’m staring at right now. A scratch script that pulls a few numbers for something I’m writing, then goes in the trash that afternoon. A helper that only ever reads, so the worst it can do is be wrong on my own screen. None of those get a process, a vault, or an off-switch, because there’s nothing to protect: no real data, no spend, nobody downstream who gets hurt if it’s wrong. Adding ceremony to a tool like that is just slowing myself down to feel responsible.

The line is whether anything real is on the other side of a mistake. If a wrong answer costs money, touches someone else’s account, or can’t be undone, the tool has crossed from “personal hack” into “thing a business depends on,” and that’s when scaffolding stops being optional. Below that line, vibe-code with abandon. Above it, the bare tool is a quiet incident with a future date on it, and I’ve met enough of those to stop pretending otherwise.

So here’s where I landed after the broken ones. Vibe-code the logic, gladly, because that part really is nearly free now. Then hand the context, the credentials, and the edges to something built to hold them. The tools that lasted on my systems weren’t the ones with the cleverest code. They were the ones with an outside check that could tell them no. If you want a place to run AI steps that come with the watcher already attached, start free and put the first one inside a process that’s keeping score.
