Summary
- Pixel-watching agents are a dead end - OpenAI’s Operator and Anthropic’s Computer Use drive a browser by reading screenshots and moving the cursor, a capability Anthropic still labels a beta feature. It is slow, spends tokens on every screen, and breaks the day you ship a redesign.
- DOM-reading agents are cleaner but not the cure - open-source projects like browser-use (around 98,000 GitHub stars) read the page structure instead of the pixels. That removes a lot of guesswork and then exposes the real bottleneck sitting underneath.
- The actual failure is a missing process - an agent can book a flight, but booked is not the same as approved, expensed, calendared, and disclosed. Nothing told it what counts as done.
- Cleaner tools make the workflow matter more, not less - the agent needs a defined process to plug into. Map one in Tallyfy.
Browser agents got a loud year. OpenAI shipped Operator, Anthropic shipped Computer Use, and the open-source browser-use project crossed roughly 98,000 GitHub stars under a tagline that promised to “make websites accessible for AI agents.” All of them sold the same dream: an AI that works your software the way a person does. Point it at a site, let it click around, walk away.
A year of real deployments in, the dream looks thin. Most of these agents still don’t hold up in production, and the reason isn’t the one people reach for first. It isn’t that the models are too dumb. The newest ones are sharp. Turns out the failure sits one layer down, in a place no model upgrade reaches: nobody designed the process that the agent is supposed to be running.
That’s the contrarian read, and it cuts against a year of “the next model will fix it” optimism. It won’t, because the missing piece was never intelligence. This is one slice of where AI actually lands in day-to-day operations, and browser agents are the clearest example of a smart model failing at a job that was never written down for a machine to follow.
How browser agents work today
There are two ways an agent can drive a website, and the difference explains most of the pain. The first wave reads the screen. Anthropic’s Computer Use is the clearest documented example: it gives the model “screenshot capabilities and mouse/keyboard control,” so the loop is literally take a screenshot, decide where to click, click, take another screenshot. OpenAI’s Operator works on the same idea. Anthropic ships Computer Use as a beta feature, which is a polite way of saying it makes mistakes.
This pixel approach is honest about one thing: it works on any site, because every site renders to pixels. It’s also slow and expensive, since each step burns tokens describing what’s on screen, and it shatters the moment your design team moves a button. The agent that memorized your old layout is now clicking empty space.
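To make the loop concrete, here is a toy sketch of it. Everything in it is a stand-in: the three-page flow, the `take_screenshot`, `decide`, and `click` functions, and the coordinates are all hypothetical, and none of it is Anthropic’s or OpenAI’s actual API. The only point is the shape: one screenshot and one decision per step, so cost grows with every screen the agent passes through.

```python
# Toy model of a screenshot-driven loop. All names here are hypothetical
# stand-ins; a real agent would capture an image and send it to a model.
PAGES = ["login", "form", "confirmation"]
state = {"page": 0}

def take_screenshot():
    # Stand-in for capturing the screen: returns the current page name.
    return PAGES[state["page"]]

def decide(image):
    # Stand-in for the model: stop on the confirmation page,
    # otherwise return a made-up (x, y) target to click.
    return None if image == "confirmation" else (100, 200)

def click(x, y):
    # Stand-in for the mouse: every click advances to the next page.
    state["page"] += 1

def run_screen_loop(max_steps=10):
    screenshots = 0
    for _ in range(max_steps):
        image = take_screenshot()   # one screenshot per step
        screenshots += 1
        target = decide(image)      # one model decision per step
        if target is None:
            break
        click(*target)
    return screenshots

screenshots_taken = run_screen_loop()
print(screenshots_taken)
```

Notice what ends the loop in this sketch: the agent sees a confirmation screen and stops. Nothing in the loop asks whether the job behind that screen is really finished.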
The second wave skips the pixels and reads the page structure instead. That’s what browser-use does when it surfaces the “clickable elements” on a page, handing the model a list of real elements instead of a picture to squint at. Cleaner, faster, far less fragile.
It still isn’t ready, though, and that’s the part the hype skips over. A real application’s page structure is enormous and messy, stuffed with wrappers, tracking tags, and half-loaded widgets that have nothing to do with the task. Single-page apps redraw themselves constantly, so the element the agent grabbed a second ago might be gone now. And the structure only ever says what an element is, never what clicking it means or whether it moves the job forward.
Pixel parsing is dead.
Structured reading is better and still brittle, because it swaps a fragile picture for a noisy map. Both approaches leave the same hole open.
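A small sketch of what that noisy map looks like, using only Python’s standard `html.parser`. This is not how browser-use works internally; the page markup, the `ClickableCollector` class, and the rule for what counts as clickable are assumptions made up for illustration.

```python
from html.parser import HTMLParser

# Assumed rule of thumb for "clickable"; real tools use richer heuristics.
CLICKABLE_TAGS = {"a", "button", "input", "select", "textarea"}

class ClickableCollector(HTMLParser):
    """Counts every start tag and keeps the ones that look clickable."""

    def __init__(self):
        super().__init__()
        self.total_tags = 0
        self.clickable = []

    def handle_starttag(self, tag, attrs):
        self.total_tags += 1
        attr = dict(attrs)
        if tag in CLICKABLE_TAGS or attr.get("role") == "button" or "onclick" in attr:
            self.clickable.append((tag, attr.get("id", "")))

# Hypothetical page: wrappers, a tracker, a half-loaded widget, one small form.
PAGE = """
<div class="wrap"><div class="inner">
  <script src="tracker.js"></script>
  <div class="widget" data-loading="true"></div>
  <form>
    <input id="amount" type="text">
    <button id="submit">Submit</button>
    <button id="cancel">Cancel</button>
  </form>
</div></div>
"""

collector = ClickableCollector()
collector.feed(PAGE)
print(collector.total_tags, "tags,", len(collector.clickable), "clickable")
for tag, element_id in collector.clickable:
    print(tag, element_id)
```

The filtered list is shorter than the full page, which is the win. It still only says that `submit` is a button. It does not say what submitting means for the job, or whether now is the right moment to press it.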
Clicking a button isn’t finishing the job
Here’s the wall every one of these agents hits, pixels or no pixels. An agent can click the button. What it can’t reliably tell is whether the button did what you actually wanted.
Watch the early Operator reactions and you see this exact frustration. On the Hacker News thread from one of its first users, the running question was whether an agent like this is anywhere close to a daily driver, and one commenter described it confidently searching the web and trusting the first listicle it landed on. The agent did the action. It had no idea the action was wrong. That’s not a perception bug you fix with a better screenshot reader. It’s a judgment that lives outside the click.
Take a concrete case. You ask an agent to submit an expense report. It opens the portal, fills the fields, hits submit, and reports success. Did the report route to the right approver? Was the required receipt attached? Did it land in a category that won’t bounce back from finance next week?
The agent saw a confirmation page and called it a win. Whether the work is actually done is a different question, and the page can’t answer it.
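Those three questions can be written down as explicit checks that run after the click. The sketch below is a minimal illustration, not a real expense system: the `SubmittedReport` fields, the approver name, and the category list are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class SubmittedReport:
    # Hypothetical fields, read back from the system after the agent submits.
    confirmation_seen: bool
    approver: str
    receipt_attached: bool
    category: str

def unmet_conditions(report, expected_approver, allowed_categories):
    """Return the reasons this report is not actually done (empty means done)."""
    problems = []
    if report.approver != expected_approver:
        problems.append("wrong approver")
    if not report.receipt_attached:
        problems.append("receipt missing")
    if report.category not in allowed_categories:
        problems.append("category not allowed")
    return problems

report = SubmittedReport(
    confirmation_seen=True, approver="dana", receipt_attached=False, category="travel"
)
problems = unmet_conditions(report, "dana", {"travel", "meals"})
print(report.confirmation_seen, problems)
```

In this example the confirmation page appeared and the report is still not done, because the receipt never made it. The definition of done lives in the checks, not on the last screen.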
Multiply that by every step in a real job and the cracks compound fast. Each action an agent takes is a quiet bet that the last one landed right, and a chain of bets with no checkpoint between them is how a single wrong turn at step two poisons everything after it. The agent isn’t lying when it reports success. It genuinely can’t tell the difference between a task that’s finished and a task that merely looks finished from the last screen it saw.
We’ve hit a sharp version of this in our own work. An agent we built once reported a clean, confident count that was wrong by a wide margin, because it had quietly truncated the data it was reading and had no way to notice. The gap wasn’t intelligence. It was the absence of a step that checks the result.
A person carries that judgment in their head, built from a hundred small corrections. An agent doesn’t, unless something tells it.
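The truncation case above is the kind of thing a single checking step catches. A minimal sketch, assuming the data source reports its own total so there is something to compare against; the function name and the numbers are invented for illustration and are not the figures from our incident.

```python
def checked_count(rows_read, reported_total):
    """Return the row count only if it matches what the source says it holds."""
    if len(rows_read) != reported_total:
        raise ValueError(
            f"read {len(rows_read)} rows but source reports {reported_total}"
        )
    return len(rows_read)

# Hypothetical numbers: the agent read 1,000 rows, the source holds 4,812.
rows = list(range(1000))
try:
    outcome = checked_count(rows, reported_total=4812)
except ValueError as exc:
    outcome = str(exc)
print(outcome)
```

Without the check, the agent reports 1,000 with full confidence. With it, the run stops and says why.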
Why cleaner tools don’t fix the real problem
The natural hope is that structured access saves us. Skip the screen-watching, give the agent declared tools to call, and the brittleness goes away. Half right. Structured tools, whether a WebMCP surface in the browser or a server an agent connects to, do kill the pixel-guessing problem. Calling a clean submitExpense tool beats hunting for the submit button every time.
But that’s the easy bug. Give the agent a perfect set of tools and a basic job still falls apart, because a real task is almost never one call. Booking a trip means find the flight, check it against the travel policy, get a manager’s sign-off if it’s over the limit, expense it, put it on the calendar, and tell the team. “Booked the flight” is one of six steps, and the other five are where the value and the risk both live. A pile of clean tools says nothing about the order they run in, who approves the expensive one, or what happens when step four fails and the agent has already done steps one through three.
Something we had to unlearn while wiring AI into our own product is that a cleaner tool surface feels like progress on the whole problem when it’s only progress on the first inch. The coordination, the sequencing, the recovery: none of that got easier. It just got harder to ignore, because the agent now fails on the part you can’t blame on a bad screenshot.
You can watch this play out in slow motion. The team swaps the screen-reader for a clean set of tools, the demo gets snappier, everyone exhales. Then the first real multi-step job runs unattended and comes back half-done, with no record of where it stopped or why, and the same worried meeting happens all over again. The tools were never what stood between a flashy demo and a deployment you can trust.
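Here is a minimal sketch of the piece that is missing in that half-done run: an ordered list of steps and a record of where the run stopped and why. The step names follow the trip example above; the `RunRecord` and `run_in_order` names and the simulated failure are assumptions for illustration, and a real workflow engine would do far more (retries, compensation, persistence).

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class RunRecord:
    completed: list = field(default_factory=list)
    failed_at: Optional[str] = None
    error: Optional[str] = None

def run_in_order(steps):
    """Run (name, action) pairs in sequence; stop and record the first failure."""
    record = RunRecord()
    for name, action in steps:
        try:
            action()
        except Exception as exc:
            record.failed_at = name
            record.error = str(exc)
            break
        record.completed.append(name)
    return record

def ok():
    pass  # stand-in for a tool call that succeeds

def expense_system_down():
    raise RuntimeError("expense system unavailable")  # simulated failure

TRIP = [
    ("find flight", ok),
    ("check travel policy", ok),
    ("manager sign-off", ok),
    ("expense it", expense_system_down),
    ("add to calendar", ok),
    ("tell the team", ok),
]

record = run_in_order(TRIP)
print(record.completed, record.failed_at, record.error)
```

The tools are the same either way. What changes is that step four failing now leaves a readable trail: three steps done, one failed, two never attempted.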
What a defined process gives a tool-calling agent
So what’s the missing layer? A workflow. Not the buzzword version, the literal one: a defined sequence of steps, each with an owner, an input, and a rule for what comes next. Drop a capable agent into one step of that and it stops having to improvise the coordination it’s worst at. The order is already decided. The approval gate is a real step instead of a line in a prompt the model is free to skip. And the audit trail writes itself.
Run the expense example through that frame and it stops being scary. The agent reads the receipt and drafts the entry at step one. The policy check is step two, with a hard rule, not a vibe. Anything over the threshold routes to a human at step three, which is a sign-off that blocks the next step rather than a suggestion. Only then does the report submit, and every action lands in a tracked record you can read back later. Same agent, same model, completely different level of trust, because the process is carrying the judgment the agent doesn’t have.
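The expense flow above can be sketched as a tiny state machine. This is an illustration under stated assumptions, not Tallyfy’s implementation: the threshold value, the status names, and the `ExpenseRun` and `advance` names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional

APPROVAL_THRESHOLD = 500.00  # hypothetical policy limit

@dataclass
class ExpenseRun:
    amount: float
    approved_by: Optional[str] = None
    status: str = "drafted"  # step one (agent drafts the entry) is already done
    audit: list = field(default_factory=list)

def advance(run):
    """Move the run forward by one step, or hold it at the approval gate."""
    if run.status == "drafted":
        # Step two: the policy check is a hard rule, not a judgment call.
        over = run.amount > APPROVAL_THRESHOLD
        run.status = "needs_approval" if over else "cleared"
        run.audit.append(f"policy check: {run.status}")
    elif run.status == "needs_approval":
        # Step three: the gate blocks until a named human signs off.
        if run.approved_by is None:
            run.audit.append("blocked: waiting for sign-off")
        else:
            run.status = "cleared"
            run.audit.append(f"approved by {run.approved_by}")
    elif run.status == "cleared":
        # Only now does the report submit.
        run.status = "submitted"
        run.audit.append("submitted")
    return run.status

run = ExpenseRun(amount=820.00)
advance(run)                 # policy check routes it to approval
blocked = advance(run)       # no sign-off yet, so it stays put
run.approved_by = "maria"
advance(run)                 # gate opens
final = advance(run)         # submit
print(blocked, final, run.audit)
```

The agent can call `advance` as often as it likes. Without a sign-off the run does not move, and every attempt is in the audit list.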
There’s a reason the order carries so much weight. The patterns that actually hold up for AI agents are the unglamorous structural ones: a step runs, its output feeds the next step, a gate halts the chain when a person needs to look. None of that is the agent’s strong suit. The agent is good at the bounded judgment inside a single step, reading the receipt, sorting the request, drafting the reply. Hand it the entire job and it has to cobble together the scaffolding too, and that improvised scaffolding is exactly where it comes apart.
This is why workflow infrastructure gets more important as the tool surface gets cleaner, not less. Point a sharp model at a vague job and it will improvise, and improvising near payroll or customer records is the exact thing you don’t want. We’ve made the longer case for why an AI agent needs a workflow engine underneath it, and why it’s safer to bind an agent to a defined process than to set it loose. Browser agents just make the lesson impossible to dodge. The cleaner the tools get, the more tempting it is to skip the process, and the more expensive that skip becomes.
When an agent reaches your “submit the order” step, does a real process catch it, or does it just fire and hope?
Where to point your attention first
If you’re an operations leader watching this space, the move isn’t to wait for a browser agent good enough to trust with your systems. That agent isn’t coming, at least not in the shape people imagine, because the thing holding it back was never on the agent’s side of the line.
Start on your side. Pick the one customer-facing or back-office task you’d most want an agent to take over, and write it down as a real sequence: every step, every owner, every point where something irreversible happens. Mark the steps a human has to clear. That document is worth more than any agent demo, because it’s the thing an agent can actually run safely once it exists, and the thing whose absence guarantees the agent fails no matter how good it gets. The whole story of keeping a human in the loop on the steps that matter starts here, with a process on paper before a model touches it.
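If it helps to see that document as data, here is one possible shape. It is a sketch with made-up step names and owners, and the `Step` fields and the `unguarded_irreversible` check are assumptions, not a prescribed format. The check flags any irreversible step that no human has cleared the way for.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Step:
    name: str
    owner: str
    irreversible: bool = False   # something that cannot be undone happens here
    human_clears: bool = False   # a person has to clear this step

# Hypothetical process, written down before any agent runs it.
PROCESS = [
    Step("draft entry from receipt", owner="agent"),
    Step("check against policy", owner="agent"),
    Step("approve over-limit spend", owner="manager", human_clears=True),
    Step("submit report", owner="agent", irreversible=True),
]

def unguarded_irreversible(process):
    """Names of irreversible steps with no human-cleared step at or before them."""
    cleared = False
    flagged = []
    for step in process:
        if step.irreversible and not cleared and not step.human_clears:
            flagged.append(step.name)
        cleared = cleared or step.human_clears
    return flagged

print(unguarded_irreversible(PROCESS))
without_approval = [s for s in PROCESS if not s.human_clears]
print(unguarded_irreversible(without_approval))
```

Drop the approval step and the check names the step that would now fire unguarded, which is the question worth answering before an agent is anywhere near it.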
Here’s the quiet upside of working in that order. The same map that makes an agent safe to deploy also makes the work better the day before any agent arrives, because a process you’ve actually written down is one you can see, fix, and hand to a new hire. You’re not building scaffolding as a favor to the robots. You’re fixing the operation, and the agent becomes the first thing that can finally run on top of it without tripping. AI rewards a clear process and punishes a fuzzy one, which is the whole reason the boring part comes first.
The agents will keep improving. Operator will get faster, Computer Use will leave beta, the DOM readers will get sharper. None of that writes your process for you. That part is still yours, and it’s the part worth doing first.