Summary
- Too many tools is a precision problem - give an agent a long menu of similar-sounding actions and it routinely reaches for a near-neighbor: archive instead of delete, a broad search instead of the exact record lookup. The model isn’t broken; the menu is too crowded to choose from cleanly.
- The vendors say so themselves - Microsoft Research catalogued 1,470 MCP servers, named the failure “tool-space interference,” and noted that OpenAI recommends “fewer than 20 functions at any one time” for “higher accuracy.”
- This is an accuracy problem, not a security one - it’s distinct from stopping an agent calling a tool at all. Of the tools it’s allowed to use, which does it actually reach for?
- A workflow step narrows the menu for you - expose only the two or three tools a given step needs, and the near-synonym that caused the mistake isn’t even on the table.
Give an AI agent fifty tools and watch what happens the first time two of them sound alike. You ask it to clean up some old records, and it calls delete_record when the task was to archive them. You ask it to look up a customer, and it fires a broad web search instead of the database lookup that would have answered in one call. The model understood the request. It just picked the wrong instrument from a tray with too many that look the same.
This isn’t a rare edge case, and it isn’t fixed by a smarter model. It’s a direct function of how many tools you put in front of the agent and how similar they are to each other. The more options on the menu, the more ways there are to choose a plausible wrong one, and the model’s confidence stays high the entire time it’s reaching for the wrong handle.
Fifty tools, one wrong pick
Tool selection is a matching problem, and it sits at the core of how AI behaves when it is wired into your tools. The agent reads the names and descriptions of everything it can call, compares them against what it thinks the task needs, and picks. When the options are distinct, this is easy. When two tools overlap in meaning, the match gets fuzzy, and fuzzy matching on consequential actions is how you end up with a deleted record that should have been archived.
Microsoft Research catalogued 1,470 MCP servers and gave this failure a name: “tool-space interference,” which they define as situations where “otherwise reasonable tools or agents, when co-present, reduce end-to-end effectiveness.” Their examples include the obvious trap: a cluster of near-identical names like search, web_search, and bing_search sitting side by side, any of which the model might grab. Each tool is reasonable alone. Together they form a menu where the right choice and the wrong ones all look equally valid.
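One rough way to spot this kind of cluster before an agent ever sees it is to compare tool names pairwise. The sketch below is an illustration only, not Microsoft’s method: it assumes snake_case names, uses token overlap (Jaccard similarity) as a stand-in for “sounds alike,” and the 0.3 threshold and the send_email tool are made up for the example.

```python
from itertools import combinations


def name_tokens(tool_name: str) -> frozenset[str]:
    """Split a snake_case tool name into a set of lowercase tokens."""
    return frozenset(tool_name.lower().split("_"))


def jaccard(a: frozenset[str], b: frozenset[str]) -> float:
    """Share of tokens two names have in common (0.0 to 1.0)."""
    return len(a & b) / len(a | b)


def near_neighbors(tool_names, threshold=0.3):
    """Return (name, name, score) for every pair at or above the threshold.

    The threshold is an arbitrary illustrative value, not a researched one.
    """
    pairs = []
    for left, right in combinations(sorted(tool_names), 2):
        score = jaccard(name_tokens(left), name_tokens(right))
        if score >= threshold:
            pairs.append((left, right, round(score, 2)))
    return pairs


# The three search tools come from the example above; send_email is hypothetical.
catalog = ["search", "web_search", "bing_search", "send_email"]
flagged = near_neighbors(catalog)

for left, right, score in flagged:
    print(f"{left} <-> {right}: overlap {score}")
```

Name overlap is a crude proxy. Two tools can collide on meaning with no shared tokens (“hold” and “suspend,” say), so a check like this catches only the most obvious clusters.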
A support agent has tools for refunds, store credit, account holds, and cancellations. A customer asks to pause their billing. “Pause,” “hold,” “cancel,” and “suspend” are close enough in meaning that the agent can confidently cancel an account when the human wanted a two-week hold. Nobody wrote a bad description. There were just four doors that all looked like the right one.
And the cost lands on the customer, not the model. A cancelled account means a lost subscription, a re-signup flow, maybe a churned customer who never wanted to leave. The agent reported success because, from its side, the call worked fine: the tool returned a clean result. Nothing in the transcript says “I think I picked the wrong action.” You find out when the customer calls back angry, and by then the wrong tool already ran.
The reasoning wasn’t the failure. The failure was a tool tray that offered four near-synonyms and trusted the agent to pick the one the human actually meant.
More tools, less accuracy
Here’s the part that should change how you build. The vendors who sell these models tell you, in their own docs, that accuracy drops as the tool count climbs. Per Microsoft’s writeup, OpenAI caps developers at 128 tools and its documentation recommends going nowhere near that: “Keep the number of functions small for higher accuracy” and “Aim for fewer than 20 functions at any one time.” That’s the model’s own maker saying the menu length is a precision dial, not a free upgrade.
The academic work lines up with the vendor advice. Varatheepan Paramanayakam and colleagues found that selectively cutting the number of tools available to a model significantly improves its function-calling, the exact opposite of the give-it-everything instinct. And Ruocheng Guo’s team put a finger on why: tool descriptions get written for human developers and “tolerate ambiguity that agents cannot resolve, particularly as the number of candidate tools grows.” Every near-synonym you add is one more way for the menu to be misread.
So why do so many agent setups hand the model dozens of tools at once? Because it’s easier to expose everything than to decide what each task needs. Connect a few MCP servers, each bundling its own twenty or thirty tools, and you’ve quietly handed the agent a hundred-item menu for a job that uses three of them.
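The arithmetic is easy to check for your own setup. Here is a minimal sketch, assuming you can list each connected server’s tool names; the server names, tool names, and counts are placeholders, and the only figure taken from the text is the “fewer than 20” guidance.

```python
# "Aim for fewer than 20 functions at any one time" (OpenAI guidance quoted above).
RECOMMENDED_MAX = 20


def build_server(prefix: str, count: int) -> list[str]:
    """Generate placeholder tool names for a hypothetical MCP server."""
    return [f"{prefix}.tool_{i}" for i in range(count)]


def menu_report(servers: dict[str, list[str]], needed: set[str]) -> dict:
    """Compare what the agent can see against what the task actually uses."""
    visible = {tool for tools in servers.values() for tool in tools}
    return {
        "visible": len(visible),
        "needed": len(needed & visible),
        "unused": len(visible - needed),
        "within_guidance": len(visible) < RECOMMENDED_MAX,
    }


# Four hypothetical servers, each bundling twenty to thirty tools.
servers = {
    "crm": build_server("crm", 30),
    "billing": build_server("billing", 25),
    "docs": build_server("docs", 25),
    "calendar": build_server("calendar", 20),
}
# The three tools this particular job uses (also hypothetical).
needed = {"billing.tool_0", "billing.tool_1", "crm.tool_0"}

report = menu_report(servers, needed)
scoped_report = menu_report({"this_step": sorted(needed)}, needed)

print(report)
print(scoped_report)
```

With these made-up numbers the agent sees 100 tools for a job that uses 3, while the scoped version sees exactly the 3 it needs.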
This is where plug-and-play tooling turns into a quiet tax. Each MCP server you connect feels free: it’s one line of config, and suddenly the agent can do more. But the model pays the bill on every decision, because it now reads and weighs that whole expanded menu each time it picks a tool. Microsoft’s name for the broader pattern, tools that individually make sense but collectively drag down performance, is tool-space interference, and cobbling servers together indiscriminately is the fastest way to manufacture it. More capability on paper, less reliability in practice, and the trade stays invisible until the wrong-tool calls start showing up in your logs.
The accuracy cost is invisible in a demo and obvious in production. A demo asks one clean question against a tidy toolset and the agent picks right. Production runs messy requests against a bloated menu all day, and the wrong-tool rate that looked like zero becomes a steady drip of archived-instead-of-deleted, cancelled-instead-of-held. Nothing about the model changed between the demo and the rollout. The menu got longer.
Fewer tools, fewer ways to be wrong.
This is not the security problem
It’s worth drawing a hard line here, because this gets confused constantly. Whether an agent is allowed to call a destructive tool at all is a security and permissions question, and we’ve covered it separately in binding agents to a workflow instead of letting them free-roam and in the two layers of MCP authorization. That’s about the door: who can call what, with which credentials, audited how.
Wrong-tool selection is a different question that sits one layer in.
Of the tools the agent is fully authorized to use, which one does it actually reach for?
You can pass every security check, every permission gate, every audit requirement, and still archive the thing you meant to delete, because the agent picked a tool it was completely entitled to call. It just wasn’t the right one for the task.
Missing this distinction sends teams down the wrong road. They answer a wrong-tool incident by tightening permissions, adding approval gates, locking down credentials, and the destructive mistakes keep happening, because the agent had every right to do what it did. Permissions answer “is this allowed.” They say nothing about “is this the right choice among the allowed options,” and that second question is where the accuracy actually lives.
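To make the two questions concrete, here is a small sketch that reviews a log of tool calls against both. Everything in it is hypothetical: the tool names, the log entries, and the “intended” field, which stands in for a label a human reviewer would add after the fact.

```python
# Tools the support agent is fully permitted to call (hypothetical names).
AUTHORIZED = {
    "issue_refund",
    "apply_store_credit",
    "place_account_hold",
    "cancel_account",
}


def is_authorized(tool_name: str) -> bool:
    """The permissions question: is this call allowed at all?"""
    return tool_name in AUTHORIZED


def review(calls: list[dict]) -> list[dict]:
    """Answer both questions for each logged call."""
    return [
        {
            "request": call["request"],
            "authorized": is_authorized(call["called"]),
            # The accuracy question: was it the right choice among allowed tools?
            "correct": call["called"] == call["intended"],
        }
        for call in calls
    ]


call_log = [
    {
        "request": "pause my billing for two weeks",
        "intended": "place_account_hold",
        "called": "cancel_account",
    },
    {
        "request": "refund my last order",
        "intended": "issue_refund",
        "called": "issue_refund",
    },
]

results = review(call_log)
for row in results:
    print(row)
```

The first call passes the permission check and is still wrong. Tightening the AUTHORIZED set would not have caught it unless cancel_account were removed outright, which is a scoping decision rather than a permissions one.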
The two problems rhyme, though, and the security crowd already found the shape of the answer. Simon Willison, writing up a paper on securing agents against prompt injection, landed on the guiding principle that once an agent “has ingested untrusted input, it must be constrained so that it is impossible for that input to trigger any consequential actions.” That’s a safety argument, but the mechanism, narrow what consequential actions are reachable at any moment, is exactly what fixes the accuracy problem too. Constrain the menu and you get fewer security surprises and fewer wrong picks, from the same move.
A workflow step hands over three tools, not fifty
So here’s the fix, and it’s almost boring. Instead of giving the agent every tool up front and hoping it chooses well across the whole job, you scope the tools to the step. At the “process this refund” step, the agent sees the refund tool and maybe a lookup. It does not see delete, cancel, archive, or the other forty things that have nothing to do with refunds. The crowded menu that caused the wrong pick simply isn’t presented.
That’s what a workflow gives you for free. A defined process already breaks a job into named steps, and each step already knows what it’s for, so scoping its tools to that purpose is the natural next move rather than a bolt-on guardrail. The agent’s choice at any moment collapses from “the right one of fifty” to “the right one of three,” which is squarely inside the accuracy zone OpenAI’s own docs point at. Our MCP server exposes 100+ tools across the whole platform, but a single workflow step never dumps all of them on the model. The step decides which slice the agent gets to see, the same way automation rules and approval gates decide what happens next.
Take employee onboarding. The “create accounts” step gives the agent the provisioning tools and nothing destructive. The “collect tax forms” step hands over a document-request tool and a validation tool, not the account tools from the step before. A later “schedule first-week training” step offers only a calendar tool. At no point does the agent see all of onboarding’s tools at once, so it can’t reach across steps and grab the wrong one, because the wrong ones aren’t in the room. The process did the narrowing the model couldn’t reliably do for itself, and it did it almost by accident, just by being a defined sequence of steps that each know their job.
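The onboarding example can be sketched as a plain mapping from step to tools. This is an illustration under stated assumptions, not Tallyfy’s API: the Step class, the tool names, and the dispatch function are all invented here to show the shape of the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """A named workflow step and the only tools the agent sees during it."""
    name: str
    tools: tuple[str, ...]


# Step names follow the onboarding example; tool names are hypothetical.
ONBOARDING = (
    Step("create accounts", ("provision_email_account", "provision_sso_account")),
    Step("collect tax forms", ("request_document", "validate_document")),
    Step("schedule first-week training", ("create_calendar_event",)),
)


def tools_for(step_name: str, workflow: tuple[Step, ...] = ONBOARDING) -> tuple[str, ...]:
    """Return the tool menu to present to the model for one step."""
    for step in workflow:
        if step.name == step_name:
            return step.tools
    raise KeyError(f"unknown step: {step_name}")


def dispatch(step_name: str, tool_name: str) -> str:
    """Run a tool only if the current step offers it.

    A backstop for the case where the model names a tool it was never shown.
    """
    if tool_name not in tools_for(step_name):
        raise LookupError(f"{tool_name} is not offered at step '{step_name}'")
    return f"{tool_name} ran at step '{step_name}'"


print(tools_for("collect tax forms"))
print(dispatch("schedule first-week training", "create_calendar_event"))
```

The part doing the work is tools_for: whatever it returns is the whole menu the model chooses from at that step. The dispatch check is secondary, since a tool that was never listed is unlikely to be called in the first place.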
And notice this isn’t a guardrails product or a new layer of policy to maintain. It’s a side effect of designing the work as a process in the first place. You scoped the tools because you defined the step, not because you bought a tool-restriction feature. That’s the same reason the established agent patterns all externalize structure instead of trusting the model to hold it, and the same reason an AI agent needs a workflow engine underneath it.
Scope the tools, not the model
What caught us off guard, watching agents work against real tool sets, was how cheerfully a model reaches for a near-neighbor and never signals doubt. There’s no hesitation in the output, no “I’m only 60 percent sure delete is right here.” It just calls the tool and moves on, which means you can’t catch the wrong pick by reading the agent’s confidence. You catch it by not offering the wrong tool in the first place.
“Won’t that limit what the agent can do?” is the reasonable objection, and the answer is no. It limits what the agent can do at any single step, which is the entire point. Across the whole process the agent still touches every tool it needs, just never all at once. A surgeon has a full tray in the room and a focused set on the table for the incision in front of them. Nobody calls that a limitation. It’s how you avoid reaching for the wrong instrument mid-procedure, and an agent earns the same benefit from the same discipline.
The thing is, this is about the easiest reliability win available. You don’t have to retrain anything, evaluate anything, or wait for a better model. You just stop handing the agent a fifty-item menu for a three-item task. Define the steps, scope each step’s tools to its job, and the most common wrong-tool mistakes become impossible rather than merely unlikely.
And it compounds in your favor as the system grows. Add a new step later and you scope its tools to its job, the same as the rest. The agent never accumulates a sprawling all-access toolset, because no single step ever needed one. Compare that to the bolt-on approach, where every new capability widens the menu the model has to wrangle on every call, and reliability quietly erodes as the product gets more capable. Scoping by step means the thing gets more useful without getting harder to trust.
So before you wire an agent into a pile of MCP servers and turn it loose, count the tools it can see at the moment it has to choose. If that number is in the dozens, you’ve built the wrong-pick problem in by design. Narrow it to the few each step genuinely needs, and you’ve done more for reliability than any model upgrade will. That’s the unglamorous truth here: an agent does better with a shorter menu, the same as the rest of us.