Summary
- RAG is a one-shot pipeline, not an agent - it retrieves a passage, writes an answer, and forgets the whole exchange. There’s no second look, no plan, no way to notice it grabbed the wrong source. Calling that an agent is a category error that sets the wrong expectations from day one.
- Retrieval fails at a rate you can measure - Anthropic’s own research found a standard setup missed the right information 5.7 percent of the time, and even a heavily engineered version still missed about 1.9 percent, roughly one retrieval in fifty. A one-shot system can’t tell a thin retrieval from a good one.
- Bolting on autonomy makes it worse, not better - turning RAG into a free-running agent reintroduces the compounding-reliability collapse. More autonomy is not the same as more structure.
- A defined process is the real fix - wrap retrieval in explicit steps that verify, re-query, cross-reference, and escalate to a person when confidence is low. See how Tallyfy structures that.
Most “we deployed AI” announcements quietly depend on one claim, that the chatbot behind them is an agent, and it doesn’t hold. A retrieval-augmented generation system, the thing behind most internal chatbots, takes your question, fetches a few relevant chunks of text, and writes an answer from them. One pass. Then it forgets everything and waits for the next question. That’s a search-and-summarize pipeline wearing a conversational coat, and treating it as an autonomous agent is where the trouble starts.
An agent, in any meaningful sense, can look at a result, decide it isn’t good enough, and do something about it. RAG can’t. It retrieves once and generates once, and if the retrieval was thin or wrong, the answer is confidently wrong with nothing in the loop to catch it. That single gap sits at the center of what it takes to get reliable answers out of AI, and it’s why so many RAG deployments demo beautifully and then quietly mislead people in production. The model didn’t get dumber. It just never had a way to check its own homework.
A writeup by the developer laxmansharma, shared on Hacker News as “Why Your RAG Isn’t an Agent”, put the boundary plainly: “Linear workflows hit a dead end when faced with complex tasks requiring iteration, planning, or self-correction.” That’s the whole problem in one sentence. The honest move isn’t to pretend your pipeline is smarter than it is. It’s to wrap it in something that can iterate, plan, and correct, which turns out to be a process, not a personality.
What RAG actually does
Strip away the marketing and RAG is two steps glued together. Step one, retrieval: turn the question into a vector, search a database of document chunks, pull back the closest matches. Step two, generation: hand those chunks plus the question to a language model and let it write. The retrieved text is supposed to keep the model honest, grounding the answer in your actual documents instead of whatever it half-remembers from training.
When the right chunk lands in the model’s hands, this works well. The catch is everything riding on that “when.” Retrieval is a similarity guess, not a lookup, and similarity is not the same as relevance. Ask about a refund window and the database might hand back last year’s policy because the words overlap. The model has no idea the chunk is stale. It writes a fluent, plausible, wrong answer, and the reader has no reason to doubt it.
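To make the two steps and the stale-policy failure concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the bag-of-words `embed` function stands in for a real embedding model, `generate` is a placeholder for the language-model call, and the two policy texts are made up.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Toy stand-in for an embedding model: bag-of-words counts.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(question: str, chunks: list[dict], k: int = 1) -> list[dict]:
    # Step one: rank chunks by similarity to the question, keep the top k.
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c["text"])), reverse=True)[:k]

def generate(question: str, context: list[dict]) -> str:
    # Step two: placeholder for the model call. It only restates the context.
    return " ".join(c["text"] for c in context)

def one_shot_rag(question: str, chunks: list[dict]) -> str:
    # One pass: retrieve once, generate once, no check in between.
    return generate(question, retrieve(question, chunks))

# Made-up example data: the old policy shares more words with the question.
chunks = [
    {"id": "refund-policy-old", "text": "The refund window is 30 days from purchase."},
    {"id": "refund-policy-new", "text": "Customers can return items within 14 days of delivery."},
]
question = "What is the refund window?"
top = retrieve(question, chunks)[0]
answer = one_shot_rag(question, chunks)
print(top["id"], "->", answer)
```

The old policy wins on word overlap alone, and nothing in `one_shot_rag` is positioned to ask whether the winning chunk is still in force.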
Chunking makes this worse. Your documents get sliced into passages of a few hundred words so they fit the retrieval model, which means a policy and the exception that overrides it can land in separate chunks. Retrieve the first, miss the second, and the answer is right about the rule and silent about the carve-out that mattered. The system did its job by returning a relevant chunk, and it still produced a misleading answer, because relevance to the question is not the same as sufficiency for the task. Nobody in the loop is positioned to notice the gap.
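A small sketch of how a rule and its exception get separated. The `chunk_words` helper, the policy text, and the chunk size of 14 words are all assumptions chosen so the split falls between the two sentences; real chunkers use larger windows and often overlap them, but the boundary problem is the same.

```python
def chunk_words(text: str, size: int) -> list[str]:
    # Naive fixed-size chunking: split on whitespace, group every `size` words.
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

# Made-up policy: one sentence states the rule, the next one overrides it.
policy = (
    "Employees may carry over up to five unused vacation days into the next year. "
    "Exception: carry-over does not apply to employees on probation."
)

chunks = chunk_words(policy, size=14)
for index, chunk in enumerate(chunks):
    print(index, chunk)
```

A retriever that returns only the first chunk hands the model the rule without the carve-out, and the chunk still looks perfectly relevant.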
That’s the part the chatbot framing hides. You typed a question and got a paragraph back, so it feels like a conversation with something that understood you. What actually happened was a database query and a paragraph generator, run once, with no judgment about whether the query returned the right thing.
When retrieval misses, nothing notices
People assume the failure mode is the model hallucinating. More often the model is fine and the retrieval was wrong, and that distinction matters because you fix them in completely different places. Anthropic’s research on contextual retrieval is refreshingly blunt about the rates. A standard embeddings setup failed to surface the right information in its top 20 retrieved chunks 5.7 percent of the time. Stacking contextual embeddings, a keyword index, and a reranker on top cut that failure rate by 67 percent, down to 1.9 percent.
Sit with that second number for a second.
The most heavily engineered version money can buy still misses roughly one retrieval in fifty.
One in fifty sounds tiny until you run a few thousand queries a week through it. That’s dozens of confidently wrong answers, every week, each one shaped exactly like a right answer. How would you even spot them, when each one looks just like the answers that were right? And a one-shot RAG system has no mechanism to tell the difference, because checking would require a second step it doesn’t have. It retrieved, it generated, it’s done. The miss sails straight through to whoever asked, who then acts on it.
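The arithmetic behind those figures is short enough to check by hand. In this sketch the two miss rates are the ones quoted above; the weekly volume of 3,000 is an assumption standing in for “a few thousand,” and treating every retrieval miss as a wrong answer is a simplifying upper bound.

```python
baseline_miss = 0.057   # standard embeddings setup
improved_miss = 0.019   # contextual embeddings + keyword index + reranker

# Relative reduction in the failure rate.
reduction = 1 - improved_miss / baseline_miss

# "One retrieval in N" for the improved setup.
one_in = 1 / improved_miss

# Assumed weekly volume; not a figure from the research.
weekly_queries = 3000
expected_misses = weekly_queries * improved_miss

print(f"reduction: {reduction:.0%}")
print(f"one miss in about {one_in:.0f} retrievals")
print(f"expected misses per week: {expected_misses:.0f}")
```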
Walk it through with a benefits assistant, the kind of internal tool plenty of companies have built or are quietly building. An employee asks how much parental leave they get. The assistant retrieves a policy chunk, writes a clear, friendly answer, and the employee plans around it. What it retrieved was the old policy, superseded three months ago by a version sitting in a different document the search ranked lower. No error fired. The answer read perfectly. The employee made a real decision on stale information, and the mistake surfaces weeks later when HR contradicts the chatbot the company told everyone to trust.
That’s not a hallucination you can scold the model for. It retrieved what it found and summarized it faithfully. The missing job was checking whether what it found was the current truth.
This isn’t one vendor’s quirk, either. Researchers led by Scott Barnett catalogued seven distinct failure points across three production RAG systems in research, education, and biomedicine, and pointed out that RAG inherits the limits of the information-retrieval systems sitting underneath it. The model is downstream of every one of those, which is why blaming the model is usually aiming at the wrong layer entirely.
So no amount of fine-tuning closes this. You can spend months improving the embeddings and the rate gets better and never reaches zero. The gap isn’t a model-quality problem you can train away. It’s a structural one: a process with no verification step can’t verify.
Bolting on autonomy just moves the failure
The popular answer is to make RAG “agentic”: cobble together a framework that lets the model loop, call tools, re-query, and decide its own next move until it’s satisfied. The writeup that named the dead end pitches exactly this, RAG as one engine inside a self-directing agent that plans and self-corrects. It’s the natural instinct, and for genuinely open-ended research it has real merit.
That merit is real, so let’s be fair about it. If a human analyst would genuinely explore, follow a lead, change direction, and synthesize across a dozen documents, an autonomous loop is a reasonable shape for the problem, and the open-endedness is the feature, not the bug. The trouble is that almost nothing in day-to-day operations looks like that. Answering a benefits question, pulling a contract clause, summarizing a ticket history: these aren’t open-ended research, they’re bounded lookups with a right answer. Wrapping a bounded lookup in an open-ended agent is using a research tool to do a clerical job, and you inherit all the unpredictability while none of it buys you anything.
For most operations work, though, it trades one problem for a worse one. A free-running agent that decides its own steps is a chain of model calls, and chains multiply their failure rates. We ran the arithmetic in detail in why a 20-step agent fails most of the time: at 95 percent reliable per step, a twenty-step run lands right only about 36 percent of the time. Hand your shaky retrieval to an autonomous loop and you’ve stacked an unreliable retrieval on top of an unreliable controller, then asked the model to grade itself at every turn.
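The 36 percent figure is plain compounding. A minimal sketch, assuming every step fails independently and any single failure ruins the run:

```python
def chain_success(per_step: float, steps: int) -> float:
    # Probability that every step in the chain succeeds,
    # assuming independent steps with the same reliability.
    return per_step ** steps

for steps in (1, 5, 10, 20):
    print(f"{steps:>2} steps at 95% each: {chain_success(0.95, steps):.0%}")
```

Real agent runs are not perfectly independent, so treat this as a rough model, but the direction holds: every extra model-decided step multiplies in another chance to fail.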
And an autonomous loop has to decide one thing the one-shot version never even attempted: when to stop. Keep going and it burns time and money re-querying a question it already answered. Quit early and it ships the thin result it should have flagged. We dug into that failure mode on its own in when AI agents loop forever, and retrieval is a textbook trigger for it, because “did I find enough?” is exactly the fuzzy judgment a model will answer wrong with full confidence.
If a model can’t reliably tell whether an answer is good, why would it reliably tell whether it has looped enough?
And here’s the rub: “let the model decide when it’s done” assumes the model can tell when it’s done. That’s the same assumption that broke the one-shot version. Autonomy doesn’t add judgment the model lacks; it just gives the missing judgment more chances to go wrong, faster. What you actually want is the iteration the dead-end writeup correctly identified, minus the open-ended self-direction that makes it unpredictable. You want the loop bounded and owned by something other than the model.
Retrieval needs a process, not a personality
So picture the same retrieval, wrapped in a defined process instead of handed to a free agent. Retrieve the candidate passages. Run a verification step that checks whether they actually answer the question, with a confidence threshold. If the answer comes back thin, re-query with different terms. Cross-reference against a second source for anything high-stakes. And when confidence stays low, escalate to a person instead of guessing. Each of those is a discrete, named step with a clear input, output, and owner.
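Here is one way those steps could look as a bounded process rather than a free-running loop. This is a sketch under stated assumptions, not a description of any particular product: `run_retrieval_process`, the `Outcome` type, the 0.8 threshold, and the stub `retrieve`, `verify`, and `cross_check` functions are all hypothetical, and the leave figures are made up. The point to notice is that the list of queries, and therefore the number of attempts, is fixed by the process and not chosen by the model.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Outcome:
    status: str                       # "answered" or "escalated"
    passages: list[str]
    trail: list[str] = field(default_factory=list)

def run_retrieval_process(
    queries: list[str],                            # fixed, ordered re-query plan
    retrieve: Callable[[str], list[str]],
    verify: Callable[[str, list[str]], float],     # returns a confidence in [0, 1]
    cross_check: Callable[[list[str]], bool],      # second-source confirmation
    threshold: float = 0.8,
) -> Outcome:
    trail: list[str] = []
    passages: list[str] = []
    for attempt, query in enumerate(queries, start=1):
        passages = retrieve(query)
        confidence = verify(query, passages)
        trail.append(f"attempt {attempt}: query={query!r} confidence={confidence:.2f}")
        if confidence >= threshold:
            if cross_check(passages):
                trail.append("cross-check passed")
                return Outcome("answered", passages, trail)
            trail.append("cross-check failed")
    # Out of planned attempts: hand off to a person instead of guessing.
    trail.append("escalated to a person")
    return Outcome("escalated", passages, trail)

# Stub components with made-up data, for illustration only.
corpus = {
    "parental leave": ["Leave policy (superseded): 12 weeks."],
    "parental leave current policy": ["Leave policy (current): 16 weeks."],
}

def retrieve(query: str) -> list[str]:
    return corpus.get(query, [])

def verify(query: str, passages: list[str]) -> float:
    return 0.9 if passages and all("(current)" in p for p in passages) else 0.3

def cross_check(passages: list[str]) -> bool:
    return True

result = run_retrieval_process(
    ["parental leave", "parental leave current policy"], retrieve, verify, cross_check
)
print(result.status)
print("\n".join(result.trail))
```

The trail is a side effect of running named steps, which is what makes a wrong answer traceable later.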
That’s not a smarter model. It’s the evaluator-optimizer pattern, the one where a generation step and a checking step trade off until the result clears a bar, run as an explicit workflow rather than an improvised agent loop. The model still does the part it’s genuinely good at: reading a passage, judging relevance, drafting an answer. The process owns the part the model is bad at, which is deciding whether the work is finished and what to do when it isn’t.
The one piece of advice we’d hand anyone shipping retrieval into production: the verification step you skip to save a week is the one you’ll wish you had the first time a confident wrong answer reaches a customer.
To be clear about what Tallyfy is and isn’t here: we don’t do retrieval, we’re not a vector database, and we’d never claim to be. What a workflow platform does is hold the structure around the model, the verify step, the re-query branch, the human escalation, the automation rules and approval gates that decide what happens at each fork. The retrieval stays wherever it lives. The reliability comes from the process you put around it, and crucially, every one of those steps leaves an audit trail, so a wrong answer can be traced to the step that produced it instead of vanishing into a black box.
Run the benefits assistant back through that structure and watch where the old failure dies. The assistant retrieves the leave-policy chunks as before. Then a verification step checks each chunk against an effective date and a list of canonical sources, and the superseded document fails that check instead of getting summarized. Because the stale chunk didn’t clear the bar, a re-query fires with tighter terms and pulls the current policy. For a high-stakes number like leave entitlement, a cross-reference step confirms it against the official HR record rather than trusting a single retrieval. And if confidence stays low, the question routes to a person in HR with the candidate sources attached, instead of the employee getting a confident guess. Same model, same retrieval engine, completely different outcome, because four cheap checks now stand between “it returned something” and “we told the employee.”
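The effective-date check in that walkthrough can be very plain. A sketch, assuming a hypothetical registry (`CANONICAL_VERSIONS`) that records which version of each policy is in force; the `Chunk` type, the policy id, and the dates are invented for illustration.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Chunk:
    policy_id: str
    effective: date
    text: str

# Hypothetical canonical registry: policy id -> effective date of the version in force.
CANONICAL_VERSIONS = {"parental-leave": date(2024, 3, 1)}

def is_current(chunk: Chunk) -> bool:
    # A chunk passes only if its policy is registered and it carries
    # the effective date of the version currently in force.
    return CANONICAL_VERSIONS.get(chunk.policy_id) == chunk.effective

def verify(chunks: list[Chunk]) -> tuple[list[Chunk], bool]:
    # Returns the chunks that passed, and whether a re-query is needed.
    accepted = [c for c in chunks if is_current(c)]
    return accepted, not accepted

# First retrieval surfaces only the superseded document; the re-query finds the current one.
first_pass = [Chunk("parental-leave", date(2022, 1, 1), "Superseded leave policy text.")]
second_pass = [Chunk("parental-leave", date(2024, 3, 1), "Current leave policy text.")]

print(verify(first_pass))    # nothing accepted, re-query needed
print(verify(second_pass))   # current chunk accepted, no re-query
```

The check needs no model call at all, which is part of why it is cheap: it compares metadata against a record a person maintains.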
Where this leaves your RAG project
A mistake we made early on, building AI into our own tooling, was treating a retrieval step as done the moment it returned something. It returned a result, the result looked reasonable, we moved on. What caught the confident-but-wrong answers later wasn’t the model re-reading its own work, which was basically a coin toss. It was a separate step that compared the answer back against the source it claimed to use. The fix was never a better model. It was adding the check we’d skipped.
There’s a tell for whether you have this problem, and you can check it without touching code. Pull ten answers your RAG system gave last week and trace each one back to the document it drew from. If you can’t, the system isn’t keeping the link, which means nobody can audit a wrong answer after the fact. If you can but it takes an afternoon of detective work, the trail exists and the process doesn’t surface it. A wrapped retrieval records which source fed which answer at each step as a matter of course, so the audit takes seconds. That record is worth as much as the accuracy in any setting where someone might later ask why the answer was what it was.
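If you do want to see what “keeping the link” means in code, a sketch follows. The `AnswerRecord` and `AuditLog` names, the source ids, and the sample answers are hypothetical; a real system would persist these records rather than hold them in memory.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class AnswerRecord:
    question: str
    answer: str
    source_ids: list[str]
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

class AuditLog:
    def __init__(self) -> None:
        self._records: list[AnswerRecord] = []

    def record(self, question: str, answer: str, source_ids: list[str]) -> AnswerRecord:
        entry = AnswerRecord(question, answer, list(source_ids))
        self._records.append(entry)
        return entry

    def untraceable(self) -> list[AnswerRecord]:
        # Answers with no recorded source cannot be audited after the fact.
        return [r for r in self._records if not r.source_ids]

    def answers_from(self, source_id: str) -> list[AnswerRecord]:
        # Every answer that drew on a given document, e.g. one later found stale.
        return [r for r in self._records if source_id in r.source_ids]

log = AuditLog()
log.record("How much parental leave do I get?", "See the current leave policy.", ["leave-policy-v2"])
log.record("What is the refund window?", "30 days.", [])

print(len(log.untraceable()), "answer(s) with no source link")
print([r.question for r in log.answers_from("leave-policy-v2")])
```

With the link stored at write time, the ten-answer audit becomes a lookup, and finding every answer fed by a stale document is one query.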
That’s the shift in one line: stop asking your RAG system to be an agent, and start giving it a process. The retrieval can stay exactly as clever or as basic as it is today. What changes the outcome is whether a thin retrieval gets caught, whether a low-confidence answer gets a second pass, and whether the genuinely ambiguous question reaches a person instead of getting a fluent guess. That’s the difference between a single task and a whole job, and retrieval is very much a single task.
None of this is an argument against RAG. Retrieval is genuinely useful, and for a lot of low-stakes questions a one-shot answer is completely fine, because the cost of an occasional miss is that somebody re-asks. The argument is against deploying it unguarded for decisions that matter, then calling it an agent to paper over the gap. The moment a wrong answer costs real money, real compliance exposure, or real trust, the bare pipeline stops being good enough. And the honest fix isn’t a bigger model or a longer context window. It’s the unglamorous process work of deciding what gets checked, what gets a second pass, and what reaches a human.
Most “RAG isn’t working in production” complaints aren’t model complaints; they’re missing-process complaints. Before you swap the embeddings or chase a bigger context window, ask the cheaper question first. When this thing retrieves the wrong chunk, and it will, what in the system notices? If the answer is nothing, you don’t have an agent and you don’t have a reliable pipeline either. You have a confident guesser, and the way you make it trustworthy is the same way you make any unreliable step trustworthy: you wrap it in a process that checks.