Summary
- Legal AI got better and still isn’t safe to file unread - Stanford’s 2024 benchmark found purpose-built tools like Lexis+ AI wrong more than 17% of the time and Westlaw’s AI-Assisted Research more than 34%, even though both cut errors sharply against a general chatbot.
- What separates the teams who succeed with it? Not a sharper model. A verification step that checks every machine-generated citation against the actual case before anything reaches a filing queue.
- The duty is already on the books - the ABA’s competence rule asks lawyers to understand the risks of the technology they use, which makes a skipped check a professional-conduct problem, the kind a bar takes seriously.
- Verification is a workflow you build - put the source next to the claim, route failures back, record the sign-off. See how Tallyfy structures the review step.
The hallucination crisis didn’t kill legal AI. It sorted the people using it into two groups.
One group bolted a chatbot onto their filings, trusted what came back, and a painful share of them ended up named in a sanctions order. The other group kept pointing AI at the same drafting and research work, pulled real value out of it, and stayed clear of trouble, because they wedged a step between the model’s output and the courthouse door. That step is the entire difference. The liability side of this, who answers when the fake citation gets filed, is its own subject, and the courts have been blunt about it. This post is about the quieter half: what the teams who use legal AI well do differently, and why what they’re doing is a verification workflow rather than a hunt for a model that finally stops making things up. It’s a small idea with a long reach, and it sits near the center of how AI is reshaping who does the work and who has to check it.
Start with the base rate
Here’s the number to anchor everything else to. In 2024, researchers at Stanford’s RegLab and HAI, among them Varun Magesh, Faiz Surani, and Matthew Dahl, ran the legal AI tools that vendors had marketed as hallucination-free against a real benchmark. They weren’t hallucination-free. Lexis+ AI and Thomson Reuters’s Ask Practical Law AI, both built on curated legal databases rather than a raw chatbot, produced incorrect information more than 17% of the time, and Westlaw’s AI-Assisted Research did it more than 34%. Stanford HAI’s write-up of the study put it in the headline without hedging: legal models hallucinate in one out of six benchmarking queries, or more. And the part the dread-pieces skipped is the part that matters most for planning, because those same tools cut errors hard against general-purpose chatbots, which the same group’s earlier study found fabricating on 58% to 82% of legal queries.
So the purpose-built tools are a genuine improvement. They’re just not an improvement you can file without reading.
Sit with what a one-in-six error rate does to a real practice. File ten briefs a month on raw output and you’re shipping unverified law at a clip no partner would sign off on if they saw it written down. The trouble is you can’t see it written down, because the wrong sixth isn’t flagged. It looks exactly like the right five.
A fabricated case reads with the same confident syntax as a real one, which is the whole reason these tools fooled experienced lawyers in the first place.
So how many unread filings does it take before one of them is the fabricated sixth? On a busy desk, not many. Nobody named in those sanctions orders set out to file a fake case. They trusted a tool that’s right most of the time and skipped the step that catches the rest, and being right most of the time is exactly the trait that talks a careful person out of looking.
You don’t know which sixth is fake until you check.
That’s a design fact about how the tool works, and no patch erases it. The rate does drift down as models improve, and the Stanford gap between bespoke and general tools shows it can drop a lot. But “a lot better than 82%” still lands at a wrong answer every few queries, scattered unpredictably across your filings. This is what makes verification arithmetic instead of nerves, and it reframes a broader question every operations leader is sitting with, which is how you trust any AI tool before you lean on it.
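To put rough numbers on that arithmetic, here is a minimal sketch. It assumes something the Stanford study does not claim: that each unchecked AI answer is wrong independently with the same probability. Real errors probably cluster by query type, so treat the output as an illustration and not a forecast. The function names are made up for this example.

```python
import math

def p_at_least_one_error(p: float, n: int) -> float:
    """Chance that at least one of n unchecked items is wrong,
    ASSUMING each is wrong independently with probability p."""
    return 1.0 - (1.0 - p) ** n

def items_until_likely_error(p: float, threshold: float = 0.5) -> int:
    """Smallest n at which the chance of at least one error reaches threshold."""
    return math.ceil(math.log(1.0 - threshold) / math.log(1.0 - p))

# Illustrative rates: the 17% and 34% floors quoted from the benchmark.
for rate in (0.17, 0.34):
    n_even_odds = items_until_likely_error(rate)
    chance_in_ten = p_at_least_one_error(rate, 10)
    print(f"rate={rate:.0%}  even odds after {n_even_odds} items  "
          f"chance of an error in 10 items={chance_in_ten:.0%}")
```

Under that independence assumption, a 17% rate crosses even odds at four unchecked items and a 34% rate at two, which is the "not many" in plain numbers.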
What the teams who succeed actually do
The pattern under every team that gets value from legal AI is the same, and it’s dull on purpose. They point the model at the work it’s good at, which is finding candidate cases, summarizing long records, and drafting first-pass language, and then they treat every citation it produces as unconfirmed until a person does three specific things to it. Locate it, read it, match it. Locate the case, meaning confirm it exists in a real reporter and not only inside the model’s sentence. Read what it actually held, not the tidy paraphrase the tool wrote underneath it. Match it to the proposition it’s being cited for, because a real case quoted for something it never said is its own species of fabrication, and the sneakiest one. That work is human, and what it needs is a step the unverified citation can’t slip past.
Here’s the counterintuitive part. The teams pulling the most out of legal AI are the ones who trust it least.
They start from the assumption that the output is wrong and build the step that catches it, and that single assumption is what frees them to use the tool aggressively everywhere upstream. The drafting gets faster. The first-pass research gets faster. What stays slow, deliberately, is the one move where speed turns into a sanctions risk. Locate-read-match is cheap when the source is sitting next to the claim and expensive when it means a scavenger hunt through a database, which is why the teams that scale it write the AI’s job down as a process with the verification built in, rather than leaving it to whoever remembers.
Notice what the division of labor is really doing.
The AI takes the mechanical bulk, the finding and the summarizing and the first rough draft, and the person takes the one thing the AI can’t do, which is decide whether each claim is actually true. That split only holds if the process draws the line in the right place and keeps it there, because an AI handed the judgment call will answer it with the same fluent confidence it brings to everything else. A model makes whatever process you wrap around it run faster, not truer. Wrap it in a real verification step and you get careful work at speed. Wrap it in paste-and-file and you get the one in six at speed.
The check people skip is the third one. Most lawyers will notice a citation to a case that doesn’t exist. Far fewer will catch a real case cited for a holding it doesn’t contain, because the citation passes the eye test: the reporter is real and the names check out.
A model can hand you a genuine appellate decision attached to a proposition that decision actually rejected, and the sentence around it will read as authoritative as any you wrote yourself. Only reading the case closes that gap, and the model is structurally unable to close it for you, since the confident summary is the exact artifact you’re trying to verify. That’s why “the AI cited a source” isn’t the finish line. It’s the start of the third check.
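One way to keep locate, read, match from collapsing into a single tick box is to record the three checks separately. The sketch below is an assumption about how a team might model that, not a description of any particular product. Every class and field name is hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CitationCheck:
    """One AI-supplied citation and the three human checks it must pass."""
    citation: str                   # citation string as it appears in the draft
    proposition: str                # what the draft says the case supports
    located: bool = False           # reviewer found the case in a real reporter
    read: bool = False              # reviewer read the holding, not the AI summary
    matched: Optional[bool] = None  # holding supports the proposition; None = not judged yet

    def status(self) -> str:
        # The checks are ordered: a citation cannot be matched before it is read.
        if not self.located:
            return "unconfirmed: not located"
        if not self.read:
            return "unconfirmed: not read"
        if self.matched is None:
            return "unconfirmed: not matched"
        if self.matched:
            return "verified"
        return "rejected: holding does not support proposition"

def draft_is_fileable(checks: list[CitationCheck]) -> bool:
    """A draft clears only when every citation in it is verified."""
    return all(c.status() == "verified" for c in checks)
```

The design choice worth noticing is that `matched` starts as `None` and not `False`, so a real case that nobody has compared to its proposition shows up as unfinished work, not as a quiet pass.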
Is verification optional? The bar already answered
The duty is older than the technology, which is what makes it bite. ABA Model Rule 1.1, Comment 8, asks a lawyer to keep abreast of changes in the law and its practice, “including the benefits and risks associated with relevant technology,” and most state bars have since adopted some version of that competence language. Read it next to the Stanford number and the obligation turns concrete. If a competent lawyer is expected to understand the risks of the tools they use, and the documented risk of a legal AI tool is a fabricated citation every few queries, then filing its output unchecked isn’t a clever shortcut. It’s a competence question with your name on it. Stack the older duty of candor toward the court on top, and the verification step stops reading as good hygiene and starts reading as the floor.
The rules didn’t change for AI.
What changed is the volume of unverified output flowing toward the courthouse, and the rules are absorbing it without strain. That should be reassuring rather than alarming, because it means the standard you’re being held to is one the profession already understood: stand behind what you submit. A verification workflow is just the operational form of a duty lawyers have always carried. The firms treating it that way aren’t waiting for a bar association to publish AI-specific guidance before they act, since the guidance that governs the situation is already written and has been for over a decade.
This framing also happens to be the one that wins the internal argument. Pitch a verification step as pure risk-aversion and it’s a tough sell, because you get told the team is careful and the meeting moves on. Pitch it as the documented professional standard, the thing a bar would expect of a competent lawyer using a tool that misses one query in six, and the step tends to get approved, because now the position that needs defending is the one where you decided your people didn’t need the check. Which would you rather explain to a disciplinary panel: that you built the gate, or that you skipped it on purpose?
Worth saying plainly, since I run a workflow company and not a law firm: this is the shape of the duty and not legal advice on your matter. But the direction is hard to miss. When the competence rule and the candor rule both point at the same missing step, the step isn’t optional.
Build the gate, not a better prompt
Telling lawyers to verify is a poster on a wall. Building verification into the workflow is a step the draft can’t route around, and that distinction is the whole game. A reminder depends on a busy associate remembering it at 7pm; a gate doesn’t care how busy anyone is, because the AI’s draft simply can’t reach a filing queue without passing through the verification step first. The model drafts, the output parks at that step, a named reviewer gets the draft with the source material attached to the same task, and only a recorded pass moves it forward while a fail routes it back with a note. Reacting to the Stanford study on Hacker News, a commenter going by dcchambers called the destination early: “AI does most of the work, but real people will still be required to audit and verify basically everything.” The workflow is what makes “verify everything” survivable instead of a daily act of willpower.
Teams that do this on every matter tend to stop reinventing the step and template it instead, so the review criteria travel with the work rather than living in one careful person’s head.
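The gate described above is small enough to sketch as a three-stage state machine. This is a simplified illustration under my own assumptions, not Tallyfy’s implementation or API. The stage names, the `Matter` class, and its methods are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum

class Stage(Enum):
    DRAFT = "draft"
    IN_REVIEW = "in_review"
    FILING_QUEUE = "filing_queue"

@dataclass
class Matter:
    title: str
    stage: Stage = Stage.DRAFT
    history: list = field(default_factory=list)  # recorded (reviewer, passed, note) decisions

    def submit_for_review(self) -> None:
        if self.stage is not Stage.DRAFT:
            raise ValueError("only a draft can be submitted for review")
        self.stage = Stage.IN_REVIEW

    def record_review(self, reviewer: str, passed: bool, note: str = "") -> None:
        if self.stage is not Stage.IN_REVIEW:
            raise ValueError("nothing is waiting for review")
        if not reviewer:
            raise ValueError("a named reviewer is required")
        if not passed and not note:
            raise ValueError("a failed review must carry a note")
        self.history.append((reviewer, passed, note))
        # A recorded pass is the only route to the filing queue;
        # a fail sends the matter back to drafting with its note.
        self.stage = Stage.FILING_QUEUE if passed else Stage.DRAFT
```

There is deliberately no method that moves a draft straight to the filing queue. In a real system that guarantee has to be enforced by permissions and not by convention, since nothing in this sketch stops code from setting `stage` directly.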
Cheap is the operative word.
The verification step has to cost minutes, not an afternoon, or busy people will quietly route around it and you’re back to one in six. In practice that means the reviewer opens a single task and finds the draft, the cited cases, and a short list of what to confirm all in one place, instead of chasing PDFs across a shared drive to reconstruct what the AI even relied on. Make the check expensive and it gets faked. Make it the path of least resistance and it gets done. That’s the entire reason this belongs in a workflow rather than a training memo nobody reopens.
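A cheap check starts with a task that arrives complete. As a rough sketch, assuming a review task is just a bundle of draft, cited cases, attached sources, and a checklist, a few lines can refuse to hand the reviewer a scavenger hunt. The `ReviewTask` class and its fields are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class ReviewTask:
    draft_text: str
    cited_cases: list                                     # citation strings the draft relies on
    attached_sources: dict = field(default_factory=dict)  # citation -> full text or file reference
    checklist: list = field(default_factory=list)         # what the reviewer must confirm

    def missing_pieces(self) -> list:
        """List what the reviewer would otherwise have to go hunting for."""
        missing = [
            f"source for {c}" for c in self.cited_cases
            if c not in self.attached_sources
        ]
        if not self.checklist:
            missing.append("review checklist")
        return missing

    def ready_for_reviewer(self) -> bool:
        return not self.missing_pieces()
```

A task that fails `ready_for_reviewer()` goes back to whoever assembled it, so the cost of gathering sources lands before the review and not inside it.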
A mistake we made early on at Tallyfy was assuming the thing customers wanted from automation was speed. The ones doing regulated work wanted close to the opposite at the decisive moment: a place to slow down on purpose, once, at the step where a missed check turns into a filed error. Everything else in their process could run fast. The review step earned its slowness by being the only thing standing between a confident draft and an irreversible submission. That’s also why a real gate needs somewhere for a rejection to go, so the loop of draft, check, fix, recheck actually closes instead of dead-ending, which is the kind of routing a process with conditional steps handles without anyone babysitting it.
Where a verification workflow stops helping
A verification workflow has limits, and a vendor who hides them is one to distrust. Here are the three that matter. First, a gate can’t rescue a careless check. Hand the review to someone who clicks pass without reading and you rebuild the original failure with a timestamp on it, so the design job is making the real check cheap enough that nobody is tempted to fake it, which means small units, attached sources, and clear criteria rather than a heavier signature. Second, a workflow can’t tell you where the law settles. Vendor exposure, negligence standards for agent deployments, and how any of this plays outside the US all belong with your counsel and not your process tool. Third, the deeper your automation runs, the more the verification step is doing work the model can’t, since the thing you’re checking is the very output the model is most confident about.
The last limit is the useful one. This was never a legal-only problem.
The exact shape, where AI drafts and a person verifies against the source and the process records that it happened, is what clinical teams are building around medical AI and what pharma’s GxP rules already require of any system touching a validated record. Legal just arrived at the lesson loudly, through sanctions, with names attached. The professions that handle it well won’t be the ones whose models never erred, because every model errs. They will be the ones who can show, for any consequential thing the AI touched, that a person checked it before it counted. Everyone running AI near work that matters lands in the same place eventually, because a faster wrong answer is still wrong, and the one thing that reliably catches it is a step you put there on purpose.
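Being able to show that a person checked it before it counted comes down to a query over records. Here is a minimal sketch, assuming each AI-touched item stores who signed off and when. The record shape and function name are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class AiOutputRecord:
    item_id: str
    used_at: datetime                      # when the output was filed or acted on
    checked_by: Optional[str] = None       # named person who verified it
    checked_at: Optional[datetime] = None  # when they signed off

def unproven_items(records: list[AiOutputRecord]) -> list[str]:
    """Return ids that lack a named sign-off recorded before the output was used."""
    return [
        r.item_id
        for r in records
        if not r.checked_by or r.checked_at is None or r.checked_at > r.used_at
    ]
```

A sign-off stamped after the filing counts as unproven here, because a check that happened afterwards did not stand between the draft and the submission.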
The crisis didn’t end legal AI. It ended the version that files unread.
The model will be wrong some of the time by design, about one query in six for the better tools on Stanford’s benchmark, and the verification step is the difference between a caught error and a filed one.