Summary
- Regulators ask a different question than your data scientists - accuracy is a model metric, auditability is a workflow property. A 2025 arXiv study of Ryt Bank describes the first regulator-approved deployment worldwide where conversational AI runs a bank’s primary interface, built on “deterministic guardrails, human-in-the-loop confirmation, and a stateless audit architecture.”
- What makes an AI financial action defensible? Not a higher accuracy score. An agent that is right 95% of the time with a fully traceable chain beats one that is right 99% of the time and opaque, because only the first can be reconstructed when an examiner asks.
- Most agent deployments invert the priority - a second 2025 paper on agentic financial-crime compliance notes that most AI solutions “remain opaque and poorly aligned with regulatory expectations,” and argues for bounded roles plus audit logging instead.
- The fix is a process, not a smarter model - wrap the action in a defined workflow with a named approver. See how Tallyfy structures approval gates.
A bank examiner doesn’t open with a question about model accuracy. The examiner asks something colder: show me why this specific decision got made, who signed off, and prove the record hasn’t been edited since. If you can answer that for every consequential action, a system that is right 95% of the time is fine. If you can’t, a 99%-accurate one is a liability waiting for an exam.
That inversion trips up almost every team wiring AI into regulated, high-stakes work. Engineers optimize the number on the benchmark. Regulators optimize for reconstruction, which is the ability to walk backward from an outcome to the reasoning and the human who owned it. The two goals overlap. They are not the same, and in finance, failing the second is what gets you fined.
So when an AI agent starts executing real financial actions, whether moving funds, flagging a transaction, or clearing a customer through onboarding, the question that decides whether you survive an exam isn’t how often it’s right. It’s whether you can reconstruct what it did and why.
This post builds that second answer from two 2025 research deployments, and it rests on one unglamorous claim: the audit trail beats the model.
Why do regulators care more about reasoning than accuracy?
Because their job isn’t to grade your model. It’s to reconstruct a single decision after the fact and judge whether it was defensible, and a system that can’t show its work makes that impossible no matter how often it lands the right answer. A supervisor reviewing a disputed wire transfer doesn’t get to re-run your model a thousand times and read off an average. They get one event, after it happened, and they need to know what the agent saw, which rule it applied, and who let it through. If those facts aren’t recorded, the accuracy rate is a number with nowhere to stand. The score describes a population. The exam is about an individual.
Henrik Axelsen, Valdemar Licht, and Jan Damsgaard put it plainly in a 2025 paper on agentic AI for financial crime compliance. Most AI solutions, they write, “remain opaque and poorly aligned with regulatory expectations” - the accuracy is fine, the explainability isn’t.
Their system, designed with a fintech firm and regulatory stakeholders in the room, leans the other way. It automates onboarding, monitoring, investigation, and reporting, and it “assigns clearly bounded roles to autonomous agents and enables task-specific model routing and audit logging.” The whole design emphasizes “explainability, traceability, and compliance-by-design.” Notice what carries the weight in that sentence. Not a bigger model. Bounded roles, logging, and a design that assumes someone will ask later.
A benchmark score tells you how often a model is right across a test set. It tells a regulator nothing about one transfer that went wrong on a specific Tuesday afternoon. The examiner doesn’t want a distribution. They want the inputs that fed that one decision, the rule the agent applied, and the name of the person who approved it.
What caught us off guard early, talking with operations teams in regulated work, was which question the regulators lead with. Not “is the model good.” It’s “can you rebuild this decision without phoning the vendor.” An agent that produces a clean answer and no reconstructable path has, from a supervisor’s seat, produced nothing they can actually use.
Accuracy is necessary, but it isn’t what’s being examined.
Inside a regulator-approved AI bank
The cleanest existence proof I’ve read lands in a 2025 arXiv paper by Xin Jie Chua and colleagues, titled “Banking Done Right.” It documents Ryt AI, the framework behind Ryt Bank, which the authors describe as “the first global regulator-approved deployment worldwide where conversational AI functions as the primary banking interface.” That phrasing is deliberate, because the paper draws a line between Ryt and earlier bank assistants that were “limited to advisory or support roles” and never touched the money. Here, customers don’t tap through screens to move funds. They talk, and the system executes core transactions straight from the conversation, running on an in-house model the authors call ILMU. For a regulated bank, putting a language model on the primary interface is a bold thing to ask a supervisor to approve. How they got it approved is the lesson worth copying.
The bank didn’t earn approval by claiming a flawless model. It earned approval through structure.
The paper describes four narrow agents, named Guardrails, Intent, Payment, and FAQ, each attached to that internal model, and then states the safety design in one sentence: “Deterministic guardrails, human-in-the-loop confirmation, and a stateless audit architecture provide defense-in-depth for security and compliance.” Read that list back slowly. None of those three things is the model itself. A deterministic guardrail is a rule that fires whether or not the model agrees with it. Human-in-the-loop confirmation is a person approving the action before it commits. A stateless audit architecture is a record built so the decision can be replayed later.
So the intelligence proposes, and the structure around it decides what’s allowed to happen and writes down what did.
That ordering is what turns a clever demo into something a regulator will approve. The model gets to be probabilistic in the middle, because the edges are deterministic and recorded. Strip those deterministic rules and the confirmation step and the audit architecture away, and you’re left with a chatbot moving money on a hunch, which no supervisor on earth will sign. The interesting part of Ryt isn’t that the model is good. It’s that the bank built as if the model were the least trustworthy component in the chain, and engineered around that.
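To make "deterministic at the edges" concrete, here is a minimal sketch in Python. It is not Ryt's implementation, which the paper does not publish at this level. The limits, field names, and the `decide` function are all assumptions made up for illustration. The point it shows is narrow: the rule checks never read the model's confidence, and nothing commits without an explicit confirmation.

```python
from dataclasses import dataclass

# Hypothetical limits for illustration; real values would come from bank policy.
PER_TRANSFER_LIMIT = 5_000.00
ALLOWED_CURRENCIES = {"USD"}


@dataclass(frozen=True)
class ProposedTransfer:
    amount: float
    currency: str
    payee_id: str
    model_confidence: float  # reported by the model, deliberately ignored below


def guardrail_violations(p: ProposedTransfer, known_payees: set[str]) -> list[str]:
    """Deterministic checks: same input, same answer, no model involved."""
    violations = []
    if p.amount <= 0:
        violations.append("amount must be positive")
    if p.amount > PER_TRANSFER_LIMIT:
        violations.append("amount exceeds per-transfer limit")
    if p.currency not in ALLOWED_CURRENCIES:
        violations.append("currency not allowed")
    if p.payee_id not in known_payees:
        violations.append("payee not on the customer's saved list")
    return violations


def decide(p: ProposedTransfer, known_payees: set[str],
           customer_confirmed: bool) -> tuple[str, list[str]]:
    """Return the outcome plus the reasons, so both can go into the audit record."""
    violations = guardrail_violations(p, known_payees)
    if violations:
        return "blocked", violations        # the rule fires even at 0.99 confidence
    if not customer_confirmed:
        return "awaiting_confirmation", []  # nothing commits without a person
    return "committed", []
```

A transfer the model is 99% sure about still gets blocked if it breaks a limit, and a transfer that passes every rule still waits for a person. That is the ordering the paragraph above describes.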
That’s a tough sell to an engineer measuring a leaderboard, and it’s the right call anyway.
Wrap the financial action in a defined process
Strip both papers down and the same shape falls out. A defined process feeds the AI step; the AI step exposes its inputs and its proposed output; a human with a name approves anything that moves money or status; and an append-only log records the whole chain so it can be rebuilt later. The agent is one bounded participant inside that sequence, never the sequence itself, and that single design choice is what most “autonomous agent” pitches quietly skip. Those pitches sell the opposite: an agent that decides and acts end to end, with the human edited out as a bottleneck. In a regulated firm, the human isn’t the bottleneck. The human is the accountable party the whole record points back to. Edit them out and you haven’t built an agent a bank can run, you’ve built a faster way to reach a decision nobody can defend.
The order matters more than it looks. Put the AI step before a human approval gate and the agent’s confidence never reaches the ledger on its own. Put the logging at the workflow level rather than inside the model, and the trail survives even when you swap the model out next quarter. None of this is a new trick, either. It’s the ordinary discipline of a loan approval workflow or an AML compliance program, with one step now handled by a model instead of a junior analyst.
The model changes. The shape of the process shouldn’t.
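The shape is small enough to sketch. What follows is a rough Python illustration, not code from either paper and not how any particular product stores its history. `WorkflowRun`, `run_transfer`, and the event names are hypothetical. The model is passed in as a plain function (`propose`), so the logging lives in the workflow and does not care which model produced the proposal.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable


@dataclass
class WorkflowRun:
    run_id: str
    log: list[dict] = field(default_factory=list)

    def record(self, event: str, **details) -> None:
        # Entries are only ever appended here; a real system would back this
        # with write-once storage rather than an in-memory list.
        self.log.append({
            "run_id": self.run_id,
            "at": datetime.now(timezone.utc).isoformat(),
            "event": event,
            **details,
        })


def run_transfer(run: WorkflowRun, inputs: dict,
                 propose: Callable[[dict], dict],
                 approver: str, approved: bool) -> bool:
    # 1. AI step: the model only proposes; its inputs and output are logged.
    proposal = propose(inputs)
    run.record("ai_proposed", inputs=inputs, proposal=proposal)

    # 2. Blocking gate: a named human decides before anything commits.
    if not approver:
        raise ValueError("an approval needs a named approver")
    run.record("human_decision", approver=approver, approved=approved)

    # 3. Commit step: runs only on approval.
    if approved:
        run.record("committed", proposal=proposal)
    return approved
```

Swapping `propose` for a different model changes nothing about the record's structure, which is the practical meaning of logging at the workflow level.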
The same templates regulated teams already run make good starting frames once you decide where the AI step sits.
Each of those has the right bones already: a defined entry, checks along the way, and a sign-off step that someone owns. Dropping an AI step into the checking parts, while keeping the sign-off human, is most of the work. A model can read the application against the rules and surface what’s missing. It can pull the watchlist hit into view and draft the rationale a reviewer will confirm or reject. What it doesn’t do is clear the customer, because clearing is the committing step, and committing is where a human name belongs.
The hard part was never the model. It’s writing the process down clearly enough that a step can be handed over at all, which is the same wall every team hits the first time they try this.
Traceability is a property of the process
Here’s the distinction that decides whether your AI deployment is auditable: traceability is something the workflow produces rather than a feature you bolt onto a model. A model can be coaxed into explaining itself, and that explanation can still be wrong, incomplete, or a bit different on a re-run. A workflow record is just what happened, captured in order, as it happened. One is a story the model tells about itself. The other is evidence. When a regulator pulls a file, they want evidence, and the difference between the two is whether the record was generated by the work or narrated after it. A persuasive explanation you can’t check against what the system actually did is worth less to a supervisor than a dull log you can, which is exactly why the boring record wins the audit.
Something we learned slowly, as AI crept into regulated decisions, is that the trail has to be a byproduct of the work rather than a thing someone assembles after an examiner calls. Reconciling logs by hand the week before an exam is a messy, error-prone scramble, and it’s exactly the scramble supervisors read as a warning sign. The whole point of the agentic-compliance paper’s “audit logging” and Ryt’s “stateless audit architecture” is that the record accumulates on its own, because the process generated it.
You don’t write the trail. The work does.
In Tallyfy terms, the AI’s contribution lives inside a step, and the step that follows is a blocking approval with a named owner - a gate the action can’t move past without a signature, rather than a notification someone can wave through. The run history then records every step as it happens, so the reconstruction an examiner wants exists by default: what the agent saw, what it proposed, who approved it, and when it moved. This is the same reasoning behind tamper-evident audit trails for agent actions, applied to money instead of medical records.
Turns out the dull part of the system is the part a regulator trusts.
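One common way to make a record tamper-evident is a hash chain, where every entry carries a hash of the entry before it. The sketch below is a generic illustration using Python's standard `hashlib` and `json` modules. It is an assumption about how such a trail could be built, not a description of Tallyfy's or Ryt's storage. The function names and the `GENESIS` constant are invented here.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash: str, payload: dict) -> str:
    # Canonical JSON so the same payload always hashes the same way.
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_entry(chain: list[dict], payload: dict) -> None:
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    chain.append({
        "payload": payload,
        "prev_hash": prev_hash,
        "hash": _digest(prev_hash, payload),
    })


def verify(chain: list[dict]) -> bool:
    """Recompute every hash in order; any edited entry breaks the chain."""
    prev_hash = GENESIS
    for entry in chain:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["payload"]):
            return False
        prev_hash = entry["hash"]
    return True
```

A chain like this only shows that an entry was changed if the latest hash is also kept somewhere the editor cannot reach, since anyone with full write access could recompute the whole chain. Where that anchor lives is a deployment decision this sketch leaves open.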
A reviewer still signs. What changes is that the reviewer sees a pre-checked proposal instead of a blank form, and the record assembles itself inside the workflow rather than inside someone’s memory. And when the agent does get something wrong, the question of who answers for the wrong call has a clean answer, because the gate has a name attached to it.
The gate is where accountability stops being abstract.
Which steps can an agent safely touch?
The useful way to sort a bank’s steps isn’t by department or by how clever the model is. It’s by what each step does to the outside world. Some steps only propose: they look something up, check it, or draft a result a human will weigh, and a wrong proposal there costs nothing worse than a few seconds of that reviewer’s attention. Other steps commit: move the funds, clear the customer, release the filing, and a wrong commit is the exact event an examiner reconstructs months later. Put the agent on the proposing steps and keep a human on every committing one, and you’ve decided where AI is safe in a regulated firm without running a single benchmark. Most teams reach for the committing end first, because an agent that decides things makes the better demo. Resist that.
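The propose-or-commit sort can be written down as a small table plus one rule. This is a hedged sketch in Python. The step names and the `assign` helper are hypothetical, chosen to echo the examples in this post rather than any real system's configuration.

```python
from enum import Enum


class Effect(Enum):
    PROPOSE = "propose"  # looks up, checks, or drafts; a human weighs the result
    COMMIT = "commit"    # changes money, status, or a filing


# Hypothetical step names for an onboarding-and-payments flow.
STEPS = {
    "gather_documents": Effect.PROPOSE,
    "verify_identity": Effect.PROPOSE,
    "score_risk": Effect.PROPOSE,
    "flag_mismatches": Effect.PROPOSE,
    "clear_customer": Effect.COMMIT,
    "move_funds": Effect.COMMIT,
    "release_filing": Effect.COMMIT,
}


def assign(step: str, actor_kind: str) -> str:
    """Allow an 'agent' only on proposing steps; committing steps need a 'human'."""
    if actor_kind not in {"agent", "human"}:
        raise ValueError(f"unknown actor kind: {actor_kind}")
    if STEPS[step] is Effect.COMMIT and actor_kind != "human":
        raise PermissionError(f"{step} commits; it needs a human owner")
    return f"{step} -> {actor_kind}"
```

The check runs when the workflow is configured, not when the model is evaluated, so the decision about where an agent may act never depends on a benchmark score.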
The proposing steps are where AI belongs today. A model reading a transaction against a watchlist, or comparing an application to what the rules demand, does work that’s tedious and time-sensitive at exactly the moment it’s cheapest to catch a problem. The KYC onboarding process is full of these: gather the documents, verify identity, score the risk, flag the mismatches. None of those steps commits anything on its own. Take a sanctions screen: the model surfaces a possible name match with its supporting context, and a compliance officer decides whether it’s actually the same person. That’s about as humble as a financial AI deployment gets, and it’s the right level of ambition for a first one.
A person owns every committing step, because that name is what an examiner traces back to.
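The sanctions-screen example can be sketched as a proposing step in a few lines. This is a toy, and loudly so: the watchlist entries are made up, the 0.80 threshold is an assumption, and plain string similarity from Python's standard `difflib` stands in for whatever matching a real screening tool uses (aliases, transliteration, dates of birth). What it illustrates is the output contract: the step returns candidates for review and has no way to clear or block anyone.

```python
from difflib import SequenceMatcher

# Made-up watchlist entries for illustration only.
WATCHLIST = [
    {"name": "Ivan Petrovich Sokolov", "list": "example-list-A"},
    {"name": "Maria Delgado Ruiz", "list": "example-list-B"},
]

REVIEW_THRESHOLD = 0.80  # assumed cutoff; choosing it is a policy decision


def normalise(name: str) -> str:
    # Lowercase and collapse whitespace so formatting noise does not hide a match.
    return " ".join(name.lower().split())


def screen(customer_name: str) -> list[dict]:
    """Return possible matches for a human to review. Never clears or blocks."""
    hits = []
    for entry in WATCHLIST:
        score = SequenceMatcher(
            None, normalise(customer_name), normalise(entry["name"])
        ).ratio()
        if score >= REVIEW_THRESHOLD:
            hits.append({
                "candidate": entry["name"],
                "source": entry["list"],
                "similarity": round(score, 2),
                "status": "needs_officer_review",
            })
    return sorted(hits, key=lambda h: h["similarity"], reverse=True)
```

Every hit leaves the function with the status `needs_officer_review`. Whether it is actually the same person stays with the compliance officer.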
This is the same lesson the reliability math of multi-step agents keeps teaching: the more consequential actions you let an agent chain together unsupervised, the faster the odds of an unrecoverable mistake climb. Containment beats raw capability when the downside is a regulatory finding. A painful one, in fines and remediation, the kind that lands long after the demo impressed everyone in the room.
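The compounding is easy to see under one simplifying assumption that this post does not make explicitly: each step succeeds independently with the same probability p. The chance that a chain of n unsupervised steps finishes with no error is then:

```latex
P(\text{no error in } n \text{ steps}) = p^{n}
\qquad
0.95^{10} \approx 0.599,
\quad
0.99^{10} \approx 0.904,
\quad
0.99^{50} \approx 0.605
```

Under that assumption, even a 99%-accurate step, chained fifty times without a checkpoint, finishes cleanly only about six times in ten. Real steps are rarely independent, so treat these figures as an illustration of the trend rather than a forecast.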
Transaction monitoring is a good second step once the reading works. An agent watches the flow, scores the anomalies, and assembles a case file with the evidence a human investigator would otherwise spend an hour gathering by hand. The investigator still decides whether to escalate or file. What the agent saves is the gathering, not the judgment, and the case file it builds becomes part of the audit trail rather than a thing reconstructed later. That’s the whole pattern in miniature: the model does the legwork, the human makes the call, and the workflow holds both so an examiner can replay the sequence months from now.
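A minimal sketch of that pattern, assuming the simplest scoring rule that will do for illustration: a z-score of the new amount against the customer's own history, using Python's standard `statistics` module. The threshold, field names, and `build_case_file` are hypothetical, and real monitoring uses far richer signals. The part worth noticing is that the case file leaves the decision fields empty.

```python
from statistics import mean, stdev

Z_THRESHOLD = 3.0  # assumed cutoff for illustration


def build_case_file(customer_id: str, history: list[float], new_amount: float):
    """Score one transaction against the customer's own history.

    Returns a case file for a human investigator, or None if nothing stands out.
    """
    if len(history) < 2:
        return None  # not enough history to compute a spread
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        # Flat history: any different amount is treated as maximally unusual.
        z = 0.0 if new_amount == mu else float("inf")
    else:
        z = (new_amount - mu) / sigma
    if abs(z) < Z_THRESHOLD:
        return None
    return {
        "customer_id": customer_id,
        "amount": new_amount,
        "z_score": round(z, 2),
        "evidence": {
            "history_mean": round(mu, 2),
            "history_stdev": round(sigma, 2),
            "history_count": len(history),
        },
        "decision": None,    # left for the investigator
        "decided_by": None,  # filled in with a name at the gate
    }
```

The agent fills in the evidence; the `decision` and `decided_by` fields stay empty until a person signs, and the finished file becomes one more entry in the workflow's record.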
Then let the structure carry the weight, the way the wider move to workflow automation already does for work that has no AI in it at all. Define the process. Put the AI step where it reads and checks. Keep a human on every step that commits. Log the whole thing at the workflow level so the trail is a byproduct, not a project you dread.
A 99%-accurate agent that can’t explain a single decision is a liability in a regulated firm. A 95%-accurate one inside a workflow that records every move is an asset you can defend in an exam.
Build for the second one.
The model will keep getting better on its own. The trail won’t, unless you build the process that produces it.