Audit trails for AI agents - what regulators expect

Summary

Regulators want proof, not printouts - EU AI Act Article 12 requires that high-risk systems “technically allow for the automatic recording of events (logs) over the lifetime of the system.” A log table your engineers can UPDATE is a record of what the database currently says, which is not the same thing.
What does tamper-evident actually mean? Append-only records, each one carrying a hash of the one before it, so deleting or reordering anything afterward becomes provable. One compliance builder summed up the gap: logs can be edited, and self-attestation is just a trust claim.
One incident starts three clocks - a fintech developer’s tally from a February 2026 compliance thread: DORA Article 19 incident reporting in 4 hours, GDPR Article 33 breach notification in 72 hours, AI Act human oversight stacked on top. Manually reconstructing agent activity does not finish inside those windows.
A defined workflow records itself - when the agent acts inside a process step, every transition, assignment, and approval lands in the trail automatically. See how that looks in Tallyfy.


A log line you can edit is a note, not evidence.

That distinction is the entire subject of this post, and it’s about to matter to anyone running AI agents in production. Here’s the short version up front: regulators reviewing AI systems expect every consequential action to be recorded automatically, attributably, and in a form nobody can quietly rewrite afterward. Most teams running agents today capture their activity in ordinary database tables - the kind any engineer with write access can UPDATE. The gap between those two states is wide, and closing it with bolted-on logging is miserable work. Closing it structurally, by running the agent inside a defined process that records every transition as a side effect, is basically free.

The regulatory text exists, the deadlines are real even after this week’s deferral, and the people building compliance tooling have been unusually candid about where teams fall short. So this post walks through the expectation itself, the three regulations that can converge on a single incident, and why the way AI is reshaping operational accountability keeps pointing back at the same old answer: define the process, and let the process keep the books.

Why is an editable log not evidence?

Because the question an auditor or regulator asks is never “what happened?” It’s “how do you know nothing changed since?”

In February 2026, a developer with the handle gibs-dev posted an Ask HN about EU AI Act compliance - a small thread, but the exchange inside it is the clearest articulation of this problem I’ve read anywhere. One reply came from alexgarden, a builder working on compliance tooling, who laid out why the documentation duty is harder than it reads: “Documentation gets stale the moment you deploy. Logs can be edited. Self-attestation is just a trust claim.”

Sit with that middle sentence. Logs can be edited.

Almost every agent deployment I’ve seen described stores its activity the obvious way: rows in Postgres, documents in Mongo, lines shipped to a log aggregator with a retention policy someone set in 2023. All of those are useful for debugging. None of them prove anything, because the team that writes the rows can also rewrite them, and a regulator has no reason to take “we would never” on faith. Self-attestation, as the man said, is just a trust claim.

The same commenter described what the credible version looks like: “Tamper-evident audit trails. Append-only, hash-chained, so you can prove nothing was deleted or reordered after the fact.” And then the line that turns the screw: “This is the difference between ‘we logged it’ and ‘we can prove we logged it.’”

Append-only means records get added and never modified - a correction is a new record, not a changed one. Hash-chained means each record carries a fingerprint of the previous record, so removing or reordering anything breaks the chain visibly.
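To make those two definitions concrete, here is a minimal in-memory sketch in Python. Every name in it (`append_record`, `verify_chain`, `GENESIS_HASH`, the record fields) is an assumption made for illustration, not a standard and not any product's implementation. It shows only the mechanism: each record stores the hash of the record before it, and verification recomputes the chain from the start.

```python
import hashlib
import json

# Hypothetical placeholder used as the "previous hash" of the very first record.
GENESIS_HASH = "0" * 64


def record_hash(body: dict) -> str:
    """SHA-256 over a canonical JSON encoding of the record body."""
    encoded = json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def append_record(chain: list, actor: str, action: str, timestamp: str) -> dict:
    """Append a new record that carries the hash of the one before it."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS_HASH
    body = {
        "seq": len(chain),
        "timestamp": timestamp,
        "actor": actor,
        "action": action,
        "prev_hash": prev_hash,
    }
    record = dict(body, hash=record_hash(body))
    chain.append(record)
    return record


def verify_chain(chain: list) -> bool:
    """Return True only if no record was edited, removed from the middle, or reordered."""
    prev_hash = GENESIS_HASH
    for seq, record in enumerate(chain):
        body = {key: value for key, value in record.items() if key != "hash"}
        if record["seq"] != seq or record["prev_hash"] != prev_hash:
            return False
        if record_hash(body) != record["hash"]:
            return False
        prev_hash = record["hash"]
    return True
```

Two caveats on this sketch. A correction would be a further `append_record` call that refers back to the earlier record, never an edit to it. And a chain like this cannot detect someone chopping records off the end, or rebuilding the whole chain from scratch, unless the latest hash is also kept somewhere the writer cannot change.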

Figure: Audit trail contrast. An edited database row loses its history; appended step records each carry the prior record's hash.

Neither idea is exotic. Accountants ran append-only ledgers on paper for centuries, and git has made hash-chained history ordinary for every developer on earth. What’s new is regulators expecting it from the systems running AI - and the candid follow-up question in that same thread, from the original poster, was whether anyone actually implements hash-chaining in production yet “or is this still theoretical for most teams?” The regulation, gibs-dev noted, “requires record-keeping but doesn’t specify the technical standard, yet.”

We went through our own version of this education years before agents were the driver - the long arc from activity feed to compliance-grade record is a build story we’ve told separately - so none of the above reads as theoretical from where I sit. Turns out the hard part was never storing events. It’s resisting the urge to make stored events editable.

Article 12, translated

The legal anchor for all of this, for anyone in scope of the EU AI Act, is Article 12, and it’s mercifully short.

High-risk AI systems must “technically allow for the automatic recording of events (logs) over the lifetime of the system.” Three load points in one sentence. Automatic: a human exporting a CSV when asked doesn’t count. Events: actions and state changes, not summaries written after the fact. Lifetime: not the last 30 days because that’s what your log retention defaulted to.
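As a rough illustration of what "automatic" and "events" can mean at the code level, the sketch below wraps each agent tool function so that calling it always leaves an event behind, whether the call succeeds or fails. The decorator name `recorded`, the event fields, and the in-memory `trail` list are assumptions for this example. Article 12 does not prescribe any of them, and a real deployment would write to durable storage with a retention period that matches the system's lifetime.

```python
import functools
from datetime import datetime, timezone


def recorded(trail: list, actor_id: str):
    """Hypothetical decorator: every call to the wrapped function appends one event."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            status = "error"  # assume failure until the call returns normally
            try:
                result = func(*args, **kwargs)
                status = "ok"
                return result
            finally:
                # Runs on success and on exception, so no call goes unrecorded.
                trail.append({
                    "timestamp": datetime.now(timezone.utc).isoformat(),
                    "actor": actor_id,
                    "action": func.__name__,
                    "status": status,
                })

        return wrapper

    return decorator
```

The point of the design is that nobody has to remember to log: the event is a side effect of the action itself, which is the opposite of a CSV exported on request.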

The article then says what the logging has to be good for - identifying situations that may present a risk, supporting the post-market monitoring the Act requires elsewhere, and monitoring day-to-day operation. Read those purposes together and the shape becomes clear: the log is not a debugging convenience. It’s the primary artifact a regulator uses to reconstruct what your system did, which is why an editable one fails the assignment even when every row in it happens to be true.

Who has to care, and when? Fair questions, and the timing answer changed days ago. Under the Digital Omnibus agreement reached on May 7, as Covington’s summary lays out, obligations for standalone high-risk systems under Annex III now begin December 2, 2027 - a 16-month deferral - with product-embedded systems following in August 2028. The deferral’s full timeline - what moved and what didn’t, including the six-month log-retention floor that falls on deployers rather than builders - is a story of its own, so I won’t re-run it here. The relevant point for this post is narrower: the record-keeping requirement survived the renegotiation untouched. Brussels gave everyone more time to build the trail. Nobody was excused from building it.

And honestly, the regulation is the lagging indicator. Enterprise procurement teams already ask AI vendors how agent actions get logged and whether the logs can be altered - they ask because their auditors ask them. The contractual version of Article 12 arrives by questionnaire, and it does not wait for 2027.

Count the clocks that start in one incident

Here’s the operational scenario that makes editable logs go from embarrassing to expensive.

The same HN thread carried a reply from gibs-dev that stacked the regulatory math: “DORA Article 19 incident reporting (4 hours) + GDPR Article 33 breach notification (72 hours) + AI Act Article 14 human oversight - hitting all three during a live incident with manual lookups is not realistic.” And the conclusion: “That’s an API problem, not a legal review problem.”

Walk through what those clocks mean in practice. Your agent does something consequential and wrong - sends data somewhere it shouldn’t, executes a transaction it shouldn’t, classifies a customer in a way that triggers an obligation. If you’re a financial entity in DORA’s scope, a major incident wants initial notification within hours, not days. If personal data went sideways, GDPR’s 72-hour clock to the supervisory authority is already running. And the AI Act’s oversight provisions assume a human can establish what the system did and intervene.

Every one of those duties begins with the same mundane act: reconstructing what the agent actually did, in order, with timestamps, fast.
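In code, that mundane act is a merge. The sketch below assumes each system can already hand over its events as dictionaries with an ISO 8601 `timestamp` (including a UTC offset), an `actor`, and an `action`. Those field names, and the function name `reconstruct_timeline`, are hypothetical. It shows how small the job is when the events exist in a consistent shape, which is precisely the part most teams are missing.

```python
import heapq
from datetime import datetime


def parse_ts(event: dict) -> datetime:
    """Parse the event's ISO 8601 timestamp (assumed to carry a UTC offset)."""
    return datetime.fromisoformat(event["timestamp"])


def reconstruct_timeline(sources: dict, agent_id: str) -> list:
    """Merge per-system event lists into one ordered list for a single agent."""
    streams = []
    for system, events in sources.items():
        # Keep only this agent's events and tag each with the system it came from.
        own = [dict(event, system=system) for event in events if event["actor"] == agent_id]
        streams.append(sorted(own, key=parse_ts))
    # heapq.merge combines the already-sorted streams without re-sorting everything.
    return list(heapq.merge(*streams, key=parse_ts))
```

A call such as `reconstruct_timeline({"crm": crm_events, "payments": payment_events}, "agent:billing")` would return that agent's actions across both systems in time order, each labeled with its source.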

A question worth asking your own team cold: if an agent misbehaved at 2 p.m. today, how long until you could produce the complete, ordered list of its actions?

For most teams the honest answer involves one engineer who knows where the logs live, a couple of hours of grep, and a spreadsheet assembled under pressure - then a second pass when someone notices the agent also touched a system whose logs live somewhere else entirely. That reconstruction job is exactly what a regulator means by record-keeping, except they expect it to already exist before the incident, continuously, with nothing edited after the fact.

One more wrinkle from the thread, because it’s the sort of detail only practitioners flag: compliance checks bolted on as middleware can fail open. gibs-dev had seen teams “bolt on compliance checks as middleware that silently degrades to ‘allow’ on timeout,” which is “worse than no check at all because you have a false paper trail.” A trail that lies confidently is the one outcome worse than no trail.
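The alternative to failing open is failing closed and saying so in the record. Here is a sketch under stated assumptions: `check` is any callable that returns a truthy value to allow the action, and `guarded_decision` and the trail fields are names invented for this example. A timeout or a crash in the check produces a denial, and the trail states the reason instead of pretending the check passed.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def guarded_decision(check, action: dict, timeout_s: float, trail: list) -> bool:
    """Run a compliance check; on timeout or error, deny and record why."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(check, action)
    try:
        allowed = bool(future.result(timeout=timeout_s))
        outcome = "allow" if allowed else "deny"
        reason = "check_completed"
    except FutureTimeout:
        # The check did not answer in time: deny instead of degrading to "allow".
        allowed, outcome, reason = False, "deny", "check_timed_out"
    except Exception as exc:
        # The check itself failed: also a denial, with the failure type on record.
        allowed, outcome, reason = False, "deny", f"check_failed:{type(exc).__name__}"
    finally:
        # Do not block on a slow check; its worker thread finishes in the background.
        pool.shutdown(wait=False)
    trail.append({"action": action["name"], "outcome": outcome, "reason": reason})
    return allowed
```

Whether a denied action should be queued for a human, retried, or dropped is a policy choice this sketch leaves open. What it does guarantee is that the trail distinguishes "checked and allowed" from "could not check".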

Agents change what a trail has to prove

One misconception comes up whenever audit trails meet AI agents: that an agent is just a fast user, so the logging you built for people will stretch to cover it. It won’t, for three reasons that compound.

Volume is the obvious one. A person on your team takes maybe a few dozen consequential actions a day, and if the record is thin you can ask them what happened. An agent can take thousands of actions an hour, across systems, around the clock. At that rate the trail stops being a supplement to human memory and becomes the only witness there is. A person can sit in a deposition and explain themselves. An agent’s testimony is its records - there is nothing else to ask.

Attribution is the subtler one. When an agent completes something, “who did this” has at least three correct answers: the model that generated the action, the integration or step that invoked it, and the human who authorized that step’s existence. A useful trail captures the distinction instead of flattening it, because the remediation differs for each - you retrain a model, you fix a step, you retrain a person - and a regulator will want to know which one your incident report points at. This is a problem we hit long before LLMs, with ordinary rule-based automation - our answer was to make the system itself a first-class, filterable actor in the activity record rather than crediting a human who wasn’t there, and the same logic extends to agents directly. Auditors investigating an incident filter by actor first. “Show me everything the automation did in March” has to be a query that just runs.
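One way to keep those three answers from being flattened is to give each its own field on the event. The record shape below is a sketch, not Tallyfy's schema: `TrailEvent`, its field names, and the `actions_by` helper are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class TrailEvent:
    timestamp: str                       # ISO 8601, e.g. "2026-03-03T09:01:00+00:00"
    action: str
    actor_type: str                      # "human", "automation", or "agent"
    actor_id: str                        # who or what performed the action
    model: Optional[str] = None          # model that generated the action, if any
    invoked_by: Optional[str] = None     # step or integration that invoked it
    authorized_by: Optional[str] = None  # human who authorized that step to exist


def actions_by(events: List[TrailEvent], actor_type: str, month: str) -> List[TrailEvent]:
    """Return events of one actor type in one month, given as 'YYYY-MM'."""
    return [
        event for event in events
        if event.actor_type == actor_type and event.timestamp.startswith(month)
    ]
```

With events in this shape, "show me everything the automation did in March" becomes `actions_by(events, "automation", "2026-03")`, and an incident report can point at the model, the step, or the authorizing person separately.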

Scope is the third, and it’s the one that turns logging into structure. The counterintuitive thing we’ve noticed building for this: the best predictor of a clean audit trail isn’t logging discipline, it’s how narrowly the agent’s job was defined in the first place. An agent with broad standing permissions produces a trail that’s technically complete and practically unreadable - ten thousand actions with no boundaries around them, which a reviewer parses with a flashlight and a prayer. An agent that acts inside a defined step produces a trail that’s already organized: this step, this input, this output, this handoff. Same events. Completely different evidentiary value.

Messy trails get sampled and distrusted. Structured trails get read and believed.

Let the workflow keep the records

Which brings me to the structural answer, and to the reason I’d rather solve this with process design than with logging heroics.

When an AI agent operates inside a defined workflow - one step among several, with named humans on the consequential ones - the audit trail stops being a thing you remember to build and becomes a thing the work produces. Every step transition is recorded when it happens. Every assignment, every form submission, every approval and rejection carries an identity and a timestamp because the workflow cannot advance without them, so who approved which change is always on the record. The agent’s contribution sits in the same chain as the human contributions on either side of it, in order, with a live record of every step accumulating while the process runs. Nobody writes the trail. The trail is what running the process leaves behind.

That structure is precisely what the regulatory language keeps reaching for. Automatic recording over the lifetime of the system? A workflow engine records transitions automatically or it isn’t one. Attribution? Each step has an actor, human or system, by construction. The incident-response scenario from the previous section? The ordered list of what happened is the process history itself - the four-hour clock starts and the reconstruction is already done, waiting to be exported rather than assembled.
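A toy version of that idea fits in a few dozen lines. The sketch below is not how Tallyfy or any particular workflow engine is built; `WorkflowRun`, `complete_step`, and the history fields are hypothetical names. It shows the structural property only: the single way to advance a step is a call that also records who acted, so the history is a by-product of the work.

```python
from datetime import datetime, timezone

ACTOR_TYPES = {"human", "agent", "automation"}


class WorkflowRun:
    """A run of a fixed sequence of steps that records every transition."""

    def __init__(self, steps):
        self._steps = list(steps)
        self._position = 0
        self._history = []  # only ever appended to by complete_step

    @property
    def current_step(self):
        if self._position < len(self._steps):
            return self._steps[self._position]
        return None

    @property
    def history(self):
        # Hand out copies so callers cannot edit the stored entries.
        return [dict(entry) for entry in self._history]

    def complete_step(self, actor_id, actor_type, decision="completed"):
        """Record who acted on the current step, then advance unless rejected."""
        step = self.current_step
        if step is None:
            raise RuntimeError("this run has no steps left")
        if not actor_id or actor_type not in ACTOR_TYPES:
            raise ValueError("a step cannot advance without an identified actor")
        self._history.append({
            "seq": len(self._history),
            "step": step,
            "actor_id": actor_id,
            "actor_type": actor_type,
            "decision": decision,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })
        if decision != "rejected":
            self._position += 1
```

After `run.complete_step("agent:support", "agent")` followed by `run.complete_step("user:ana", "human", "approved")`, `run.history` already holds the ordered, attributed account an incident report needs. Note that this in-memory sketch is not tamper-evident on its own, and Python's underscore-prefixed attributes are private by convention only.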

Procedure example: AI Incident Response and Rollback Procedure

1. Detect and classify AI incident
2. Assess severity and impact
3. Notify incident response team
4. Contain the issue immediately
5. Investigate root cause

(The template continues with 6 more steps.)

I’ll be precise about what Tallyfy does and doesn’t claim here, because this category attracts overclaiming. The approval records our platform keeps are immutable - the approver can’t retroactively edit a decision, and neither can anyone else - and that’s a property we’ve written about in the SOC 2 context where auditors sample it for real. We are not a cryptographic ledger product, and when enterprise buyers used to pitch us on blockchain audit logs, we declined: strong access controls and hashing deliver what auditors actually test, without the latency theater. If your regulator ultimately demands a formally hash-chained store in the alexgarden sense, you’ll want a specialist layer for it. What a workflow platform gives you is the part that has to exist either way: a complete, ordered, attributed account of who and what acted, step by step, generated by the work itself - the same property that makes well-run workflow automation auditable with or without AI in the loop.

The cheap test of where you stand takes one afternoon: pick the single agent action in your stack with the biggest consequences if it goes wrong, and try to produce its complete history - what triggered it, what it saw, what it did, who reviewed it. If that history falls out of a process record in minutes, you’re closer to what regulators expect than most. If it requires three people and a log-diving expedition, you’ve found the work the next eighteen months are for.

Regulators aren’t asking for exotic engineering. They’re asking for the paperwork a defined process produces by default - and that an undefined one cannot produce at all.
