Summary
- Agents drift because context piles up, not because the model is weak - in a long run the original instruction gets outweighed by pages of tool output and intermediate reasoning, so the agent answers the most recent thing it read instead of the task you gave it.
- Production traces show the forgetting is real - Latitude’s failure analysis found agents that “forgot” a constraint set in turn 1 by turn 15, and the “Lost in the Middle” study from Stanford shows models use information stuck in the middle of a long context far worse than information at the start or end.
- A bigger context window makes it worse, not better - more room to fill means more noise competing with the goal. Capacity was never the thing that was missing.
- Externalize the goal into a process and drift has nowhere to start - the workflow holds the objective, the agent only ever sees one bounded step.
Here’s what actually breaks when an AI agent runs a long job. It doesn’t get dumber halfway through. It loses the thread. Fifteen steps in, the instruction you gave it is one short line sitting at the bottom of a growing pile of tool results, half-finished reasoning, and error messages, and the model is reading the pile, not the line.
That failure mode has a name. Context drift: the agent accumulates so much intermediate junk that the goal stops being the dominant signal, and it quietly reinterprets the task based on whatever it read most recently. Naming it changes how you think about AI doing real work, because the risk was never a wrong answer on step one. It’s a definition of the job that slowly wanders while every individual step still looks fine.
So why doesn’t a smarter model fix it?
Because the problem isn’t intelligence, it’s attention. The model can only weight what’s in front of it, and what’s in front of it is mostly the noise it generated on the way here. That’s worth sitting with before you hand an agent a twenty-step job and walk away.
It also explains why the failure feels so unfair. The agent did every step competently. You can read the trace and watch it reason well at each point, and the job still comes out wrong, because being right at every local step isn’t the same as staying pointed at the global goal. Competence per step and coherence across steps are different things. A long autonomous run quietly trades the second for the first, and you only notice once the final output is confidently aimed at something you never asked for.
Drift is a memory problem, not a model problem
Context drift happens because an autonomous agent carries its whole history forward. Every tool call it makes, every chunk of reasoning, every error it recovered from gets appended to the running context. By the time it reaches step 15, the original task is a thin slice of a very long document, and the model’s attention is dominated by the most recent few thousand tokens, which are all about the sub-problem it just solved. The goal didn’t change. Its share of the agent’s attention did. So the fix isn’t a better model or a longer window, because a longer window just gives the noise more room to grow. The agent needs the goal to live somewhere it can’t be drowned out, which basically means somewhere outside the model’s context entirely.
That last point is the whole post, so it’s worth proving rather than asserting.
Why the goal loses to the latest tool output
Walk a long run forward and you can watch the goal lose ground. Latitude’s analysis of why agents break in production files this under context-window saturation: in long sessions the window fills up, “information from earlier turns gets truncated or lost,” and the agent starts “producing responses that contradict earlier decisions or miss constraints established at the start of the session.” Their reviewers found traces where the agent “forgot” an instruction set in turn 1 by turn 15. Same model, same task, just more history crammed in between.
Turns out there’s a deeper reason for this, and it’s been measured. The Stanford “Lost in the Middle” study, from Nelson Liu, Percy Liang, and colleagues, found that models use long contexts unevenly: “performance is often highest when relevant information occurs at the beginning or end of the input context, and significantly degrades when models must access relevant information in the middle of long contexts.” Your original instruction sits at the very beginning. Fifteen steps later it’s stranded in the middle of a wall of tokens, which is precisely where the model is worst at finding it. The task didn’t get harder. It slid into the model’s blind spot.
Making the window bigger doesn’t rescue the original instruction either. It demotes it further. A larger context means the goal is a smaller fraction of everything the model is weighing, and it sits even deeper in that vulnerable middle. You bought more room and spent all of it on noise. The instruction’s competition grew while the instruction itself stayed one sentence long, which is the opposite of what the “just use a model with a million-token window” pitch promises.
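To put rough numbers on that shrinking fraction, here is a minimal sketch. It assumes a 50-token instruction and about 2,000 tokens of tool output and reasoning per step; both figures are illustrative, not measurements. Token share is only a crude stand-in for how a model weights its input, so read this as an indication of the trend. The function name `goal_share` is made up for this example.

```python
def goal_share(goal_tokens: int, tokens_per_step: int, steps: int) -> float:
    """Fraction of the running context taken up by the original instruction.

    Assumes every step appends roughly the same amount of text and nothing
    is truncated, which is what a large enough window allows.
    """
    return goal_tokens / (goal_tokens + tokens_per_step * steps)


# Illustrative numbers only: a 50-token goal, ~2,000 tokens added per step.
for steps in (1, 5, 15):
    print(f"after {steps:>2} steps: {goal_share(50, 2000, steps):.2%}")
# after  1 steps: 2.44%
# after  5 steps: 0.50%
# after 15 steps: 0.17%
```

A bigger window doesn’t change the numerator. It only lets `steps` keep growing before anything gets cut, so the share keeps falling.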
Recency wins, and the goal is never the most recent thing.
That single fact explains most drift you’ll see in production. The agent isn’t ignoring you on purpose. It’s doing what attention does, which is favor what’s close, and after fifteen steps your instruction is the furthest thing away.
Teams usually try to patch this without changing the structure, and the patches all sag the same way. The first instinct is to re-paste the goal into the prompt every few steps. That helps for a step or two, then the re-pasted goal becomes one more line in the pile, competing with everything else, and the agent drifts back to weighting the recent.
The second instinct is to summarize the context periodically to keep it short. Summaries drop detail by definition, so now the agent reasons over a lossy compression of its own history and quietly loses the constraint that mattered. The third instinct is to retrieve the original instruction back in whenever it seems relevant, which assumes the agent can tell when its goal is slipping. It can’t. Drift doesn’t announce itself, which is the whole problem.
None of those are dumb ideas. They’re the obvious moves, and they share one flaw: each leaves the goal inside the model’s context and then fights the model’s own attention to keep it visible. That’s a losing battle against the math. The goal has to live somewhere the context can’t outvote it, and the model’s context is the one place it always can.
What context drift looks like in a real run
Picture an agent running a routine procurement job. The task is plain: approve the standing purchase order for an existing vendor, the same one the team renews every quarter, as long as the terms still match. Early steps go fine. Then it hits a few line items that don’t line up, reads some supplier emails flagging a price change, and works through a couple of exceptions. By step twelve its context is stuffed with pricing disputes and back-and-forth negotiation language. Now it reaches the approval step. Instead of approving the standing PO it was sent to handle, it drafts a counter-offer and tries to negotiate the rate down.
Who told it to negotiate? Nobody did. The agent absorbed the tone of its recent context and swapped “approve this” for “push back on this” without a single error firing. The goal was never deleted. It got outvoted by the dozen messy steps that came after it.
You can find the same shape anywhere a job runs long. A research agent told to “pull the three cheapest vendors that meet our specs” reads forty pages of comparisons and ends up writing a balanced market overview nobody asked for, because the recent context was all analysis and the original ask was a short list. A support agent told to “tag this ticket and route it” reads a long, heated thread and drafts a full apology plus a refund, because the thread’s tone became its instruction. Different domains, identical failure. The goal was a single early sentence, and the run buried it under everything that came next.
Something we learned the hard way on our own systems is that an AI tool will hand you a confident, specific number and be wrong about it for exactly this reason. We’ve watched one report a dozen open items with total certainty when there were actually several hundred. It wasn’t lying. The data it received had been truncated, it only ever saw that slice, and it answered honestly about the only context it could see. Confident, precise, and completely wrong, because the context it trusted wasn’t the whole picture.
A process owns the goal so the agent doesn’t have to
So if a bigger model won’t hold the goal, what will? Something outside the model. A defined process keeps the objective in a place the context can’t bury, and hands the agent one step at a time. The agent never has to remember the whole job, because the whole job isn’t its responsibility. Instead, the workflow owns the goal, the sequence, and the definition of done, and the agent owns the current step and nothing else.
That’s the structural fix for drift, not a workaround. When each step’s context is bounded by the step’s own definition instead of the entire run’s history, there’s nothing for the goal to lose ground to. Step seven doesn’t inherit the pile of pricing-dispute tokens from steps four through six. It gets a clean, narrow instruction and the handful of inputs that step needs. We built Tallyfy’s live status tracking and automation rules around keeping that state outside the model, because a process that already names every step and owner is a process the agent can lean on instead of holding the plan in its head.
Make that concrete. Without a process, at step seven the agent’s context is everything that happened in steps one through six: the documents, the tool calls, the dead ends, the goal somewhere up top. With a process, at step seven the agent gets a step definition that says what this step is, the two or three fields it needs, and the rule for done, and none of the six steps of history. It can’t drift toward the last step’s topic, because the last step’s topic isn’t in front of it anymore. The narrower the window the process hands over, the less surface there is for the goal to erode. That isn’t a prompt trick. It’s the difference between asking a model to remember and not asking it to remember at all.
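The contrast can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how Tallyfy or any particular workflow engine is implemented: `StepDefinition`, `unbounded_context`, `bounded_context`, and the procurement values (including the PO number) are all hypothetical names invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StepDefinition:
    """Hypothetical step record held by the process, not by the model."""
    name: str
    instruction: str
    required_fields: tuple[str, ...]
    done_rule: str


def unbounded_context(goal: str, history: list[str]) -> str:
    """Free-running agent: the goal plus every prior output, in order."""
    return "\n".join([goal, *history])


def bounded_context(step: StepDefinition, state: dict[str, str]) -> str:
    """Process-driven agent: this step's definition and only its inputs."""
    inputs = "\n".join(f"{key}: {state[key]}" for key in step.required_fields)
    return (
        f"Step: {step.name}\n"
        f"Do: {step.instruction}\n"
        f"Done when: {step.done_rule}\n"
        f"Inputs:\n{inputs}"
    )


# Made-up run history and process state for the procurement example.
history = [
    "Line item 4 does not match the invoice.",
    "Supplier email: prices are going up next quarter.",
    "Drafted a reply disputing the new rate.",
]
state = {
    "po_number": "PO-1042",  # hypothetical identifier
    "terms_match": "yes",
    "supplier_thread": "\n".join(history),  # kept by the process, unused here
}
approve = StepDefinition(
    name="approve_standing_po",
    instruction="Approve the standing purchase order if the terms match.",
    required_fields=("po_number", "terms_match"),
    done_rule="The PO is marked approved or rejected.",
)

print(bounded_context(approve, state))
```

The supplier thread still exists in `state`, so nothing is thrown away. It just never reaches the approval step, because that step’s definition doesn’t ask for it.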
Give the agent a smaller job and it has less room to forget.
This is the memory-side companion to the arithmetic of compounding errors. A 20-step agent already fails most runs on compounding errors alone, because small per-step failure rates multiply across the run, and drift is the same story told from the other direction: even the steps that don’t fail outright can quietly aim at the wrong goal. Both problems share a root and a fix, which is why an AI agent needs a workflow engine holding the state it can’t be trusted to keep.
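For readers who haven’t seen the compounding arithmetic, a quick sketch. It assumes every step succeeds independently with the same probability, and the 95% and 99% figures are illustrative, not measured rates for any real agent.

```python
def run_success_probability(per_step_success: float, steps: int) -> float:
    """Chance that every step in the run succeeds, assuming independent steps."""
    return per_step_success ** steps


# Illustrative per-step rates, not benchmarks.
print(f"{run_success_probability(0.95, 20):.1%}")  # 35.8%
print(f"{run_success_probability(0.99, 20):.1%}")  # 81.8%
```

At 95% per step, roughly two runs in three go wrong somewhere across twenty steps, and that is before counting any step that succeeded while aimed at the wrong goal.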
Point the agent at one step, not the whole job
The takeaway is almost boring. Don’t hand an agent a long, open-ended assignment and trust it to remember what you meant fifteen steps later. Hand it one bounded step, with the goal held by the process around it, and let it do the thing models are genuinely good at: reading, classifying, drafting, judging a single input. Then move to the next step with a fresh, narrow context and do it again.
We have watched enough of these projects to call it early: the teams who beat drift never find a model that remembers better. They build a process that does the remembering for the model.
You can spot drift risk before you build, on a whiteboard, by counting two things about the job. How many steps will the agent run before a human or a check sees the output, and how much unrelated context will it read along the way? A three-step job that reads almost nothing rarely drifts. A twenty-step job that reads documents, calls tools, and handles exceptions is drift waiting to happen, no matter how sharp the model is. The longer the unsupervised run and the noisier the context, the more the goal needs to live outside the model. That’s the tell, and it shows up in the shape of the job long before a single line of code exists.
Drift isn’t a defect you can wait out. Next year’s smarter model will still carry its whole history forward, still weight the recent over the original, still lose the middle of a long context. Scoping the agent to one step is the no-brainer move precisely because it doesn’t depend on the model improving. What changes the outcome is whether the goal lives somewhere the model can’t bury it. Put the process in charge of remembering, and the agent is free to forget everything except the step in front of it.