The Frozen AI Problem: How Researchers Are Teaching LLMs to Keep Learning

The AI tools you use every day (Claude, ChatGPT, Gemini) are, in a meaningful sense, frozen in time. They learned everything they know during a multi-month training run, and then the learning stopped. Research published in 2025 and 2026 is attacking this problem from six different angles at once. Here’s what’s actually working, what isn’t, and what it means for the AI you’ll be using a few years from now.

66%: Best model score on streaming fact-tracking (humans: ~95%)

54%: GPT-5.4 failure rate on problems it previously solved, after building a memory

3×: Efficiency gain from separating “fast” and “slow” learning

2M: Sequential knowledge edits handled by the new state-of-the-art

The Problem

Why Your AI Can’t Learn New Things

When you use Claude or ChatGPT, you’re talking to a model that went through a single enormous training run, consuming trillions of tokens of text over weeks or months, and then got locked. That frozen state lives in its weights: billions of numerical parameters that encode everything the model “knows.”

After training, the model gets some additional tuning (teaching it to be helpful, honest, and safe), and then it’s deployed. From that point on, it doesn’t learn. Not from your conversations. Not from the news. Not from the papers published last week. The context window gives it temporary memory during a single session, but when you close that session, nothing persists. Nothing changes in the underlying model.

This is a fundamental limitation with real consequences. A model such as ChatGPT knows nothing about what happened after its training ended: recent events, newly published research, updated medical guidelines, or the software library you just open-sourced. You can tell it these things in a message, but it won’t remember them next time, and it certainly won’t update its underlying knowledge from them.

An analogy that holds up

Imagine hiring someone brilliant who read every book ever written, but then had their learning centers surgically removed. They can reason from what they know, apply knowledge to your specific situation via the conversation (the context window), and even retrieve documents you give them (RAG). But they cannot genuinely learn something new and internalize it. Tomorrow they won’t remember today. And if an important new fact contradicts something they believe, they’ll keep believing the old thing.

That’s the current state of deployed LLMs.

Researchers call the technical problem “continual learning” — keeping a model learning across its deployment lifetime without destroying what it already knows. It turns out this is hard in ways that aren’t obvious. And in 2025-2026, the field has made more progress on it than in the preceding five years combined.

Why It’s Hard

The Forgetting Problem (It’s Worse Than You Think)

The naive solution seems obvious: just keep training the model on new data. The problem is what researchers call catastrophic forgetting. When you fine-tune an AI on new information, it doesn’t neatly tuck that information into an unused corner. It overwrites existing knowledge.

A concrete example: take LLaMA 2 (13 billion parameters), and fine-tune it on a new domain. Measure its performance on a math reasoning benchmark called GSM8K before and after. The result, documented in the TRACE benchmark paper? It drops from 28.8% accuracy to 2%. You taught it new things, but in doing so you essentially deleted its ability to do math.

Researchers now understand why this happens at a mechanistic level — a finding from January 2026 that clarifies a lot of the previous confusion.

What actually causes forgetting (2026 finding): In large language models, 67% of the attention mechanism’s key and query projections show negative gradient alignment between different tasks — meaning new learning actively fights against old learning in the most critical parts of the network. Layers 4–12 also show significant “representational drift”: the internal meaning of concepts shifts as new knowledge is absorbed. This isn’t a capacity problem. The model has plenty of room. It’s a geometry problem — the directions in weight space that represent old knowledge get overwritten by the directions needed for new knowledge.
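To make “negative gradient alignment” concrete, here is a minimal sketch on a toy linear model. Everything in it (the three-weight model, the two examples, the learning rate) is an illustrative assumption and not taken from the paper; it only shows what it means for two tasks’ gradients to point in opposing directions, and why a step on the new task then raises the loss on the old one.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Cosine similarity: below zero means the two gradients conflict."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def loss(w, x, y):
    """Squared error of a toy linear model on one example."""
    return 0.5 * (dot(w, x) - y) ** 2

def grad(w, x, y):
    """Gradient of loss() with respect to the weights w."""
    err = dot(w, x) - y
    return [err * xi for xi in x]

# Hypothetical shared weights and one example per task.
w = [0.5, -0.2, 0.1]
old_task = ([1.0, 0.0, 1.0], 2.0)
new_task = ([1.0, 1.0, 0.5], -1.0)

g_old = grad(w, *old_task)
g_new = grad(w, *new_task)
alignment = cosine(g_old, g_new)  # negative for this pair of tasks

# One plain gradient step on the new task only.
lr = 0.1
w_after = [wi - lr * gi for wi, gi in zip(w, g_new)]
old_loss_before = loss(w, *old_task)
old_loss_after = loss(w_after, *old_task)  # higher than before the step
```

The weights had room to spare here, yet the old task got worse: the damage comes from the direction of the update, not from running out of capacity.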

Understanding the mechanism changed the approach. Most of the best 2025-2026 work tries to solve the geometry problem rather than fighting forgetting head-on.

Six Attack Vectors

What Researchers Are Trying

1. Build a Separate Memory Module That Learns at Runtime

The most architecturally ambitious approach comes from Google Research. Their paper “Titans,” published in January 2025, proposes adding a distinct long-term memory module to the transformer architecture, one that actually updates its parameters as it reads text.

The key innovation: this memory module uses a “surprise” signal (how unexpected is this token given what came before?) to decide what’s worth remembering. High-surprise content gets written more strongly into the memory weights. Low-surprise content (things the model already knows well) barely registers. The memory module also has a built-in forgetting mechanism that functions a lot like human memory: old, rarely accessed patterns fade.

Google Research · January 2025 · arXiv:2501.00663

Titans: Learning to Memorize at Test Time

Ali Behrouz, Peilin Zhong, and Vahab Mirrokni add a neural long-term memory module to transformers that keeps learning through small gradient-descent updates during inference, with no separate fine-tuning run. The paper describes three architectural variants for different use cases.

Needle-in-haystack at 16k tokens: 97.4% vs. 5.4% for nearest baseline
Scales to 2M+ token context windows
Outperforms GPT-4 on BABILong reasoning with ~70× fewer parameters

The result is a model that can handle context windows over two million tokens long — about the length of ten novels — while maintaining accurate recall throughout. That’s not just a quantitative improvement over today’s systems; it’s a qualitative change in what AI-assisted work looks like.
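The surprise-driven write described above can be sketched in a few lines. This is a simplified illustration, not the Titans implementation: the memory here is a single linear map instead of a deep module, and the class name and the constants `lr`, `momentum`, and `decay` are assumptions chosen for readability. The shape of the update (a momentum term accumulating the surprise gradient, plus a decay term that slowly forgets) follows the description in the text.

```python
class SurpriseMemory:
    """Toy linear associative memory that is updated while it reads."""

    def __init__(self, dim, lr=0.1, momentum=0.5, decay=0.01):
        self.dim = dim
        self.lr = lr              # how strongly surprise is written
        self.momentum = momentum  # carries past surprise forward
        self.decay = decay        # forgetting: old content slowly fades
        self.M = [[0.0] * dim for _ in range(dim)]
        self.S = [[0.0] * dim for _ in range(dim)]

    def read(self, key):
        return [sum(m * k for m, k in zip(row, key)) for row in self.M]

    def surprise(self, key, value):
        """Squared error between what the memory recalls and what it sees."""
        return sum((p - v) ** 2 for p, v in zip(self.read(key), value))

    def write(self, key, value):
        # Error is computed once from the memory as it was before this write.
        err = [p - v for p, v in zip(self.read(key), value)]
        for i in range(self.dim):
            for j in range(self.dim):
                g = 2.0 * err[i] * key[j]  # gradient of the surprise loss
                self.S[i][j] = self.momentum * self.S[i][j] - self.lr * g
                self.M[i][j] = (1.0 - self.decay) * self.M[i][j] + self.S[i][j]


memory = SurpriseMemory(dim=3)
seen_key, seen_value = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]
surprise_before = memory.surprise(seen_key, seen_value)
for _ in range(20):
    memory.write(seen_key, seen_value)
surprise_after = memory.surprise(seen_key, seen_value)       # close to zero
novel_surprise = memory.surprise([0.0, 1.0, 0.0], [0.0, 0.0, 1.0])  # still high
```

Content the memory has already absorbed produces a tiny error and therefore a tiny write, while unseen content still produces a large one. That asymmetry is the “surprise” signal doing the triage.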

2. Make Existing Architecture Weights Dynamic at Inference Time

A different approach: don’t add new components. Instead, take parts of the existing architecture that were previously static and let them update during inference.

This is the idea behind “In-Place Test-Time Training,” which was accepted as an oral presentation at ICLR 2026, a slot reserved for a small fraction of accepted papers at one of the most competitive AI research conferences.

ICLR 2026 Oral · April 2026 · arXiv:2604.06169

In-Place Test-Time Training

Guhao Feng and colleagues repurpose the final projection matrix of MLP blocks — a component that exists in every transformer — as “fast weights” that update during inference. They also discovered that the learning objective used by prior test-time training methods was theoretically near-useless for language modeling, and replaced it with one aligned to next-token prediction.

4B-parameter model outperforms larger models on 128k-token contexts
Drop-in enhancement: works on existing LLMs without retraining
Negligible compute and memory overhead

The theoretical finding buried in this paper is striking: they mathematically proved that what previous test-time training methods were optimizing for had “negligible expected effect” on making the model better at language tasks. Five years of prior work was built on a flawed objective. The right objective, it turns out, is simpler and more effective.

3. Separate “Fast Learning” from “Slow Learning”

Researchers at Databricks and UC Berkeley published what may be the most practically important paper of this period in May 2026. Their insight comes from cognitive science: humans appear to learn at two different timescales simultaneously — fast, volatile working memory and slow, stable long-term memory. Their framework applies this to LLMs.

Databricks / UC Berkeley · May 2026 · arXiv:2605.12484

Learning, Fast and Slow: Towards LLMs That Adapt Continually

Rishabh Tiwari and colleagues separate task-specific information into “fast weights” (optimized text prompts that absorb task knowledge) and “slow weights” (model parameters that stay close to the base model). Fast weights update readily from feedback; slow weights update rarely and deliberately. The key twist: “fast weights” aren’t neural weights at all — they’re optimized text.

3× more sample-efficient than standard RL-based training
70% less drift from base model at matched performance levels
Models retain plasticity: successfully learn second task where RL-only models fail completely

The counterintuitive finding: RL-only training (the standard approach to improving models via reinforcement learning) makes models worse at learning new things over time. After enough RL training, the model’s weights are so far from the base model that introducing new task information causes collapse. The fast-slow framework avoids this by keeping the model anchored near its original state while letting context absorb task-specific signal.
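The idea of “fast weights that are really text” can be sketched with a toy. In the code below, `frozen_model` is a hypothetical stand-in for an LLM whose parameters never change, and the candidate prompts and example tasks are invented for illustration; the paper’s actual prompt-optimization procedure is not reproduced. The point is only that adapting to a new task means picking a new prompt from feedback, while the model itself is left alone.

```python
def frozen_model(prompt: str, number: float) -> str:
    """Toy stand-in for a frozen LLM: only the prompt changes its behaviour."""
    text = f"{number:.2f}" if "two decimals" in prompt else str(number)
    if "add unit" in prompt:
        text += " kg"
    return text

def reward(prompt, examples):
    """Fraction of feedback examples the prompted model gets exactly right."""
    hits = sum(frozen_model(prompt, x) == y for x, y in examples)
    return hits / len(examples)

def fit_fast_weights(candidate_prompts, examples):
    """'Training' the fast weights is a search over text, not over parameters."""
    return max(candidate_prompts, key=lambda p: reward(p, examples))

# Hypothetical candidate prompts and two small tasks with feedback.
candidates = [
    "Answer plainly.",
    "Use two decimals.",
    "Answer plainly and add unit.",
]
task_a = [(3.14159, "3.14"), (2.5, "2.50")]
task_b = [(3.0, "3.0 kg"), (12.5, "12.5 kg")]

prompt_a = fit_fast_weights(candidates, task_a)
prompt_b = fit_fast_weights(candidates, task_b)  # second task, same frozen model
```

Because nothing in `frozen_model` moved while learning task A, there is nothing to collapse when task B arrives: the old prompt can be kept, swapped, or thrown away.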

4. Make Precise Surgical Edits to Specific Knowledge

Rather than updating a model’s general knowledge continuously, what if you could surgically change specific facts? Tell the model “The CEO of Company X is now Person Y” and have it update that specific belief without touching anything else?

This is the “model editing” or “knowledge editing” research thread, and it’s seen significant advances in 2025-2026.

May 2025 (revised 2026) · arXiv:2505.14679

UltraEdit: Training-, Subject-, and Memory-Free Lifelong Editing

Xiaojie Gu and colleagues compute a targeted parameter shift in one step — no separate training process, no external memory store, no model-specific optimization. A “lifelong normalization” strategy tracks internal statistics across sequential edits to prevent interference between updates.

7× faster than prior state-of-the-art
4× less GPU memory required
Tested at 2 million sequential edits

The UltraEditBench dataset they published alongside this paper contains over two million editing pairs — ten times larger than any prior benchmark. The field went from “can we edit a model at all” to “can we do it two million times in a row” in about three years.
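To give a feel for what a one-step parameter shift looks like, here is a generic closed-form rank-one edit on a tiny linear layer. This is not UltraEdit’s procedure (its lifelong normalization is left out entirely); the matrix, keys, and targets are made-up values. The sketch shows both halves of the problem: a single edit lands exactly, and a later edit whose key overlaps an earlier one disturbs it.

```python
def matvec(W, k):
    return [sum(w * x for w, x in zip(row, k)) for row in W]

def rank_one_edit(W, key, target):
    """Return a new W with W @ key == target, changing W as little as possible."""
    residual = [t - p for t, p in zip(target, matvec(W, key))]
    scale = sum(x * x for x in key)
    return [
        [w + r * x / scale for w, x in zip(row, key)]
        for row, r in zip(W, residual)
    ]

W0 = [[0.0] * 3 for _ in range(3)]

# Two edits with orthogonal keys: the second leaves the first intact.
key_1, fact_1 = [1.0, 0.0, 0.0], [0.0, 2.0, 0.0]
key_2, fact_2 = [0.0, 1.0, 0.0], [5.0, 0.0, 0.0]
W1 = rank_one_edit(W0, key_1, fact_1)
W2 = rank_one_edit(W1, key_2, fact_2)
recall_after_two = matvec(W2, key_1)    # still equals fact_1

# A third edit whose key overlaps key_1: the first fact is now disturbed.
key_3, fact_3 = [1.0, 1.0, 0.0], [0.0, 0.0, 1.0]
W3 = rank_one_edit(W2, key_3, fact_3)
recall_after_three = matvec(W3, key_1)  # no longer equals fact_1
```

Real keys are never perfectly orthogonal, which is why interference control across a long sequence of edits is the hard part at scale.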

An interesting variant is LightEdit (April 2026), which takes a completely different approach. Instead of modifying the model’s weights, it uses a decoding strategy that suppresses the outdated knowledge during generation: the old belief is still in there, but it’s overridden at output time. Think of it like crossing out a word in a document rather than erasing it. It’s computationally lighter and gets competitive results.
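The “crossing out rather than erasing” idea can be illustrated with a plain logit penalty at decoding time. This is a generic sketch, not LightEdit’s actual algorithm; the token names, logit values, and `penalty` constant are assumptions made for the example.

```python
import math

def softmax(logits):
    """Turn a {token: logit} dict into a {token: probability} dict."""
    top = max(logits.values())
    exps = {tok: math.exp(val - top) for tok, val in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def suppress(logits, outdated_tokens, penalty=5.0):
    """Return adjusted logits; the model's own logits are left untouched."""
    return {
        tok: (val - penalty if tok in outdated_tokens else val)
        for tok, val in logits.items()
    }

# Hypothetical next-token logits for "The CEO of Company X is ..."
model_logits = {"Alice": 3.0, "Bob": 2.0, "Carol": 0.5}

answer_before = max(model_logits, key=model_logits.get)
edited_logits = suppress(model_logits, outdated_tokens={"Alice"})
answer_after = max(edited_logits, key=edited_logits.get)
probs_after = softmax(edited_logits)
```

The stale belief is still present in `model_logits`; it simply loses at output time, which also means the edit can be undone by dropping the penalty.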

5. Design Architecture So Different Knowledge Can’t Overwrite Each Other

The most fundamental approach: fix the write geometry problem at the architectural level, so that new learning is structurally unable to overwrite old knowledge.

May 2026 · arXiv:2605.15053

TFGN: Task-Free, Replay-Free Continual Pre-Training

Anurup Ganguli proposes a “read/write decomposition” overlay for transformers. The forward pass (reading) stays completely normal. But the backward pass (writing/learning) is restructured so that updates for different knowledge domains are forced into orthogonal subspaces — they literally cannot overlap.

Backward transfer of −0.007 (near zero forgetting)
Standard fine-tuning on the same data: −0.374 (51× more forgetting)
Positive cross-domain transfer: training on Python improved JavaScript performance by 26.8%

That last result is the surprising one. Not only does the approach prevent forgetting — it actually produces forward transfer: training on Python made the model better at JavaScript, a related but distinct language, without ever seeing JavaScript. The shared structure in the subspaces carries useful signal across domains.
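For readers unfamiliar with the backward-transfer figures quoted above, here is how the metric is commonly computed in the continual-learning literature: the average change in accuracy on earlier tasks once all training is done. The accuracy matrix below is invented for illustration, and the paper may use a variant of this definition.

```python
def backward_transfer(acc):
    """acc[i][j] = accuracy on task j measured right after training on task i.

    Returns the mean of (final accuracy - accuracy just after learning)
    over every task except the last. Negative values mean forgetting.
    """
    last = len(acc) - 1
    deltas = [acc[last][j] - acc[j][j] for j in range(last)]
    return sum(deltas) / len(deltas)

# Hypothetical results for three tasks learned in sequence.
accuracy = [
    [0.80, 0.10, 0.05],  # after training on task 0
    [0.50, 0.75, 0.10],  # after training on task 1
    [0.40, 0.45, 0.70],  # after training on task 2
]
bwt = backward_transfer(accuracy)  # (0.40 - 0.80 + 0.45 - 0.75) / 2
```

A value near zero means earlier tasks ended up about as good as they were when first learned; a strongly negative value means they were overwritten.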

The paper also documents a vivid failure mode of standard fine-tuning: after training on Python code, a standard model starts inserting Python syntax into prose text mid-sentence. TFGN never does this.
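The “forced into orthogonal subspaces” idea can be sketched as a gradient projection. This is in the spirit of orthogonal-gradient methods generally and does not reproduce TFGN’s read/write decomposition; the basis and gradient values are invented, and the basis is assumed to be orthonormal.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def project_out(gradient, protected_basis):
    """Strip from the gradient every component lying in the protected subspace.

    protected_basis is assumed to be a list of orthonormal vectors that
    span the directions an earlier domain depends on.
    """
    result = list(gradient)
    for direction in protected_basis:
        coeff = dot(result, direction)
        result = [r - coeff * d for r, d in zip(result, direction)]
    return result

# Hypothetical: domain A relies on the first coordinate direction.
domain_a_basis = [[1.0, 0.0, 0.0]]
domain_b_gradient = [0.5, -1.0, 2.0]

safe_update = project_out(domain_b_gradient, domain_a_basis)
overlap = dot(safe_update, domain_a_basis[0])  # zero: domain A is untouched
```

The forward pass never sees this projection; only the write is constrained, which matches the read/write split the paper describes at a high level.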

6. Teach the Model to Be Its Own Teacher

All five approaches above either change the architecture or change how weights are updated at inference time. MIT’s Shenfeld et al. took a different angle: change how learning itself works, using a trick hiding in plain sight inside every capable language model.

The insight: large models already know how to use demonstrations in context. If you show Claude an example of how to solve a chemistry problem and then ask it a new chemistry question, it implicitly adapts its output style and reasoning to match the demonstration — that’s in-context learning. SDFT asks: what if we used that ability as a training signal?

MIT · Improbable AI Lab · January 2026 · arXiv:2601.19897

Self-Distillation Enables Continual Learning (SDFT)

Idan Shenfeld, Mehul Damani, Jonas Hübotter, and Pulkit Agrawal split the same model into two roles using only context conditioning. The teacher is the model shown both the task input and an expert demonstration — it uses in-context learning to approximate the optimal next response. The student is the same model shown only the task input — the version that will actually be deployed. Training minimizes the gap between them, but crucially on examples the student itself generates (on-policy), not on fixed pre-collected data.

Knowledge acquisition: 89% strict accuracy vs. 80% for standard fine-tuning
Nearly matches oracle RAG (91%), a parametric method vs. perfect retrieval
OOD generalization: 98% vs. 80% for standard fine-tuning
Preserves reasoning depth even with answer-only supervision

The “on-policy” part is the critical ingredient — and the paper proves this cleanly by running the same teacher model in offline mode (generating training data from fixed examples rather than the student’s own outputs). Offline with the same teacher still loses to on-policy SDFT. It’s not just about having a good teacher; it’s about anchoring the update to where the student currently is.
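A toy version of the teacher/student setup makes the on-policy idea easier to see. The sketch below assumes a reverse-KL objective between student and teacher, three possible responses instead of token sequences, and made-up probabilities; it illustrates the training signal and is not the paper’s implementation.

```python
import math
import random

def softmax(logits):
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reverse_kl(student, teacher):
    """KL(student || teacher): the expectation is taken under the student."""
    return sum(p * (math.log(p) - math.log(q))
               for p, q in zip(student, teacher) if p > 0)

def on_policy_estimate(student, teacher, n_samples, rng):
    """Estimate the same quantity from responses the student itself samples."""
    picks = rng.choices(range(len(student)), weights=student, k=n_samples)
    return sum(math.log(student[y]) - math.log(teacher[y]) for y in picks) / n_samples

def kl_gradient(logits, teacher):
    """Exact gradient of KL(student || teacher) w.r.t. the student logits."""
    p = softmax(logits)
    kl = reverse_kl(p, teacher)
    return [pj * (math.log(pj / qj) - kl) for pj, qj in zip(p, teacher)]

# Teacher: same model, but conditioned on an expert demonstration (assumed values).
teacher_probs = [0.1, 0.7, 0.2]
# Student: the model without the demonstration, currently preferring response 0.
student_logits = [2.0, 0.0, 0.0]

initial_kl = reverse_kl(softmax(student_logits), teacher_probs)
sampled_kl = on_policy_estimate(softmax(student_logits), teacher_probs,
                                5000, random.Random(0))

for _ in range(500):
    step = kl_gradient(student_logits, teacher_probs)
    student_logits = [z - 0.5 * g for z, g in zip(student_logits, step)]

final_kl = reverse_kl(softmax(student_logits), teacher_probs)
```

Because the expectation is over the student’s own outputs, each update is weighted by what the student currently does, which is the property the offline ablation removes.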

Why the on-policy part matters

Imagine training a pianist by showing them recordings of Horowitz. Standard fine-tuning is like making them copy Horowitz note-for-note from the recordings — the gap between their current ability and the target is enormous, so they strain and their other pieces suffer. SDFT is more like: play the passage your own way first, then let Horowitz’s style subtly reshape what you just played. The update starts from where you actually are, so the rest of your repertoire doesn’t get disrupted.

A few results stand out as genuinely striking. When fine-tuned with answer-only supervision — no chain-of-thought traces, just final answers — standard fine-tuning causes reasoning collapse: the model’s average response length drops from 4,612 tokens to 3,273 and accuracy falls from 31% to 23%. SDFT under the same conditions preserves reasoning depth (4,180 tokens) and reaches 43.7% accuracy. The demonstration-conditioned teacher carries reasoning patterns implicitly, even when the supervision signal doesn’t.

There’s also a hard limitation worth naming: the method requires strong in-context learning ability to work. At 3 billion parameters it actually underperforms standard fine-tuning, because the teacher is too weak. The method only becomes reliably better at 7B+, and the gains widen at 14B. This makes it a technique for larger models with strong in-context learning, not a universal solution.

What Doesn’t Work

The Negative Results (These Matter)

The most useful findings in this literature aren’t the wins. They’re the papers that definitively closed doors that everyone thought were open.

The Memory Problem: Distilling Notes Doesn’t Work As Well As Keeping Raw Logs

One popular approach to giving AI “persistent memory” is to have the model distill its experiences into a textual memory bank that it updates over time. Systems like MemGPT/Letta do this: after each session, the AI rewrites its memory notes to incorporate what it learned. Intuitively this seems right — isn’t this what humans do?

A May 2026 paper ran the experiment rigorously and found something troubling. Memory quality follows an arc: it rises at first as the model accumulates experience, then degrades, eventually falling below the no-memory baseline entirely.

The striking finding: Even when researchers fed GPT-5.4 ground-truth correct solutions and had it write memory notes from those solutions, the model subsequently failed on 54% of problems it had previously solved correctly without any memory. The act of converting correct solutions into memory notes corrupted the information. The same raw trajectory information, kept unprocessed, outperformed all the condensed-memory systems tested.

The implication: LLMs are not good at compressing their own experience into reusable lessons. The compression step introduces errors that compound over time. The better approach, at least for now, may be keeping raw episode logs and retrieving them verbatim rather than distilling them.
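The “keep raw logs, retrieve verbatim” alternative is simple to sketch. The store below is a hypothetical minimal design: the class names, the word-overlap scoring, and the sample episodes are all assumptions, and a real system would use a proper retriever. What matters is that nothing is rewritten on the way in or out.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Episode:
    task: str  # short description used only for lookup
    log: str   # the raw trajectory, stored exactly as it happened

class EpisodeStore:
    """Append-only store: episodes are never summarized or rewritten."""

    def __init__(self):
        self._episodes = []

    def add(self, task, log):
        self._episodes.append(Episode(task, log))

    def retrieve(self, query, k=1):
        """Return the k raw logs whose task shares the most words with the query."""
        words = set(query.lower().split())
        ranked = sorted(
            self._episodes,
            key=lambda ep: len(words & set(ep.task.lower().split())),
            reverse=True,
        )
        return [ep.log for ep in ranked[:k]]


store = EpisodeStore()
raw_log = "step 1: read csv\nstep 2: drop rows with NaN\nstep 3: groupby region"
store.add("clean sales csv and aggregate by region", raw_log)
store.add("write unit tests for parser", "step 1: list edge cases")

recalled = store.retrieve("aggregate sales by region")
```

The retrieved text is byte-for-byte what was stored, so there is no compression step in which errors can creep in and compound.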

Even the Best Models Fail at Tracking Changing Facts

A March 2026 benchmark paper called OAKS tested 14 leading models on a deceptively simple task: read a long document where facts change over time, and answer questions correctly about the current state of those facts.

Imagine a document that says “Alice is the CEO” at page 1, then “Bob replaced Alice as CEO” at page 50, then “Carol replaced Bob” at page 120. At the end, who is the CEO? This requires not just reading comprehension but active state-tracking — updating internal beliefs as facts evolve.

66%: Best model (Gemini) accuracy on OAKS-BABI

33%: Average open-source model accuracy

38%: Rate of correctly detecting a state transition

32%: Rate of updating when nothing actually changed

The last two numbers are particularly revealing. Models could detect that a change occurred about 38% of the time. But 32% of the time they also “updated” on text that didn’t represent any change at all — confusing mention of an old fact for an update. Change detection and correct updating are essentially independent problems, and models are failing at both.
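The difference between an update and a mere mention is easy to state in code, even though models struggle with it in free text. The sketch below assumes each passage has already been labeled as an update or a mention, which is exactly the hard part; the event format and the extra page-140 mention are invented to extend the CEO example.

```python
from typing import NamedTuple

class Event(NamedTuple):
    page: int
    kind: str    # "update" changes the state; "mention" only refers to it
    role: str
    person: str

def track_state(events):
    """Reference tracker: only genuine updates change the current state."""
    state = {}
    for event in sorted(events, key=lambda e: e.page):
        if event.kind == "update":
            state[event.role] = event.person
    return state

def last_mention_wins(events):
    """Naive tracker: treats every mention of a name as if it were an update."""
    state = {}
    for event in sorted(events, key=lambda e: e.page):
        state[event.role] = event.person
    return state

events = [
    Event(1, "update", "CEO", "Alice"),
    Event(50, "update", "CEO", "Bob"),
    Event(120, "update", "CEO", "Carol"),
    Event(140, "mention", "CEO", "Alice"),  # a look back at Alice's tenure
]

correct_ceo = track_state(events)["CEO"]
naive_ceo = last_mention_wins(events)["CEO"]
```

The naive tracker makes the same mistake the benchmark measures: it “updates” on text that describes no change at all.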

Enabling “thinking mode” (chain-of-thought reasoning) barely helped with the core state-tracking problem: it improved multi-hop reasoning but not the fundamental issue of keeping up with changing facts. Adding RAG made things no better, and the agentic memory systems (HippoRAG, MemAgent) actually underperformed naive approaches.

Self-Improving Agents Keep Breaking Their Old Abilities

A related paper studied what happens when AI agents are allowed to continuously improve themselves — updating their own workflows, skills, and memory based on experience. The finding: all four types of self-improvement degrade previous capabilities. Agents get better at new tasks but worse at old ones in a consistent, non-monotonic pattern.

The most destructive form of self-improvement turned out to be workflow evolution — agents adding steps to their own execution procedures. Unconstrained, this produces bloated workflows that collapse under their own complexity. The paper proposes a stabilization framework (CPE) that constrains adaptation to prevent this drift, and importantly shows that the constraint improves both retention and new-task performance — the stability-plasticity tradeoff isn’t always a tradeoff.

How to Think About This

What This Means for the AI You’re Using

If you’re a power user of Claude, ChatGPT, or similar tools, here’s the practical read on where things are and where they’re going.

What’s essentially solved at research scale

Adapting a model to a new domain without destroying other knowledge (7B–9B models)
Surgical edits to specific facts, up to millions of sequential edits
Theoretical understanding of why forgetting happens
Long-context (1M–2M token) recall via architectural memory

What remains genuinely unsolved

Tracking facts that change in real-time streams (all current models fail this)
Per-user personalization that persists and improves without data privacy tradeoffs
Self-managed memory that improves rather than degrades over time
Learning that accumulates across sessions without a scheduled retraining run

The immediate practical implication: the current generation of “memory” features in AI products — where a system summarizes your preferences and injects them into future conversations — is built on a foundation that the research literature is now questioning. LLM-written memory notes degrade over time. Verbatim retrieval may be better.

The medium-term implication is more interesting. The fast-slow learning framework from Databricks/UCB (treat optimized prompts as “fast weights,” model parameters as “slow weights”) provides a roadmap for systems that adapt usefully to individual users and domains without the catastrophic forgetting problem. The key insight — that task-specific knowledge should be absorbed by context, not parameters — is architecturally clean and practically deployable. Expect to see it in production systems within 12-18 months.

“There’s no good reason for restricting learning to being in-context or in-weights.”
— Tiwari et al., Databricks / UC Berkeley, 2026

The Titans architecture (neural long-term memory that updates at inference time) points toward something more fundamental: blurring the line between training and inference. If a model can run small gradient updates on its memory module while simply reading your input, the distinction between “training” and “using” starts to dissolve. This is architecturally promising but has a real safety question: an AI whose weights change during inference is harder to align and audit than one that stays static.

The Bigger Picture

The Architecture of Memory That’s Emerging

Step back and look at all the successful approaches together. A pattern emerges that maps surprisingly well to how human memory actually works:

Memory type	Human analog	AI implementation	Timescale
Working memory	What you’re thinking about right now	Context window	Single session
Episodic memory	Raw memories of specific events	Verbatim episode logs (retrieval)	Persistent, no compression
Fast adaptation	Procedural memory for new tasks	Optimized prompt / fast weights	Days to weeks
Semantic memory	General knowledge, integrated over time	Long-term memory module (Titans)	Ongoing via inference-time updates
Consolidated knowledge	Deeply learned skills that are automatic	Model weights (slow training)	Months, via scheduled runs

The research suggests that trying to collapse all these into a single mechanism — as most current AI products do — is a mistake. The fact that your notes-app AI tries to summarize your preferences into a paragraph and feed it back to you next session conflates episodic and semantic memory in a way that produces mediocre results at both.

The systems that will work — the ones the research is clearly pointing toward — will maintain raw episode logs for faithful recall, use fast prompt optimization for task adaptation, and reserve parameter updates for slow periodic consolidation of genuinely durable knowledge. It’s a more complex plumbing job, but the cognitive science suggests it’s the right architecture.

Bottom Line

What to Watch For

The “frozen AI” problem has a clear solution path now. That’s new. A year ago, the field had many competing approaches and no consensus on which formulation of the problem was even tractable. The 2025-2026 literature converges on a few ideas that are robust:

Separate fast and slow learning. Task-specific knowledge shouldn’t be baked into parameters — it should live in context or lightweight fast weights that can be updated and discarded without corrupting the underlying model.
Keep raw episodes, don’t compress them. LLMs are bad at summarizing their own experience in ways that preserve fidelity. The instinct to “clean up” memory notes actually destroys information.
Track changing facts explicitly. Current models fail badly at state-tracking across long documents. This is a solvable problem but nobody has solved it yet — the OAKS benchmark results are damning across the board.
Surgical edits are now reliable at scale. For cases where you need to update a specific fact (a product name changed, a guideline was revised), model editing approaches like UltraEdit are ready for production.

The timeline for users experiencing these changes: the fast-slow framework and better episodic memory management could show up in products within a year. Architectural changes like Titans require new training runs from scratch and are at least 18-24 months from anything most people will use. The streaming knowledge-tracking problem — making an AI that genuinely keeps up with the world — is probably three to five years out.

But the direction is clear. The frozen AI isn’t a permanent condition. It’s an engineering problem with a research solution that’s now far enough along to have a plausible timeline.

Papers referenced in this piece:
Titans · In-Place TTT · Learning Fast and Slow · UltraEdit · TFGN · Self-Distillation Enables Continual Learning · OAKS · Useful Memories Become Faulty · Do Self-Evolving Agents Forget? · LightEdit
