
Teaching Machines to Remember

Every team building with AI hits the same wall: the model doesn’t know your code.

The instinct is to teach it. Fine-tune on your codebase, update the weights, make the model know your architecture as it knows Python’s standard library. No retrieval latency, no context-window juggling. Just a model that understands your system.

It doesn’t work.

The Theory

Sean Goedecke argues the barrier to continuous learning isn’t the mechanism; it’s automation. Weight updates at runtime use the same gradient descent as post-training. Labs already do this between releases. The machinery exists.

Reliable quality control does not. Training is noisy: same setup, different random seed, different results. Without human oversight at each step, continuous learning degrades the model rather than improving it. Fine-tuning on specific codebases has, in Goedecke’s words, “fizzled out.”

The problems compound. Prompt injection is bad; weights injection is worse. An attacker who influences training data creates persistent backdoors baked into the model, not just a single context window. Knowledge encoded in weights can’t transfer to the next model generation. Users who invest in teaching a model become locked to that checkpoint.

Goedecke’s conclusion: humans in the loop can do continuous learning today. That’s what RLHF releases already are. Remove the human and it falls apart.

The Practice

A HuggingFace thread tells the same story from a practitioner’s workbench.

A developer fine-tuned Qwen2.5 on a proprietary codebase, generating Q&A pairs from individual code files using Qwen3-235B. Training completed. The model produced “largely irrelevant” responses, even for questions drawn directly from the training data.

The diagnosis: single-file analysis fragments context. The model learned syntax but never grasped architecture: how files depend on each other, how APIs interact, how the system works. Synthetic Q&A generated by one model to train another compounded errors. Pre-training on raw code showed “no noticeable improvements.”

Two things worked:

  1. Retrieval-augmented generation. Index the code, retrieve context at query time, feed it into the prompt. The model doesn’t need to know your codebase. It needs to see the relevant parts.

  2. Human-written documentation. Internal docs, written by people who understood the system, transferred well. Auto-generated Q&A did not.
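The retrieval approach in point 1 can be sketched in a few lines. This is a hypothetical, deliberately simplified illustration, using raw token overlap where a real system would use embeddings or BM25, but the shape is the same: index the code, retrieve the most relevant files at query time, and paste them into the prompt.

```python
# Minimal RAG sketch over a codebase (illustrative only).
# Assumes a directory of .py files; scoring by token overlap is a
# stand-in for real retrieval (embeddings, BM25, etc.).
from pathlib import Path


def index_codebase(root: str) -> dict[str, set[str]]:
    """Map each source file to its set of lowercase tokens."""
    index = {}
    for path in Path(root).rglob("*.py"):
        text = path.read_text(errors="ignore")
        index[str(path)] = set(text.lower().split())
    return index


def retrieve(index: dict[str, set[str]], query: str, k: int = 3) -> list[str]:
    """Return the k files whose tokens overlap the query the most."""
    q = set(query.lower().split())
    ranked = sorted(index, key=lambda f: len(q & index[f]), reverse=True)
    return ranked[:k]


def build_prompt(index: dict[str, set[str]], query: str) -> str:
    """Feed retrieved context into the prompt, not the weights."""
    context = "\n\n".join(
        Path(f).read_text(errors="ignore") for f in retrieve(index, query)
    )
    return f"Context:\n{context}\n\nQuestion: {query}"
```

Nothing here touches the model's weights: swapping in a newer model means changing one API call, while the index carries over unchanged.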

The Pattern

Theory and practice converge:

| Goedecke (theory) | HuggingFace (practice) |
| --- | --- |
| Fine-tuning on codebases has “fizzled out” | Fine-tuning produced “largely irrelevant” results |
| Quality control on auto-learned weights is intractable | Synthetic Q&A compounded errors |
| Context-window injection is more reliable | RAG outperformed fine-tuning |
| Human curation matters | Human-written docs worked; auto-generated Q&A didn’t |

The practitioner’s journey (try fine-tuning, fail, pivot to retrieval) confirms the theory.

What This Means

Knowledge that lives outside the model beats knowledge baked into it: it transfers to the next model generation, and it can’t be poisoned into a persistent backdoor.

The tradeoff is retrieval precision. A fine-tuned model would just know things. A retrieval-augmented model needs the right context to surface at the right time. That’s a search problem, not a training problem. Search problems yield to deterministic tools.
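To make “search problems yield to deterministic tools” concrete, here is a toy inverted index. This is an illustrative sketch, not any particular library’s API; the point is that lookup is exact and repeatable — the same query always surfaces the same context, which is something you can test and tune without retraining anything.

```python
# Retrieval precision as a search problem: a tiny inverted index.
# Deterministic by construction -- identical queries return identical
# results, unlike a training run that varies with the random seed.
from collections import defaultdict


def build_inverted_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Documents containing every query token -- exact AND semantics."""
    tokens = query.lower().split()
    if not tokens:
        return set()
    results = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        results &= index.get(token, set())
    return results
```

Improving recall or ranking here is ordinary engineering — better tokenization, better scoring — with failures you can reproduce and fix.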

Invest in your retrieval infrastructure, not your training pipeline. Write good docs. Build good indexes. The model doesn’t need to learn your codebase. It needs to see it.

