The Kilburn boundary
Tom Kilburn + LLMs
Tom Kilburn ran the first program stored in electronic memory on the Manchester Baby in 1948. The machine’s working memory was so small that the program succeeded only because the algorithm was written to keep the working set tiny: it advanced step by step, never needing to “hold” very much state at once.
We spend hundreds of millions training language models with hundreds of billions of parameters. The default story is linear: more parameters, more data, more compute, more intelligence. For hard tasks, that’s broadly right. But there’s a different lever that matters more than we admit: shrinking the working set of cognition so the model only has to do one kind of mental operation at a time.
Call it cognitive paging: a design pattern where you externalize intermediate cognitive state and swap it in and out across model calls, so limited-capacity models can reliably execute workflows that would otherwise exceed their effective “working memory.”
Scaffolding as a working-set reduction
Take a small model, say 1–3B parameters, quantized so it fits on a laptop. Ask it to analyze a codebase and fix failing tests in one shot. It tends to fail in a familiar way: hallucinated function names, confident edits to files that don’t exist, incorrect assumptions carried forward because the model can’t keep enough constraints in mind at once.
Now constrain the working set. Don’t ask it to plan, execute, and self-correct in a single generation. Make those phases explicit, and make state explicit too:
    plan = model.generate_plan(prompt)
    steps = decompose(plan)
    results = []
    for step in steps:
        out = model.run(step)
        results.append(out)
    critique = model.evaluate(results)
    final = model.synthesize(results, critique)

The weights haven’t changed. The “intelligence” hasn’t increased. But the behavior changes anyway, because you’ve shifted the bottleneck.
In the one-shot version, the model must simultaneously:
decide what to do,
do it,
notice when it went wrong,
recover without losing track of the original objective.
That’s a large working set.
In the scaffolded version, the model only has to be decent at each operation in isolation, while the orchestrator holds the continuity.
That’s the trick: cognitive paging reduces simultaneous constraint load and enables tighter control loops.
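To make the control flow concrete, here is a minimal runnable sketch of the scaffolded loop. Everything in it is an assumption for illustration: `StubModel`, its methods, `TaskState`, and `run_scaffolded` are hypothetical names, and the stub returns canned strings so the loop can be exercised without any real model. The point it shows is only that continuity sits in the orchestrator's state, while each model call sees a small slice of it.

```python
from __future__ import annotations

from dataclasses import dataclass, field


class StubModel:
    """Hypothetical stand-in for a small local model; returns canned text."""

    def generate_plan(self, prompt: str) -> str:
        return "read the error\nlocate the source\napply the fix"

    def run(self, step: str, context: str) -> str:
        return f"done: {step}"

    def evaluate(self, results: list[str]) -> str:
        return "ok" if all(r.startswith("done") for r in results) else "retry"


@dataclass
class TaskState:
    """Continuity lives here, outside the model."""
    objective: str
    results: list[str] = field(default_factory=list)


def run_scaffolded(model, prompt: str) -> tuple[TaskState, str]:
    state = TaskState(objective=prompt)
    steps = model.generate_plan(prompt).splitlines()
    for step in steps:
        # Each call gets only the previous result (or the objective on the
        # first step), not the full transcript: a deliberately small working set.
        context = state.results[-1] if state.results else state.objective
        state.results.append(model.run(step, context))
    critique = model.evaluate(state.results)
    return state, critique


state, critique = run_scaffolded(StubModel(), "fix failing tests")
print(len(state.results), critique)  # 3 ok
```

Swapping `StubModel` for a real client would change the quality of each step, not the shape of the loop.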
This sits inside an existing family of orchestration patterns
This idea appears in planner-executor-critic loops, ReAct-style reasoning with tool use, Reflexion-style self-critique, and Tree-of-Thought-style branching search. What I’m isolating is a specific design constraint: explicitly minimizing the model’s simultaneous working set while externalizing invariants and state continuity.
The novelty, if any, is the discipline about what must live outside the model.
A minimal definition:
Inside the model: local transformations (generate a plan, execute a step, critique an output).
Outside the model: state continuity (what has been tried, what succeeded, what failed, what must remain true).
The paging operation itself: compress, store, and retrieve state so the model never has to carry the full task context at once.
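As a rough illustration of the store and retrieve halves of that paging operation, the sketch below keeps intermediate records outside the model and pages in only the ones a given call needs, up to a size budget. `PageStore`, `page_out`, `page_in`, and the character budget are all hypothetical; compression is deliberately left out here.

```python
import json


class PageStore:
    """Hypothetical external store for intermediate task state."""

    def __init__(self):
        self._pages = {}

    def page_out(self, key, record):
        # Serialize on the way out so what is stored is exactly what comes back.
        self._pages[key] = json.dumps(record, sort_keys=True)

    def page_in(self, keys, budget_chars=200):
        """Return the requested pages in order, skipping any that are
        missing or that would exceed the character budget."""
        context, used = [], 0
        for key in keys:
            page = self._pages.get(key)
            if page is None or used + len(page) > budget_chars:
                continue
            context.append(json.loads(page))
            used += len(page)
        return context


store = PageStore()
store.page_out("step-1", {"step": "read error", "result": "NameError in parse()"})
store.page_out("step-2", {"step": "locate source", "result": "src/parser.py"})

# The next model call only needs the location, not the whole history.
print(store.page_in(["step-2"]))
```

A character budget is a crude proxy for a token budget; a real orchestrator would presumably count tokens with the model's own tokenizer.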
Once you view it this way, the orchestrator is the real brain of the operation. It maintains invariants over time: “don’t edit nonexistent files,” “don’t regress passing tests,” “keep changes minimal,” “verify before claiming success.”
Large models learn to maintain some of those invariants implicitly. Small models don’t. Paging makes the invariants explicit.
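One way to make such invariants explicit is to check them in ordinary code after every proposed edit, rather than asking the model to remember them. The sketch below covers two of the invariants named above; `violated_invariants` and its arguments are hypothetical, and a real check would read the file system and test runner output instead of taking sets.

```python
def violated_invariants(edited_path, known_files, passing_before, passing_after):
    """Return human-readable violations; an empty list means the edit is acceptable."""
    violations = []
    if edited_path not in known_files:
        violations.append(f"edit targets nonexistent file: {edited_path}")
    regressed = sorted(set(passing_before) - set(passing_after))
    if regressed:
        violations.append("regressed passing tests: " + ", ".join(regressed))
    return violations


known_files = {"src/app.py", "src/util.py"}

# An edit to a real file that keeps test_a green and fixes test_b.
ok = violated_invariants("src/app.py", known_files, {"test_a"}, {"test_a", "test_b"})

# An edit to a hallucinated file that also breaks test_a.
bad = violated_invariants("src/helpers.py", known_files, {"test_a"}, set())

print(ok)   # []
print(bad)
```

The orchestrator can reject or roll back any edit with a non-empty result before the model ever sees the next step.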
Where it helps, and where it doesn’t
Paging helps when the limiting factor is cognitive load rather than raw capability.
A lot of tasks come down to procedural follow-through:
read an error,
locate the source,
apply a mechanical fix,
rerun the test,
iterate.
In that regime, you want a worker who doesn’t lose their place.
Paging keeps the worker from losing their place.
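The loop above can be sketched as a bounded retry driver. This is an assumption-laden toy, not a real harness: `fix_until_green`, `run_tests`, and `apply_fix` are hypothetical, and the "codebase" is just a list of known bugs so the iteration can be run end to end.

```python
from typing import Callable, Optional


def fix_until_green(
    run_tests: Callable[[], Optional[str]],
    apply_fix: Callable[[str], None],
    max_iterations: int = 5,
) -> bool:
    """Rerun tests after each fix; stop on green or when the budget runs out."""
    for _ in range(max_iterations):
        error = run_tests()      # None means all tests pass
        if error is None:
            return True
        apply_fix(error)         # one mechanical fix per iteration
    # Budget exhausted: report whatever the final test run says.
    return run_tests() is None


# Toy stand-ins: a "codebase" with two known bugs.
bugs = ["NameError in parse()", "off-by-one in paginate()"]


def run_tests() -> Optional[str]:
    return bugs[0] if bugs else None


def apply_fix(error: str) -> None:
    bugs.remove(error)


print(fix_until_green(run_tests, apply_fix))  # True
```

The iteration cap matters as much as the loop: it is the orchestrator, not the model, that decides when to stop trying.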
But the failure mode is equally clear: if the model’s individual steps are low quality, decomposing the workflow doesn’t fix that; it multiplies it. A mediocre plan feeds mediocre execution, which feeds mediocre critique. Architectural structure can reduce cognitive load, but it can’t manufacture understanding.
This is why paging feels like it “shouldn’t work” and then does, until it doesn’t.
Lossy state
Paging introduces its own tax: compression.
If your orchestrator stores “memory” by repeatedly summarizing what happened, you get a predictable failure: the state becomes a cartoon.
    def compress_memory(state):
        summary = model.summarize(state["history"])
        state["memory"] = summary
        state["history"] = []
        return state

Summaries preserve facts reasonably well. They destroy nuance aggressively.
After a few cycles, the model is no longer reasoning about the original problem. It’s reasoning about a simplified, smoothed, and sometimes subtly wrong representation of it.
This is the core trade:
Paging reduces working set requirements.
Paging increases the risk of lossy state drift.
If you want paging to work beyond trivial tasks, you can’t treat memory as “a summary.” You need memory that preserves constraints and invariants, not narrative.
In other words: calling the model multiple times is easy. Designing the state is where you actually have to think.
For example, instead of storing “memory” as a running summary, store a structured constraints ledger:
    state = {
        "objective": "All tests pass",
        "invariants": [
            "Do not edit files outside /src",
            "Do not modify public API signatures",
            "Do not regress passing tests"
        ],
        "attempt_log": [],
        "verified_facts": [],
    }

Each step appends to attempt_log with inputs, diffs, and test results. Invariants are re-checked mechanically after every modification. Verified facts are promoted only after being confirmed by tool output, not model assertion.
That structure preserves constraints and decisions explicitly. It avoids the failure mode where repeated summarization turns the task into a softened narrative. The model performs local transformations; the ledger preserves truth conditions.
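A sketch of how the ledger might be updated, under stated assumptions: `record_attempt`, `regressed_tests`, and `promote_fact` are hypothetical helpers, test results are passed in as plain sets, and "confirmed by tool output" is modeled as a caller-supplied predicate over the tool's raw text.

```python
from typing import Callable

state = {
    "objective": "All tests pass",
    "invariants": ["Do not regress passing tests"],
    "attempt_log": [],
    "verified_facts": [],
}


def record_attempt(state: dict, step: str, diff: str, passed: set, failed: set) -> None:
    """Append a full, unsummarized record of one attempt."""
    state["attempt_log"].append(
        {"step": step, "diff": diff, "passed": sorted(passed), "failed": sorted(failed)}
    )


def regressed_tests(state: dict) -> list:
    """Mechanical check: tests that passed in the previous attempt but not the latest."""
    log = state["attempt_log"]
    if len(log) < 2:
        return []
    return sorted(set(log[-2]["passed"]) - set(log[-1]["passed"]))


def promote_fact(state: dict, claim: str, tool_output: str,
                 confirms: Callable[[str], bool]) -> bool:
    """Promote a claim only if the tool output, not the model, confirms it."""
    if not confirms(tool_output):
        return False
    state["verified_facts"].append({"claim": claim, "evidence": tool_output})
    return True


record_attempt(state, "fix parse()", "+ return tokens", {"test_parse"}, {"test_paginate"})
record_attempt(state, "fix paginate()", "- i <= n\n+ i < n", {"test_paginate"}, {"test_parse"})

print(regressed_tests(state))  # ['test_parse']
print(promote_fact(state, "test_paginate passes", "test_paginate PASSED",
                   lambda out: "test_paginate PASSED" in out))  # True
```

Nothing here is summarized: the log grows, but each entry keeps its diff and its evidence, so a later step can be checked against what actually happened.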
Kilburn’s boundary
A rough version of what I think is actually true:
There is a boundary between tasks where scaffolding buys you more than scaling, and tasks where scaling is non-negotiable.
This is the Kilburn boundary: the line between problems that are mostly procedural and problems that are representation-heavy.
On the procedural side, success is about not losing the thread: sequencing, checking, iterating, verifying.
On the representation-heavy side, success depends on the richness of internal models: deeper domain knowledge and better judgment under ambiguity.
Kilburn’s program worked with microscopic memory because the procedure was simple and the state needed at any moment was small. Much of day-to-day software maintenance lives closer to that side than we like to admit: running tests, following stack traces, applying mechanical edits, keeping diffs tight, repeating until green.
Paging can push a small model surprisingly far there, because the orchestrator carries the continuity and the model provides local competence.
But there are tasks where the boundary asserts itself immediately: novel design work, tricky refactors, API reasoning across large surfaces, anything that depends on subtle semantic constraints. In those cases, you can’t scaffold your way into depth. You need a bigger model.
In a small set of coding tasks, mostly “fix failing tests” style work, paging reliably improved results for a small local model. It didn’t turn it into a frontier model. It shifted the failure distribution.
Where the workflow was straightforward and success depended on diligence, the scaffold helped: fewer hallucinated file edits, more consistent iteration, more “verify then claim.” Where the task required creative leaps or deep contextual judgment, the scaffold mostly made the model fail more neatly: plans that looked structurally correct but were substantively shallow, like a student who learned the format of a lab report without understanding the experiment.
That’s the practical takeaway: paging changes what small models are useful for, not what they are.
So what’s actually being asked?
Small models don’t replace big ones. Nobody’s arguing that.
The better question: how much of real work is gated by representation depth, and how much is gated by working-set management? And related: how far can you push the Kilburn boundary with better state design, with constraint-preserving memory, tighter invariants, more rigorous verification loops?
The Manchester Baby’s memory could fit in a QR code. Kilburn still made it run real computation by respecting the working set.
We’re in a similar moment with models: the constraint is what forces you to build the coordination layer that should have existed all along.

