The wish, and the wall
It started with a deceptively simple ask: a “nano” coding model, trained from scratch, that runs in about 100MB of RAM and can build whole projects from a prompt. The honest answer is that those three things (tiny, from-scratch, one-shot whole projects) can't all be true at once. One-shot project generation is one of the hardest things LLMs do, and the models that do it acceptably have billions of parameters. A 50M-parameter model trained from scratch will produce code-shaped text that never holds a project together. That's not a tuning problem; it's a capacity wall.
But underneath the spec was a real goal: how does someone on a weak, offline machine get genuine AI coding help? That one is solvable. The constraints turned out to be an 8GB Intel MacBook with only ~512MB free, no reliable internet, CPU-only inference. We landed on Qwen2.5-Coder (0.5B and 1.5B, 4-bit) running under llama.cpp — about 460MB and 1GB resident respectively, generating at a glacial ~1.5–2 tokens/second on that CPU.
At that speed, “build me an app” is hopeless if you ask the model to do it directly. So we didn't.
The idea: make the loop smart so the model can be dumb
The breakthrough framing: never ask the small model to architect anything. Instead, a harness holds the project structure deterministically and asks the model to fill one tiny, testable slot at a time — “write this one function so this test passes.” That, a 1.5B model can do.
We called it NightBuilder. Each function slot carries its own self-contained test. The loop:
- generates a tiny function (hard token cap, stop sequences so it can't ramble),
- runs its test in a fresh subprocess,
- on failure, feeds the exact error back and retries,
- assembles only the verified pieces — failures become clearly-labeled stubs.
You wake up to working code plus a short TODO list, never a hung process or an empty folder. On its first real run it built 6 of 7 functions of a CLI todo app — correct, idiomatic, verified — in about seven minutes.
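The loop above can be sketched in a few dozen lines. This is a minimal illustration, not NightBuilder's actual code: `run_test`, `build_slot`, and the `generate` callable (a stand-in for the model call) are hypothetical names, and the prompt layout is an assumption.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Callable, Optional, Tuple


def run_test(code: str, test: str, timeout: float = 10.0) -> Optional[str]:
    """Run candidate code plus its test in a fresh interpreter.

    Returns None when the test passes, otherwise the error text.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "slot_test.py"
        script.write_text(code + "\n\n" + test + "\n")
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True, text=True, timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return f"timed out after {timeout} seconds"
    if proc.returncode == 0:
        return None
    return proc.stderr.strip() or "test failed with no error output"


def build_slot(
    signature: str,
    docstring: str,
    test: str,
    generate: Callable[[str], str],  # hypothetical stand-in for the model
    max_attempts: int = 3,
) -> Tuple[str, bool]:
    """Return (code, verified). A failed slot becomes a labeled stub."""
    error: Optional[str] = None
    for _ in range(max_attempts):
        prompt = f'{signature}\n    """{docstring}"""\n'
        if error is not None:
            # Feed the exact error back as comments above the slot.
            feedback = "\n".join("# " + line for line in error.splitlines())
            prompt = "# Previous attempt failed with:\n" + feedback + "\n" + prompt
        code = generate(prompt)
        error = run_test(code, test)
        if error is None:
            return code, True
    stub = f"{signature}\n    raise NotImplementedError('TODO: slot failed verification')\n"
    return stub, False
```

Because every attempt runs in its own subprocess with a timeout, a candidate that crashes or loops forever costs one attempt rather than the whole night.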
Two speed lessons worth more than a bigger model
Two findings did more for throughput than any model swap could:
- Use -t 2, not -t 4. The CPU has two physical cores; running four threads made the hyperthreads fight over the same math units. Dropping to two threads gave a 60% faster prompt pass for free.
- One stop sequence (\n\n). Without it the model rambled to the token cap after finishing a function, wasting minutes at 1 tok/s. With it, generation stopped the instant the function ended: roughly 10× fewer wasted tokens.
The biggest variable of all turned out to be background apps. Generation speed swung 4× depending on whether a browser was open. The cheapest optimization for an overnight job is an idle machine.
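As a rough sketch of the two settings above: the first helper builds a llama.cpp command line with two threads and a hard token cap, and the second stops consuming a token stream the moment the stop sequence appears. The binary name `llama-cli` and the `-m`, `-n`, and `-p` flags reflect recent llama.cpp builds as I understand them (older builds named the binary `main`). The function names, the model path, and the token stream are hypothetical.

```python
from itertools import islice
from typing import Iterable, List, Tuple


def llama_command(model_path: str, prompt: str,
                  threads: int = 2, max_tokens: int = 128) -> List[str]:
    """Build an argument list for llama.cpp's CLI (assumed flag names)."""
    return [
        "llama-cli",
        "-m", model_path,       # path to the quantized model file
        "-t", str(threads),     # match physical cores, not hyperthreads
        "-n", str(max_tokens),  # hard cap on generated tokens
        "-p", prompt,
    ]


def take_until_stop(tokens: Iterable[str], stop: str = "\n\n",
                    max_tokens: int = 128) -> Tuple[str, int]:
    """Consume streamed tokens lazily and stop at the first stop sequence.

    Returns the text before the stop sequence and the number of tokens
    actually consumed. Nothing past the stop is ever pulled from the
    stream, which is where the time saving comes from.
    """
    text = ""
    used = 0
    for piece in islice(tokens, max_tokens):
        text += piece
        used += 1
        cut = text.find(stop)
        if cut != -1:
            return text[:cut], used
    return text, used
```

One caveat of a blank-line stop sequence is that it also cuts off any function containing a blank line in its body, which is acceptable only because the slots are kept tiny.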
The real experiment: a loop that improves itself
Then the goal got more ambitious — build projects overnight, and get better at building them while it runs. The thesis to test: smarter loops beat bigger models. Instead of upgrading the model, improve the loop each iteration and keep only what measurably helps.
We built a champion/challenger meta-loop. Each tick copies the best-known configuration, mutates the copy (decompose a stuck function, tune retries, inject a learned lesson), rebuilds, and promotes the challenger only if it scores at least as well — by completion ratio, not raw count. A bad idea simply gets discarded; the champion only ever moves forward. Everything is checkpointed to a “vault” every single tick, so a crash at any moment still leaves the best result plus a full history. Crucially, the loop is never allowed to edit its own evaluator — so a bad self-edit can't blind the thing that measures success.
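A minimal sketch of that champion/challenger tick, under stated assumptions: `build` and `mutate` are hypothetical callables supplied by the harness, a configuration is a plain JSON-serializable dict, and the vault is a directory of per-tick JSON files. The scoring function sits outside anything `mutate` can reach, which is one simple way to keep the evaluator off-limits.

```python
import copy
import json
from pathlib import Path
from typing import Any, Callable, Dict

Config = Dict[str, Any]


def completion_ratio(result: Dict[str, int]) -> float:
    """Score by the fraction of slots passing, not the raw count.

    Decomposition changes how many slots exist, so raw counts are
    not comparable between configurations.
    """
    total = result["total"]
    return result["passed"] / total if total else 0.0


def meta_loop(
    champion: Config,
    build: Callable[[Config], Dict[str, int]],  # hypothetical: runs a full build
    mutate: Callable[[Config], Config],         # hypothetical: proposes one change
    vault: Path,
    ticks: int,
) -> Config:
    vault.mkdir(parents=True, exist_ok=True)
    best_score = completion_ratio(build(champion))
    for tick in range(ticks):
        # Mutate a copy so a bad idea can never damage the champion.
        challenger = mutate(copy.deepcopy(champion))
        score = completion_ratio(build(challenger))
        promoted = score >= best_score
        if promoted:
            champion, best_score = challenger, score
        # Checkpoint every tick so a crash still leaves the best result.
        record = {
            "tick": tick,
            "challenger_score": score,
            "promoted": promoted,
            "champion": champion,
            "best_score": best_score,
        }
        (vault / f"tick_{tick:04d}.json").write_text(json.dumps(record, indent=2))
    return champion
```

Promoting on ties (`>=`) lets the loop drift across equally good configurations, while a strictly worse challenger is always discarded.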
It also keeps a lessons file — distilled rules pulled from failures (“task records are dicts; use task['done'], not task.done”) — injected into every future prompt. Knowledge compounds across the night. And a slot cache means solved functions are reused for zero tokens, so every tick spends its whole budget on the unsolved frontier.
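The lessons file and slot cache could look roughly like the sketch below. All names here (`slot_key`, `build_prompt`, `SlotCache`, `solve_slot`) are hypothetical, as are the JSON cache format and the `generate` and `verify` callables, which stand in for the model call and the slot's test.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable, Dict, List, Optional


def slot_key(signature: str, docstring: str, test: str) -> str:
    """Identify a slot by its full contract, so any change invalidates it."""
    blob = "\x00".join([signature, docstring, test])
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


def build_prompt(signature: str, docstring: str, lessons: List[str]) -> str:
    """Prepend every learned lesson as a comment above the slot."""
    header = "".join(f"# Lesson: {lesson}\n" for lesson in lessons)
    return f'{header}{signature}\n    """{docstring}"""\n'


class SlotCache:
    """Verified solutions keyed by slot contract, persisted as JSON."""

    def __init__(self, path: Path) -> None:
        self.path = path
        self.entries: Dict[str, str] = (
            json.loads(path.read_text()) if path.exists() else {}
        )

    def get(self, key: str) -> Optional[str]:
        return self.entries.get(key)

    def put(self, key: str, code: str) -> None:
        self.entries[key] = code
        self.path.write_text(json.dumps(self.entries, indent=2))


def solve_slot(
    signature: str,
    docstring: str,
    test: str,
    lessons: List[str],
    cache: SlotCache,
    generate: Callable[[str], str],  # hypothetical model call
    verify: Callable[[str], bool],   # hypothetical: runs the slot's test
) -> Optional[str]:
    key = slot_key(signature, docstring, test)
    cached = cache.get(key)
    if cached is not None:
        return cached  # zero tokens spent on an already-solved slot
    code = generate(build_prompt(signature, docstring, lessons))
    if verify(code):
        cache.put(key, code)  # only verified code is ever cached
        return code
    return None
```

Keying the cache on the test as well as the signature means a tightened test forces the slot to be solved again rather than silently reusing stale code.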
Where it broke — and what that taught us
The loop hit a wall on one function: a five-branch command dispatcher. The 1.5B model thrashed, producing a different error on every attempt. More retries didn't help, because that piece was simply above the model's per-slot ceiling.
The fix is the whole thesis in miniature: don't reach for a bigger model — make the slot smaller. We added recursive decomposition. When a decomposed function still fails, split it again — the dispatcher became four one-line handlers (each ~16 tokens) plus a trivial router. Each leaf fell back below the model's ceiling, and the model nailed them first try.
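Recursive decomposition can be sketched as a short recursive function. This is an illustration under assumptions, not the harness's real code: `Slot`, `attempt` (try the model on one slot), and `split` (break a slot into handlers plus a router) are hypothetical, and the source does not say how the split itself is produced.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Slot:
    name: str
    branches: Tuple[str, ...]  # hypothetical: the cases this slot must handle


def solve(
    slot: Slot,
    attempt: Callable[[Slot], Optional[str]],  # verified code, or None on failure
    split: Callable[[Slot], List[Slot]],       # smaller slots replacing this one
    depth: int = 0,
    max_depth: int = 3,
) -> Dict[str, Optional[str]]:
    """Try a slot; on failure, split it and recurse on the pieces.

    Returns a mapping from slot name to verified code. A value of None
    marks a piece that should be emitted as a labeled stub.
    """
    code = attempt(slot)
    if code is not None:
        return {slot.name: code}
    if depth >= max_depth:
        return {slot.name: None}  # give up: below this depth, emit a stub
    parts = split(slot)
    if not parts:
        return {slot.name: None}
    results: Dict[str, Optional[str]] = {}
    for part in parts:
        results.update(solve(part, attempt, split, depth + 1, max_depth))
    return results
```

The depth limit matters: without it, a slot that can never be split small enough would recurse forever instead of ending up on the TODO list.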
The deepest finding surprised us. Every single time the model failed, the fix was a clearer contract or a smaller piece — never a cleverer retry. Vague docstrings made the model over-engineer (adding validation nobody asked for) or get an argument order wrong. The loop's real value isn't squeezing more out of the model; it's two things: it decomposes until the pieces are trivial, and it makes ambiguous specifications legible by turning them into concrete test failures you can read.
Why it matters
We set out to shrink the model and ended up redistributing the intelligence. A 1.5B model on a decade-old laptop, fully offline, can't build software — but a 1.5B model inside a loop that decomposes, verifies, learns, and never regresses can. The capability didn't come from parameters; it came from structure.
That's an encouraging direction for anyone who thinks “useful AI” requires a data center. For a large class of well-decomposable work, the lever isn't a bigger model — it's a smarter loop wrapped around a small one, doing real work on a machine you already own, with the lights off and the Wi-Fi unplugged.