Jun 24 2026

Never Loop Without Verifiers

Loops are the least surprising thing to happen all year. They are obvious, they have existed for years, and the underlying trick is not new. AutoGPT, BabyAGI, Ralph loops, auto-research scripts, and every janky bash loop with a retry condition were already pointing at the same shape.

History

In March 2023, AutoGPT took a goal, decomposed it into subtasks, and looped until it decided it was done. It became one of the fastest-growing repos in GitHub history, then almost immediately turned into a cautionary tale about autonomy without verification. Early loops wandered, hallucinated, repeated themselves, and occasionally set your OpenAI bill on fire. BabyAGI followed weeks later and hit the same wall: long-running agents compound small errors, lose the plot, and confuse motion for progress.

Ralph loops revived the idea with more discipline by bringing tests, builds, and external checks into the loop. In that sense, Codex’s /goal and Claude Code’s /loop are not magical new primitives. They are old loops wrapped in a better interface and pointed at better verification.

So why now?

So why are loops becoming popular now? For years, the AI was a brain with no body: it could plan, retry, and narrate, but it could not reliably see what happened after it acted. Then several things landed close enough together to make a classic engineer sit up and mutter, with real spiritual fatigue, “it’s so over for me.”

Eyes. The agent can now see. Multimodal models and stronger computer use let screenshots, rendered pages, CAD previews, and UI states become inputs. The loop can inspect its own output instead of guessing through text-only proxies like tests, logs, or judge-model summaries: it can open a browser, click around, check what it built, and stay on track.
Hands. Bash, MCPs, CLIs, plugins, and harnesses gave the agent reach. It can move through GitHub, Notion, Slack, terminals, browsers, and the strange internal systems every company pretends are normal.
Memory. Somewhere along the way, context windows got big enough that memory became a given. At this point, most of us have forgotten context engineering was ever a topic worth sweating.
Brain. The agent can now think. Reasoning models got better, trained in RL environments and given far more data on agentic computer use.

Proof Is the Whole Game

But fine, yes, what is a loop? A loop is a repeated cycle where an agent autonomously takes actions until it hits a verifiable goal, or until you stop it. The key word isn't autonomous and it isn't AI. It's verifiable. The heart of building an effective loop is verification.
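To make that definition concrete, here is a minimal sketch of the shape in Python. It is not how any particular tool implements its loop: `run_loop`, `act`, `verify`, and `max_iterations` are names invented for illustration, and the "agent" is a toy lambda standing in for a model call.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class LoopResult:
    output: Optional[str]  # last artifact the agent produced
    verified: bool         # True only if the verifier accepted it
    iterations: int        # how many passes actually ran


def run_loop(
    act: Callable[[Optional[str]], str],
    verify: Callable[[str], bool],
    max_iterations: int = 10,
) -> LoopResult:
    """Repeat act -> verify until the verifier passes or the cap is hit."""
    output: Optional[str] = None
    for i in range(1, max_iterations + 1):
        output = act(output)  # the agent acts, seeing its previous output
        if verify(output):    # the goal is checked outside the agent
            return LoopResult(output, True, i)
    # The cap plays the role of "until you stop it".
    return LoopResult(output, False, max_iterations)


# Toy stand-ins: the "agent" appends a character, the "verifier" checks length.
result = run_loop(
    act=lambda prev: (prev or "") + "x",
    verify=lambda out: len(out) >= 3,
)
print(result)
```

The point of the sketch is where the exit condition lives: the agent never gets to declare itself done, only the verifier or the iteration cap can end the run.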

Old verification was mostly textual and binary: did the test pass, did the benchmark clear the bar, did the judge model approve? That was clean, but it only covered tasks that could be flattened into strings. The new frontier is verification for work that used to require a human pair of eyes:

does the rendered page match the mockup down to the spacing
does the form submit when you click it
does the dropdown actually open
does the animation land or stutter
does the diff change only what it was supposed to (no regressions)
does the exported file open without errors
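Some of these checks need a multimodal model, but a few can be approximated with very plain code. As one hedged example, the "diff changes only what it was supposed to" check can be roughed out at file level by comparing changed paths against an allowlist. The function name, the paths, and the glob patterns below are all hypothetical, and a file-level check like this says nothing about regressions inside an allowed file.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List


def out_of_scope_files(
    changed: Iterable[str],
    allowed_patterns: Iterable[str],
) -> List[str]:
    """Return every changed path that matches none of the allowed glob patterns."""
    patterns = list(allowed_patterns)
    return [
        path
        for path in changed
        if not any(fnmatchcase(path, pattern) for pattern in patterns)
    ]


# Hypothetical paths, e.g. collected from `git diff --name-only`.
changed_paths = ["src/checkout/cart.py", "src/auth/login.py"]
violations = out_of_scope_files(changed_paths, ["src/checkout/*"])
print(violations)
```

An empty list means the pass stayed in its lane. Anything else is a reason to reject the iteration before looking at test results.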

Here’s an example of a loop with visual feedback for verification that does not require a human to sit there supervising every pass. My goal is to take an image of a real object and turn it into structured CAD instructions for the 3D printer, effectively cloning the object from a photo. In this case, I used Gemma 4, Google’s latest open model, running on Cerebras.

In this run, every Gemma 4 loop produced a new STEP file in about 1.2 seconds, running at roughly 1,500 tokens per second. That is fast enough for the agent to treat iteration as cheap instead of precious.
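A quick back-of-the-envelope calculation shows why that counts as cheap. It rests on two assumptions the run does not confirm: that the full 1.2 seconds is spent generating at the quoted rate, and that iterations run back to back with no time spent rendering or verifying. Under those assumptions each STEP file is roughly 1,800 tokens, and an hour holds about 3,000 generation-only iterations.

```python
# Figures quoted for the run; everything derived from them is an estimate.
SECONDS_PER_ITERATION = 1.2
TOKENS_PER_SECOND = 1_500

# Assumption: the whole 1.2 s is spent generating at the quoted rate.
tokens_per_iteration = round(SECONDS_PER_ITERATION * TOKENS_PER_SECOND)

# Assumption: iterations run back to back, with no render or verify time.
iterations_per_hour = round(3600 / SECONDS_PER_ITERATION)

print(f"~{tokens_per_iteration} tokens per STEP file")
print(f"~{iterations_per_hour} generation-only iterations per hour")
```

In practice the render and the visual check will take up part of each pass, so treat the hourly figure as an upper bound.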

Let’s break down the loop prompt that I used:

That’s a lot of words, but we can build up your intuition so you can /loop effectively!

Here is my starting image and initial Gemma 4 prompt, as well as what Gemma 4 created on the first pass.

Here’s the Gemma 4 prompt and 3D rendering output that my loop independently created roughly five loop iterations later.

And here’s a timelapse of the 3D printer creating the dumbbell. The print is smooth, structurally sound, and looks nice.

The part worth noticing is that the loop rewrote its own prompt to get there. It looked at the render, saw what was missing, and revised the instructions five times without me. That same look-compare-fix cycle doesn't care what it's pointed at. Context windows are now long enough, and verifiers good enough, that you can point it at an objective as complex as cloning an entire web app.
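The look-compare-fix cycle can be sketched in a few lines. This is a toy under stated assumptions, not the loop used for the dumbbell: `refine_prompt`, `fake_generate`, and `fake_compare` are invented names, the "render" is just the lowercased prompt, and the comparison is a substring check where a real loop would ask a multimodal model to compare the render against the photo.

```python
from typing import Callable, List, Tuple


def refine_prompt(
    prompt: str,
    generate: Callable[[str], str],
    compare: Callable[[str], List[str]],
    max_passes: int = 5,
) -> Tuple[str, List[str]]:
    """Look at the output, compare it to the target, fold the gaps into the prompt."""
    history = [prompt]
    for _ in range(max_passes):
        artifact = generate(prompt)   # look: produce and inspect the output
        missing = compare(artifact)   # compare: what does it still lack?
        if not missing:
            break                     # nothing missing, so stop revising
        prompt = prompt + "\nAlso include: " + ", ".join(missing)  # fix
        history.append(prompt)
    return prompt, history


# Hypothetical target features for the toy example.
TARGET_FEATURES = ["handle", "two weights", "rounded edges"]


def fake_generate(prompt: str) -> str:
    # Stand-in for model + renderer: the "render" shows whatever the prompt names.
    return prompt.lower()


def fake_compare(artifact: str) -> List[str]:
    # Stand-in for a visual check: report only the first missing feature per pass.
    missing = [f for f in TARGET_FEATURES if f not in artifact]
    return missing[:1]


final_prompt, history = refine_prompt(
    "A dumbbell with a handle", fake_generate, fake_compare
)
print(final_prompt)
```

Keeping `history` around is a deliberate choice: when a loop rewrites its own instructions, the trail of revisions is often the most useful thing to read afterwards.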

Where Good Loops Go Broke

But we still have a ways to go. A loop is only as good as the goal you hand it, and there are two ways to hand it a bad one.

Spiralling. The loop never learns when it is done, so it keeps going, and going, long after the work was finished, on your dime.
Cheating. The loop does exactly what you asked and nothing you wanted.

Spiralling is the symptom of a broken loop. With no definite end state, the loop has no way to know it's finished, so it keeps "improving" in circles while the token meter runs. Verification is the fix: a concrete check tells the loop when it is done, so it stops spending tokens once the goal is met instead of wandering past it.
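One way to give a loop a definite end state, even when the verifier only returns a score rather than a pass or fail, is a small guard that stops on either a token budget or a run of passes with no real improvement. The sketch below is an assumption-laden illustration: `StopGuard`, `patience`, and `min_gain` are invented names, and the scores and token counts are made up.

```python
from dataclasses import dataclass


@dataclass
class StopGuard:
    token_budget: int       # hard cap on total tokens spent
    patience: int = 3       # passes allowed without a new best score
    min_gain: float = 0.01  # smaller improvements count as no progress
    tokens_used: int = 0
    best_score: float = float("-inf")
    stale_passes: int = 0

    def should_stop(self, score: float, tokens: int) -> str:
        """Record one pass; return a stop reason, or "" to keep going."""
        self.tokens_used += tokens
        if score > self.best_score + self.min_gain:
            self.best_score = score
            self.stale_passes = 0
        else:
            self.stale_passes += 1
        if self.tokens_used >= self.token_budget:
            return "budget exhausted"
        if self.stale_passes >= self.patience:
            return "no progress"
        return ""


# Made-up verifier scores: real gains at first, then circling.
guard = StopGuard(token_budget=100_000)
reasons = [guard.should_stop(s, tokens=2_000) for s in [0.5, 0.7, 0.7, 0.705, 0.7]]
print(reasons)
```

The `min_gain` threshold is what catches "improving" in circles: a loop that nudges the score by a hair each pass still looks busy, but the guard treats it as stalled.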

Cheating is a prompting problem, and a slipperier one. Vague prompts will get gamed. Different models cheat differently and some follow instructions better than others, but the fix is always the same:

Be annoyingly specific about what counts as done.
Explicitly name the shortcuts you're forbidding before the model goes looking for them.
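Those two rules are easier to follow when the goal is written as data first and prose second. A minimal sketch, assuming a hypothetical `GoalSpec` helper and an invented CSV-export task: the completion criteria and the forbidden shortcuts are separate lists, so an empty one is hard to miss.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class GoalSpec:
    objective: str
    done_when: List[str] = field(default_factory=list)  # what counts as done
    forbidden: List[str] = field(default_factory=list)  # shortcuts named up front

    def to_prompt(self) -> str:
        """Render the spec as a plain-text block to place in the loop prompt."""
        lines = [f"Goal: {self.objective}", "", "Done only when ALL of these hold:"]
        lines += [f"- {criterion}" for criterion in self.done_when]
        lines += ["", "Off-limits:"]
        lines += [f"- {shortcut}" for shortcut in self.forbidden]
        return "\n".join(lines)


# Hypothetical task, invented for illustration.
spec = GoalSpec(
    objective="Make the CSV export work",
    done_when=[
        "the exported file opens without errors",
        "the full existing test suite still passes",
    ],
    forbidden=[
        "deleting, skipping, or editing existing tests",
        "hard-coding the expected output",
    ],
)
print(spec.to_prompt())
```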

For example, a vague prompt like

could result in the checkout test passing, but a previously fixed bug getting reverted. Instead, a better prompt might be:

Another example is training a model to perform well on a benchmark. A simple prompt like:

A smart, sneaky model might download Terminal-Bench and train on the benchmark itself. The score goes up while the actual capability does not. A better prompt explicitly marks the eval set as radioactive:

Off-limits: you may not train on the Terminal-Bench benchmark, generate
benchmark-derived data, or touch the eval set in any way.
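A prompt-level ban is worth backing with a check the loop cannot talk its way around. The sketch below flags training examples that duplicate eval examples after light normalization. It is an illustration only: `fingerprint` and `contaminated` are invented names, the example strings are made up and are not Terminal-Bench's actual format, and exact-match hashing will not catch paraphrases or benchmark-derived data.

```python
import hashlib
from typing import Iterable, List, Set


def fingerprint(text: str) -> str:
    """Hash text after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def contaminated(
    training_examples: Iterable[str],
    eval_examples: Iterable[str],
) -> List[str]:
    """Return training examples whose fingerprint also appears in the eval set."""
    eval_hashes: Set[str] = {fingerprint(e) for e in eval_examples}
    return [t for t in training_examples if fingerprint(t) in eval_hashes]


# Made-up examples for illustration.
train = ["Fix the  Failing build", "Write a log parser"]
held_out = ["fix the failing build"]
leaks = contaminated(train, held_out)
print(leaks)
```

Run as a verifier step, a non-empty result fails the iteration no matter what the benchmark score says.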

Conclusion

While the workflow isn’t new, this moment is. It is undeniably strange and exciting to go to sleep while five production-grade apps are being built by loops that can see, act, check, and try again.

Agents now have eyes, tools, verifiable end states, and enough speed to complete a hundred runs before morning. All that’s left is for you to go build one.