TLDR: We let an AI agent run overnight. By morning, it had abandoned our experiment and started its own. Across 71 experiments on two very different problems (training optimization and model compression), we learned that autoresearch can reliably surface real findings when the loop is tightly scoped. Loosen the guardrails, and the agent drifts within hours. The bottleneck isn't intelligence. It's everything around it.
Everything we built/ran is open-source:
- codex-autoresearch-harness: a Bash wrapper that forces Codex into a research loop with built-in A/B testing (Experiment 1)
- reap-expert-swap: expert pruning plus dynamic swapping to fit Kimi-k2.5 in BF16 (2.5 TB) onto 8× RTX 3090s (Experiment 2)
We left an AI agent running overnight on two research experiments. When we checked in the next morning, it had stopped doing what we asked. Instead of optimizing memory usage, it had gone off on its own side quest, investigating how few model weights you actually need to maintain performance. Twelve hours of compute, pointed in the wrong direction.

That experience captures both sides of autoresearch right now: it's powerful enough to surface real findings autonomously, and undisciplined enough to waste a full night of GPU time if you're not watching.
Andrej Karpathy's Autoresearch
Karpathy released autoresearch last week, a framework for letting AI agents run experiments autonomously. Give it a goal, some metrics to track, and a codebase.
Let it propose changes, test them, keep or revert, repeat. He ran it for two days on Nanochat, a small model matching GPT-2 performance trained on ~$100 of compute. Codex managed to shave 10% off compute time through its own research.
The setup is minimal, just three files: the research program, the results log, and train.py itself.
Each iteration, the agent reads the program and results log, proposes a change to train.py, commits it, trains, and evaluates. A change is accepted only if it hits better-or-equal metrics. Everything else gets reverted via git reset.
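In pseudocode terms, the gate looks like this (a minimal Python sketch, not Karpathy's actual code; `propose` and `train_and_eval` are hypothetical stand-ins, and we assume a lower-is-better metric like validation loss):

```python
import subprocess

def run(cmd):
    """Run a shell command, raising on failure."""
    subprocess.run(cmd, shell=True, check=True)

def gated_iteration(propose, train_and_eval, best_metric):
    """One autoresearch step: commit a proposed change, keep it only
    if the evaluated metric is better-or-equal, else revert via git."""
    propose()                       # agent edits train.py
    run("git add -A && git commit -m 'proposal'")
    metric = train_and_eval()       # e.g. validation loss (lower is better)
    if metric <= best_metric:       # strict better-or-equal gate
        return metric               # keep the commit
    run("git reset --hard HEAD~1")  # everything else gets reverted
    return best_metric
```

The strictness of that gate is what keeps the loop honest: a proposal either moves the metric or leaves no trace in the repo.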
We wanted to know: does this hold up beyond the demo? We ran two experiments pushing autoresearch into different territory, one on training, one on inference, to find out :)
What We Ran
Experiment 1: Training optimization
We wrapped Codex in a bash loop (codex-autoresearch-harness) and A/B tested two different LLMs as the "researcher", GPT-5.4 and Codex-Spark, on Karpathy's nanochat model. Codex doesn't natively support looping, so we built a harness to force it. Our question: can autoresearch actually do autonomous research, and how do different researcher models behave and shape the final model?
Experiment 2: Inference optimization
We pointed autoresearch at a completely different problem: fitting Kimi-k2.5 in BF16, a 2.5 TB model, onto $8K of consumer GPUs (reap-expert-swap). Our question: can agents figure out how to compress a model without destroying its ability to reason?
Here's what happened.
Experiment 1: Making Codex Loop
Karpathy noted that Codex doesn't really work with autoresearch: codex exec runs once and exits. There's no built-in loop mode. So we built one.
The Harness
codex-autoresearch-harness puts codex exec inside a bash loop with built-in A/B testing and one experiment per call. Each iteration reads the program and results from disk, proposes a code change, commits, trains for 5 minutes, evaluates, keeps or reverts, logs to TSV, and exits. Bash catches the exit and starts a fresh call.
For A/B testing, launch_ab.sh runs Model A for N hours, then Model B for N hours from the same baseline, and compare_results.sh generates a side-by-side analysis.
One experiment per call was a deliberate architectural choice: it prevents the context window from overflowing (training output is verbose), gives clean error recovery (a crash just means the next iteration starts fresh), and guarantees state separation through git history and the persistent log.
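The driver logic amounts to this (sketched in Python for readability; the actual harness is bash, and the codex flags and filenames here are illustrative placeholders, not the tool's real interface):

```python
import csv
import subprocess
import time

def drive(iterations, log_path="results.tsv", timeout_s=600):
    """Outer harness loop: each iteration is a fresh `codex exec` call
    that runs exactly one experiment and exits; the driver logs the
    outcome to TSV and restarts with a clean process."""
    for i in range(iterations):
        start = time.time()
        # Fresh process per experiment: the context window can't overflow
        # on verbose training output, and a crash only costs one iteration.
        proc = subprocess.run(
            ["codex", "exec", "--prompt-file", "research_prompt.md"],
            capture_output=True, timeout=timeout_s,
        )
        with open(log_path, "a", newline="") as f:
            csv.writer(f, delimiter="\t").writerow(
                [i, proc.returncode, round(time.time() - start, 1)])
```

State lives entirely in git history and the TSV log, so any iteration can die without corrupting the run.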
Setup Pitfalls
Some annoying things happened along the way: the sandbox blocked GPU access (CUDA couldn't see the card), the package manager broke in isolation, and Codex ignored the API key environment variable, so we piped the key in manually.
Every one of these is the same gap: current sandboxing treats GPU access and long processes as threats. Autoresearch treats them as requirements.
Results: Convergent Discovery and Proposal Quality
Each model got 6 hours on the same H100, starting from the same baseline.
The headline results: both models independently discovered the same optimization, and proposal quality mattered far more than inference speed. Specifically, both converged on learning rate warmdown scheduling as the primary lever. GPT-5.4 systematically hill-climbed the warmdown ratio from 0.5 to 0.95; Spark found the same strategy but with messier proposals. That convergence suggests the search landscape for small training improvements has real structure: different agents find the same peaks.
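For concreteness, a warmdown schedule looks roughly like this (a sketch; nanochat's actual scheduler may differ in shape and hyperparameters):

```python
def lr_multiplier(step, total_steps, warmdown_ratio=0.95):
    """Learning-rate multiplier with a linear 'warmdown' tail:
    hold the base LR, then decay linearly to zero over the final
    `warmdown_ratio` fraction of training. This ratio is the knob
    GPT-5.4 hill-climbed from 0.5 toward 0.95."""
    warmdown_start = int(total_steps * (1 - warmdown_ratio))
    if step < warmdown_start:
        return 1.0                       # constant phase
    remaining = total_steps - step
    warmdown_len = total_steps - warmdown_start
    return max(remaining / warmdown_len, 0.0)  # linear decay to 0
```

A larger ratio starts the decay earlier, trading peak-LR steps for a longer, gentler landing.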
The accept rates tell a sharper story. GPT-5.4 accepted 67% of its proposals; Spark accepted 17%. Not because Spark proposed bad ideas: it proposed more ideas, many of them bold. But each rejected proposal still burns 5 minutes of GPU time. Spark's speed advantage (~35 seconds faster per call) let it attempt nearly twice as many proposals, but most of them became wasted training runs.
When each experiment costs real compute, proposal quality dominates total cost. A model that proposes fewer, better-targeted changes outperforms one that explores broadly, at least against a strict gate.
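A back-of-envelope calculation makes this concrete. The proposal counts below are assumptions, chosen only to be consistent with the observed accept rates and the ~5-minute cost per run:

```python
def wasted_gpu_minutes(proposals, accept_rate, minutes_per_run=5):
    """Every rejected proposal still consumes a full training run."""
    rejected = proposals * (1 - accept_rate)
    return rejected * minutes_per_run

# Assumed proposal counts for a 6-hour window:
gpt_waste = wasted_gpu_minutes(proposals=12, accept_rate=0.67)    # ~20 min
spark_waste = wasted_gpu_minutes(proposals=29, accept_rate=0.17)  # ~2 hours
```

Under these assumptions the slower, pickier model wastes roughly a sixth as much GPU time, which is the whole argument for proposal quality over call speed.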
Experiment 2: Shrinking a Giant Model onto Consumer GPUs
The second experiment (reap-expert-swap) tackled a different problem: running a massive AI model on hardware that's way too small for it.
The target model needs ~2.5 TB of VRAM. Our hardware: 8× RTX 3090s with 192 GB total. About 13× too small. In dollar terms, ~$300K of GPUs versus ~$8K.
Our approach had two phases: first, compress the model permanently to fit on our hardware. Then, see if we could go further by loading and unloading parts of the model dynamically based on what it actually needs.
Phase 1: Static Compression, Which Worked (no autoresearch)
Using REAP (expert pruning from Cerebras) plus INT4 quantization, we compressed the model from 717 GB down to 92 GB. That's 7.8× compression, enough to fit on our consumer GPUs.
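As a sanity check on those numbers: INT4 from BF16 gives a 4× size reduction on its own, so expert pruning must supply roughly the remaining ~2×. A deliberately simplified model (it ignores that attention and shared weights aren't pruned, so the real keep fraction differs):

```python
def compressed_size_gb(base_gb, expert_keep_frac, bits_after,
                       bits_before=16, expert_frac_of_params=1.0):
    """Rough size after pruning a fraction of experts and quantizing.
    Simplification: treats all parameters as prunable and quantized;
    real MoE models also carry unpruned attention/shared weights."""
    kept = base_gb * (expert_frac_of_params * expert_keep_frac
                      + (1 - expert_frac_of_params))
    return kept * bits_after / bits_before

# 717 GB -> ~92 GB is ~7.8x total; with INT4 contributing 4x,
# pruning to roughly half the experts covers the rest.
size = compressed_size_gb(717, expert_keep_frac=0.51, bits_after=4)
```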
For context, GLM-4.7 in its original BF16 would require 4 B300s at ~$20/hour. GLM-4.7-Reap fits with full context on just 1 H200, 10× less cost.
Static compression is a permanent decision, you pick which parts of the model to keep, and that's it forever. The question: could you do better by choosing which parts to use dynamically based on what the model actually needs moment-to-moment?
Phase 2: Dynamic Expert Swapping (using autoresearch)
We profiled how the model routes tokens across its "experts", specialized sub-networks that each handle different types of input. The results were wild: only about 7.6% of experts per layer carry 50% of the routing traffic. This means that out of 256 experts per layer, only ~19 are needed to cover half the tokens. The full model is huge, but the active slice at any given moment is small.
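The profiling analysis behind that number can be sketched like this, given per-expert token counts for one layer (a simplification of our actual profiler):

```python
import numpy as np

def experts_for_coverage(routing_counts, coverage=0.5):
    """Smallest number of experts whose combined routing traffic
    reaches `coverage` of all tokens in a layer."""
    sorted_counts = np.sort(routing_counts)[::-1]     # hottest first
    cum = np.cumsum(sorted_counts) / sorted_counts.sum()
    return int(np.searchsorted(cum, coverage) + 1)
```

Run per layer over a profiling corpus, this is how you'd arrive at a figure like "~19 of 256 experts cover half the traffic."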
Instead of permanently deleting experts, we built a system to swap them in and out dynamically, like a smart cache. Load what you need, unload what you don't.
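A minimal sketch of the caching idea, with LRU eviction standing in for whatever policy you'd actually tune (the `load_fn` callback is a placeholder for the real host-to-GPU copy):

```python
from collections import OrderedDict

class ExpertCache:
    """LRU-style cache over expert weights: keep hot experts resident
    on GPU, evict the least-recently-used when capacity is exceeded."""
    def __init__(self, capacity, load_fn):
        self.capacity = capacity
        self.load_fn = load_fn          # stand-in for host-to-device copy
        self.resident = OrderedDict()   # insertion order tracks recency

    def get(self, expert_id):
        if expert_id in self.resident:
            self.resident.move_to_end(expert_id)   # mark recently used
            return self.resident[expert_id]
        weights = self.load_fn(expert_id)          # swap in from CPU/disk
        self.resident[expert_id] = weights
        if len(self.resident) > self.capacity:
            self.resident.popitem(last=False)      # evict coldest expert
        return weights
```

Because routing traffic is so concentrated, even a small resident set should hit the cache for most tokens; the open question autoresearch was meant to answer is which experts to keep warm, and when.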
This idea is picking up steam across the community (llama.cpp, vLLM). The swap performance was promising: transitioning between two configurations took only 0.151 seconds, so swaps are sub-second after the initial setup. For this step, we used autoresearch to test whether an autonomous agent could figure out which experts to swap when.
Autonomous Search: 19 Experiments, and What They Actually Taught Us
We set an aggressive target: run at only 20% of the model's memory footprint. GPT-5.4-Pro, GPT-5.4, and Codex-Spark ran as the research workers, proposing configurations, testing, and logging results.
Out of 19 major experiments, the best result was 38% retained accuracy. The reason: the agent had drifted. Instead of sticking to our original objective, it decided on its own to answer a different question: how little of the model do you actually need to maintain 95%+ accuracy? It tested this by masking experts that were still resident in memory, so the model was losing intelligence without any actual memory savings.
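To see why masking saves no memory: masking an expert only removes it from routing, while its weights stay resident. A simplified sketch of top-k routing with a mask (the real router uses learned logits and softmax weighting; this strips it to the argsort):

```python
import numpy as np

def route(router_logits, masked_experts, top_k=8):
    """'Masking' an expert just drops it from routing: its logit is
    forced to -inf so it never receives tokens. The expert's weights
    remain allocated, so accuracy can fall with zero memory saved."""
    logits = router_logits.copy()
    logits[list(masked_experts)] = -np.inf   # exclude from selection
    return np.argsort(logits)[::-1][:top_k]  # top-k surviving experts
```

Useful for probing how much of the model matters, but it measures capability loss, not the memory objective we asked for.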
The finding itself was useful (roughly 37% of experts cover 95% of use cases), but it polluted the repo context and sent every subsequent iteration farther off track.
After 12 hours unattended, the experiments had drifted even more. What fixed it: clearing the environment of distracting context, creating clean isolated directories for each experiment, adding stricter and more frequent validation checkpoints, and actively re-steering the agent's focus.
The real lesson from Experiment 2 wasn't about the agent succeeding or failing at compression. It was that the infrastructure and task framing determined whether the agent explored productively or spiraled. A tightly scoped environment with strict validation (Phase 1) produced clean results. A loosely scoped environment with infrequent check-ins (the autonomous search) produced drift within hours.
What Both Experiments Tell Us
Strip away the specifics and both experiments followed the same pattern: define a measurable objective, give the agent the code that controls it, enforce a strict gate, log everything, repeat. When we did that well, the results were useful. When we loosened the guardrails, the agent drifted.
Three things stood out across both:
Different agents converge on the same answers. In Experiment 1, two models independently discovered learning rate warmdown scheduling. That convergence suggests autoresearch is finding real structure in the search landscape, not noise.
Proposal quality dominates total cost. GPT-5.4 wasted 20 minutes of GPU time on rejected proposals. Spark wasted 2 hours. When each experiment costs real compute, a model that proposes fewer, better-targeted changes wins, even if it's slower per call.
Environment design matters more than model choice. The same agents that produced clean, convergent results in Experiment 1 (tightly scoped, one experiment per call, strict gate) drifted badly in Experiment 2 when given a loose objective and infrequent validation. The infrastructure and task framing determined whether the agent explored productively or spiraled.
And it's spreading beyond training. Shopify's CEO pointed autoresearch at a query-expansion model and woke up to a 0.8B model scoring 19% higher than their previous 1.6B after 37 experiments. Others have aimed it at quantitative finance, marketing A/B tests, and prediction models. autoresearch@home is already coordinating 95+ agents across the internet, running 2,600+ experiments collectively. Anything with a measurable metric is fair game.
The Gap Is Tooling, Not Intelligence
We spent more time debugging sandbox permissions than running actual experiments. CUDA couldn't see our GPU. The package manager broke in isolation. We piped the API key in manually because Codex ignores the environment variable.
None of that had anything to do with research. And that's kind of the point. The agents found real results when we got out of their way. The methodology works. The research loop is sound. What's missing is the infrastructure to run it without babysitting.
But it's moving fast. autoresearch@home launched a week ago and already has 95 agents running 2,600+ experiments, reading each other's results, pivoting strategies in real time. By day 3 they'd moved past hyperparameter tuning into architectural breakthroughs. That's the beginning of what Karpathy described: a whole research community of agents working in parallel.
We ran 71 experiments with bash scripts and tmux. We're excited for the next version of tools and workflows that won't need us watching a terminal at 2am :)
Code: codex-autoresearch-harness · reap-expert-swap · Built on Karpathy's autoresearch