back to top
HomeTechShopify's CEO Let Karpathy's AI Agent Run Overnight and Woke Up to...

Shopify’s CEO Let Karpathy’s AI Agent Run Overnight and Woke Up to a 19% Better Model

- Advertisement -

On Sunday, Shopify CEO Tobi Lütke did something most machine learning engineers spend months trying to achieve. He improved a core model’s performance by 19% while he was asleep & didn’t use a massive compute cluster or a team of researchers. He used a 630-line weekend project released by Andrej Karpathy (former co-founder of OpenAI), called autoresearch.

By the time he woke up, the agent had run 37 experiments, tested dozens of hyperparameter combinations, and handed him a 0.8B model that outperformed the 1.6B model it was meant to replace.

Karpathy’s response when he heard? “Who knew early singularity could be this fun.”

That’s the story everyone is sharing. But the more interesting story is what autoresearch actually is, how it works, and what it quietly says about where AI research is heading.

So What Actually Is Autoresearch?

Andrej Karpathy calling Autoresearchers Post-AGI

Think of it as a self-running research lab that fits on a single GPU.

Karpathy built it on top of his own nanochat training core, stripped down to just 630 lines of code. The whole thing revolves around three files. A data prep file that never gets touched. A training script that the AI agent edits freely. And a Markdown file called program.md that you write yourself.

That last one is the key. You don’t touch Python. You write plain English instructions telling the agent what to try and what to optimize for. The agent takes it from there.

Every experiment runs for exactly five minutes. If the model improves, the change gets committed. If it doesn’t, it gets rolled back automatically. Then it runs another experiment. And another. All night if you want.

At roughly 12 experiments per hour you can wake up to over 100 completed experiments and a model that’s genuinely better than the one you went to sleep with.

That’s it. No complex setup required.

The Overnight Experiment That Changed How I Think About AI Research

Lütke wasn’t doing anything exotic. Before going to bed he pointed the agent at an internal query expansion model, wrote instructions telling it to optimize for quality and speed, and let it run. Eight hours later he had a log of 37 completed experiments, each one tested, evaluated, and either kept or discarded automatically.

He described watching the agent reason through experiments as mesmerizing and said he learned more from that single overnight run than from months of following ML researchers.

That’s not a small claim. Lütke runs one of the largest e-commerce platforms in the world and has been building software for decades. When someone like that says an overnight agent run taught him more than months of conventional research, it’s worth paying attention to.

The result was a 0.8B model that outperformed the 1.6B model it was meant to replace. A smaller, faster, cheaper model — discovered automatically overnight.

Karpathy’s response when he heard: “Who knew early singularity could be this fun.”

The Ratchet: Why It Only Gets Better and Never Worse

The cleverness of autoresearch isn’t the AI part. It’s the ratchet mechanism underneath it.

Every experiment is measured using validation bits-per-byte, a metric that works regardless of model size or architecture changes. If an experiment improves that number, the change gets committed to a git branch and becomes the new baseline. If it doesn’t improve, git resets it like it never happened.

The agent never goes backwards. Every kept change builds on the last one. Over time the model only moves in one direction.

The fixed five minute training window is what makes this work cleanly. Every experiment runs for exactly the same amount of time regardless of what the agent changed — model size, batch size, architecture, optimizer settings. That makes every experiment directly comparable. No experiment gets an unfair advantage just because it ran longer.

In Karpathy’s own runs on 8x H100 GPUs, the agent completed 276 experiments over multiple days and kept 29 improvements. More importantly, the improvements discovered on a smaller depth 12 model transferred cleanly to a depth 24 model. The agent wasn’t just finding lucky shortcuts. It was finding genuinely meaningful insights.

The Real Division of Labor

This is the part many are missing about AutoResearch

The headline is “AI runs experiments automatically.” But what that actually means for the person using it is more interesting. You are no longer the one editing Python files, tweaking hyperparameters, and waiting for training runs to finish. That entire cycle gets handed off.

What you’re left with is program.md.

A Markdown file. Plain English. You write what you want the agent to try, what you want it to optimize for, what constraints matter. The cleaner and more precise your instructions, the better the agent navigates the search space. The quality of your research org is now determined by how well you can write a strategy document, not how well you can write a training loop.

Karpathy has talked about this shift before. The primary user of many systems is increasingly an LLM, not a human. Autoresearch makes that concrete. The human becomes the strategist. The agent becomes the executor.

Also Read: Just After Launching Qwen3.5, Qwen’s Core Team Walked Out. Is This the Last Great Qwen Model?

Does It Actually Work?

The Shopify result is compelling but it’s one overnight run by a non-ML engineer. Fair to ask if it holds up beyond that.

Karpathy’s own runs answer that. On the bigger nanochat system using 8x H100 GPUs, the agent ran 276 experiments across multiple days. 29 improvements were kept. Validation loss moved steadily downward across the entire run.

The more interesting result is what happened when those improvements were tested on a larger model. The insights discovered at depth 12 transferred cleanly to a depth 24 model. That matters because it means the agent wasn’t just finding lucky tricks that work at one specific scale. It was finding something more fundamental about the architecture and training setup.

Karpathy notes in the README that the codebase is allegedly in its 10,205th generation. He admits nobody could verify that at this point. The self-modifying nature of the system makes it genuinely hard to trace every change back to its origin.

That’s either exciting or unsettling depending on how you look at it. Probably both.

What You Actually Need to Run This

The honest answer is that the base project was tested on H100 GPUs. That’s enterprise hardware, not something sitting on most desks.

But the community moved fast. Within days of the release forks appeared for MacOS, Windows with consumer RTX cards, and Apple Silicon. Karpathy himself acknowledged the interest and left guidance in the README for tuning the defaults down to smaller hardware, lower model depth, smaller sequence length, reduced vocabulary size.

If you want to run the base project as Karpathy intended, you need a single NVIDIA GPU with enough VRAM to handle the training loop. For the community forks targeting consumer hardware the bar is lower but results will vary depending on your setup.

The project uses the uv package manager which keeps the setup clean. Three commands and you’re running your first experiment. The complexity is in the program.md you write, not in getting the thing installed.

If you want to try it on smaller hardware the MacOS fork by miolini and the Windows RTX fork by jsegov are the most active starting points right now.

So Is Autoresearch a Singularity?

Autoresearch isn’t the singularity. Karpathy himself would be the first to tell you that. It’s a weekend project, deliberately minimal, scoped to one GPU, one file & one metric.

But it does close a loop that researchers have been thinking about for a long time. AI being used to improve the very training code that produces better AI. That’s not theoretical anymore. It’s a GitHub repo with an MIT license and 12.8K+ stars.

The shift it points toward is worth sitting with. The researcher’s job used to be writing training code and running experiments manually. Now it’s writing program.md and letting the agent handle the cycle. The best research orgs of the next few years probably won’t be the ones with the most compute. They’ll be the ones who write the best program.md.

We aren’t at the Singularity yet. But we are officially at the point where your GPU works harder while you’re asleep than most of us do while we’re awake.

LEAVE A REPLY

Please enter your comment!
Please enter your name here

- Advertisment -
YOU MAY ALSO LIKE
Open Source AI Image Editing Models That Challenge Google Nano Banana

6 Open Source AI Image Editing Models That Challenge Google’s Nano Banana

0
When people talk about AI image editing, the same names come up. Nano Banana, GPT Image or Maybe one or two others. And they're good, no argument there. But they all have something in common. You're on their servers, their terms & some generates watermark along with your image. What if I told you the open source community has been building real alternatives? Models you can actually run on your own hardware with no watermarks & no usage limits. Some of them are hitting benchmark scores that are really impressive. A few of them you can even build on top of, fine tune, or deploy in your own products. I went through what's out there and narrowed it down to six that are genuinely worth your time
GPT-5.4 Is Outperforming Humans at Work. But the Real Story Is What OpenAI Isn't Telling You

GPT-5.4 Is Outperforming Humans at Work. But the Real Story Is What OpenAI Isn’t...

0
OpenAI dropped their latest model yesterday and buried inside the benchmarks is a number that deserves more attention than it's getting. On GDPval, a test that puts AI agents through real professional tasks across 44 actual occupations, GPT-5.4 matched or outperformed human professionals 83% of the time. The previous version sat at 71%. That's not a small jump. And this isn't GPT writing emails or summarizing documents anymore. This version can move a mouse, click buttons, fill out forms, and work across applications the way a person sitting at a desk would. It scored 75% on OSWorld, a benchmark that tests exactly that. The average office worker scores 72.4%. The model is already better at operating a computer than most people who use one for a living & 83% is just the beginning of what this release actually means.
LTX 2.3 Is Here The AI Video Generator That Runs on Your PC and Challenges Veo 3.1

LTX 2.3 Is Here: The AI Video Generator That Runs on Your PC and...

1
wo years ago, if you wanted to generate a decent AI video, the only real option was a subscription. Pick a tool, pay monthly, generate on their servers. That was just how AI video worked. Open source models eventually closed the gap on quality, but running them locally meant terminals, dependency errors, and a lot of patience. Not everyone wanted that headache. Most people didn't. But now that just changed. On March 5th, Lightricks dropped two things at once. LTX 2.3, a major upgrade to their open source video model, and LTX Desktop, a proper video editor built entirely on top of it. Its Open Source & you can install it like any other app on your computer.

Don’t miss any Tech Story

Subscribe To Firethering NewsLetter

You Can Unsubscribe Anytime! Read more in our privacy policy