AutoTTS: Researchers Cut Inference Tokens by 70% by Letting AI Write Its Own Strategy

Researchers figured out how to make AI reason more efficiently by having AI figure it out itself. They built an environment where an AI agent writes controller code, tests it, gets feedback, and rewrites it until the strategy gets better.

The resulting strategy matches the accuracy of running 64 parallel reasoning chains while using roughly 70% fewer tokens. That’s the difference between inference being affordable and inference being a cost problem.

The research comes from a team across UMD, UVA, WUSTL, UNC, Google, and Meta. It’s called AutoTTS, short for automated test-time scaling, and it’s one of the more conceptually interesting papers published this year, even if you can’t download a model and use it tomorrow.

What it actually discovered

The standard approach to getting better answers from a reasoning model is brute force. Run the same question through the model 64 times in parallel, collect all the answers, and pick the most common one. It works, but it’s expensive: 64x the compute for every single query.
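To make the brute-force baseline concrete, here is a minimal sketch of majority voting over sampled answers. The function name and the sample answers are hypothetical, and only 8 chains are shown where the article's setup uses 64.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common final answer among the sampled chains.

    Ties go to the answer that appeared first in the list.
    """
    counts = Counter(answers)
    return counts.most_common(1)[0][0]

# Hypothetical final answers from 8 sampled reasoning chains.
sampled = ["42", "42", "17", "42", "9", "42", "17", "42"]
best = majority_vote(sampled)
```

Every chain here runs to completion before the vote, which is exactly where the cost comes from.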

AutoTTS asked a different question. Instead of running more parallel chains and hoping majority vote wins, can a system automatically discover smarter strategies for when to branch, when to keep going, when to stop, and when to cut a reasoning path that’s going nowhere?

The answer it found is called the Confidence Momentum Controller. Rather than making fixed decisions about inference (always branch this many times, always run this many steps), the CMC watches how confident the model is across its reasoning traces and makes decisions based on trends.

If confidence is rising consistently, it stops early and answers. If confidence stagnates or drops, it opens new branches and explores. If one branch is consistently agreeing with the emerging consensus, it gets more compute. If another branch keeps diverging, it gets cut, but only after persistently deviating, not on a single bad step.
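The rules above can be sketched as a toy controller. This is not the discovered CMC code from the paper: the function names, the thresholds (`stop_conf`, `min_rise`), the `window`, and the `patience` value are all assumptions made up for illustration.

```python
def momentum(history, window=3):
    """Average step-to-step change in confidence over the last `window` steps."""
    recent = history[-(window + 1):]
    if len(recent) < 2:
        return 0.0
    deltas = [b - a for a, b in zip(recent, recent[1:])]
    return sum(deltas) / len(deltas)

def decide(history, stop_conf=0.9, min_rise=0.02, window=3):
    """Return 'stop', 'branch', or 'continue' from a confidence trend.

    All thresholds are hypothetical defaults, not values from the paper.
    """
    if len(history) <= window:
        return "continue"  # not enough steps to call it a trend yet
    m = momentum(history, window)
    if m >= min_rise and history[-1] >= stop_conf:
        return "stop"      # confidence is rising and already high
    if m < min_rise:
        return "branch"    # confidence is stagnating or dropping
    return "continue"      # rising, but not confident enough to answer

def should_prune(agrees_with_consensus, patience=3):
    """Cut a branch only after `patience` consecutive disagreements."""
    recent = agrees_with_consensus[-patience:]
    return len(recent) == patience and not any(recent)

rising = decide([0.6, 0.7, 0.8, 0.92])
flat = decide([0.6, 0.6, 0.59, 0.6])
```

The `patience` check is what keeps a single bad step from killing a branch: one disagreement followed by agreement resets the streak.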

The CMC wasn’t designed by a researcher. It was written by an AI agent, tested against cached reasoning traces, evaluated on accuracy and token cost, and rewritten over multiple rounds until it stopped improving. The humans built the environment. The agent wrote the policy.

The numbers that matter

[Image: results table]

At the β = 0.5 operating point, the balanced setting between speed and accuracy, AutoTTS cuts token usage by roughly 69.5% compared to running 64 parallel chains. Accuracy on held-out benchmarks matches SC@64 (self-consistency with 64 chains) across four different Qwen3 model sizes. That is matching accuracy at about 30% of the token cost.
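A quick back-of-the-envelope check of what a 69.5% saving means. The per-chain token count below is a hypothetical placeholder, not a figure from the paper; only the 64 chains and the 69.5% reduction come from the reported results.

```python
chains = 64
tokens_per_chain = 8_000   # hypothetical average, not from the paper
reduction = 0.695          # reported saving at beta = 0.5

baseline_tokens = chains * tokens_per_chain
autotts_tokens = baseline_tokens * (1 - reduction)

# The same budget expressed as "full chains' worth" of tokens.
equivalent_chains = chains * (1 - reduction)
```

Whatever the real per-chain length is, the remaining budget works out to the tokens of roughly 19 to 20 full chains instead of 64.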

At β = 1.0, the accuracy-first setting, the discovered controller pushes peak accuracy beyond every handcrafted baseline in five of eight comparison cells across the benchmark table. More accurate and discovered automatically.

The evaluations used AIME24 for discovery and held out AIME25 and HMMT25 for testing. The controller generalized: policies discovered on one benchmark transferred to benchmarks the system never saw during search. That’s the result that matters most for anyone skeptical about whether this is just benchmark overfitting.

How it works

The system has two parts and it’s worth understanding both because the elegance is in how they fit together.

First, before any discovery happens, you collect reasoning traces offline. Run your model on a set of questions, save every reasoning path, and chunk each path into fixed-length segments. This becomes your replay store, a cached database of how the model actually reasons across thousands of problems.
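As a rough sketch of the chunking step, assuming traces are stored as lists of tokens keyed by question ID. The data layout and the names `chunk_trace` and `build_replay_store` are illustrative assumptions, not the framework's actual API.

```python
def chunk_trace(tokens, chunk_size):
    """Split one reasoning path into fixed-length segments.

    The final segment may be shorter if the path length is not a multiple
    of chunk_size.
    """
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]

def build_replay_store(traces, chunk_size=4):
    """Map each question ID to its chunked reasoning paths.

    traces: {question_id: [path, path, ...]}, where each path is a token list.
    """
    return {
        qid: [chunk_trace(path, chunk_size) for path in paths]
        for qid, paths in traces.items()
    }

# Hypothetical traces: one question with two saved reasoning paths.
traces = {"q1": [["a", "b", "c", "d", "e"], ["f", "g"]]}
store = build_replay_store(traces, chunk_size=2)
```

Once the store exists, a controller can be stepped through a path one segment at a time without calling the model again.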

Second, a Claude Code agent writes controller code. The controller decides inference strategy: when to branch, when to stop, when to probe a reasoning path, and when to cut it. The agent tests each controller against the replay store without making any new model calls. Everything is cached. A full discovery run costs $39.90 in API calls and takes 160 minutes. The replay store does the heavy lifting.

Each round, the agent gets back accuracy numbers, token costs, and detailed traces of exactly what the controller did on each problem, including where it branched, where it stopped, and where it was wrong. It uses that feedback to rewrite the controller. Repeat for several rounds until the objective stops improving.
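The loop above can be sketched in a few lines. Everything here is a simplified stand-in: the agent is a stub that walks a list of thresholds instead of an LLM rewriting code, the objective (accuracy minus a token penalty) and its `token_weight` are assumptions, and the replay data is invented.

```python
def evaluate(should_stop, replay):
    """Replay cached steps through a controller; no model calls are made.

    replay: list of problems, each a non-empty list of
    (confidence, answer_is_correct_if_stopped_here, token_cost) steps.
    """
    correct, tokens = 0, 0
    for steps in replay:
        history = []
        for conf, is_correct, cost in steps:
            history.append(conf)
            tokens += cost
            if should_stop(history):
                break
        correct += is_correct  # correctness at the last consumed step
    return correct / len(replay), tokens / len(replay)

def discover(propose, replay, token_weight=0.001, max_rounds=10):
    """Ask for controllers until the objective stops improving."""
    best_ctrl, best_score, feedback = None, float("-inf"), None
    for _ in range(max_rounds):
        ctrl = propose(feedback)  # the agent sees last round's feedback
        acc, cost = evaluate(ctrl, replay)
        score = acc - token_weight * cost  # hypothetical objective
        feedback = {"accuracy": acc, "tokens": cost, "score": score}
        if score <= best_score:
            break
        best_ctrl, best_score = ctrl, score
    return best_ctrl, best_score

def make_stub_agent(thresholds):
    """Stand-in for the coding agent: tries fixed stop thresholds in order."""
    it = iter(thresholds)
    def propose(feedback):
        t = next(it)
        return lambda history: history[-1] >= t
    return propose

# Invented replay data: two problems, three cached steps each.
replay = [
    [(0.5, False, 100), (0.92, True, 100), (0.95, True, 100)],
    [(0.6, False, 100), (0.7, False, 100), (0.92, True, 100)],
]
best_ctrl, best_score = discover(make_stub_agent([0.99, 0.9, 0.8, 0.5]), replay)
```

In this toy run the first controller never stops early, the second saves tokens on one problem at the same accuracy, and the third brings no further gain, so the loop ends and keeps the second.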

It’s just an agent writing increasingly better Python code in a feedback loop. The offline collection runs once per model and benchmark. After that, discovery is cheap enough that a research team can run multiple experiments in a day.

Who can actually use this today

You cannot download AutoTTS and plug it into your deployment tomorrow. There are no pretrained weights that give you a 70% token reduction on your existing system. What exists is a research framework, a replay environment, and the discovered controller code.

To actually use this, you need to collect your own offline reasoning traces from your target model: thousands of cached responses across your benchmark of interest. You need Claude Code running with API access for the discovery loop. You need engineering time to set up the replay environment and evaluate results. The $39.90 discovery cost assumes all of that infrastructure is already in place.

For ML researchers and inference engineers at teams actively working on test-time compute, this is immediately actionable. The framework is open, and the discovered CMC controller ships with the repo, so it can be evaluated on their own replay data without running discovery at all.

For most developers building on top of frontier models via API, this is a paper to understand and watch. The ideas will show up in production inference systems probably sooner than you’d expect, but that work hasn’t happened yet.

When the Strategy Writes Itself

Every inference efficiency gain so far has come from humans designing better strategies. AutoTTS is the first published system where the strategy design itself is automated and the discovered strategy beats the hand-designed ones.

It means inference efficiency research scales with compute and replay data rather than with how many researchers can think carefully about the problem.

This is early. One paper, one controller, two held-out benchmarks. But the question AutoTTS is asking, whether AI systems can automatically discover better ways to use their own intelligence, is one of the most important questions in the field right now. This paper has a credible answer for at least a narrow version of it.

Worth watching even if you can’t use it today.
