back to top
HomeTechAI ModelsTrinity-Large-Thinking: the open source brain your AI agents have been missing

Trinity-Large-Thinking: the open source brain your AI agents have been missing

- Advertisement -

Most open source models that claim agentic capability are really just instruction-tuned models with tool calling bolted on. They can call a function. They cannot think across ten steps, remember what they decided three tool calls ago, and course correct when something breaks mid-task.

This is where Trinity-Large-Thinking comes into picture. Arcee AI released it this week with 398 billion total parameters, but only 13 billion active during inference. That MoE architecture means it runs closer to a 13B model in practice while carrying the knowledge of something nearly 30 times larger. And unlike most models where reasoning stops between steps, Trinity keeps its thinking tokens alive across the entire agent loop. Every decision it makes is informed by everything it reasoned through before it.

398B parameters, 13B doing the work

Mixture-of-Experts models are not new. The idea is simple. Instead of activating the entire network for every token, the model routes each token through a small subset of specialized experts. Trinity has 256 experts total. For any given token, only 4 are active plus one shared expert. That keeps inference fast and memory requirements manageable despite the massive total parameter count.

That means, you are not running a 398B model. You are running something closer to a 13B model that has access to knowledge distributed across a network nearly 30 times larger. The speed is closer to a 13B. The capability is not.

Trinity was pretrained on 17 trillion tokens, then post-trained specifically on tool-calling trajectories, multi-step agent tasks, and reasoning chains. Most models learn reasoning as a general skill and then get applied to agentic tasks. Trinity was trained on agentic tasks directly. The reasoning and the tool use were developed together, not bolted together afterward.

Context window sits at 512k tokens. For long agentic loops with deep reasoning chains that is not a footnote, it is a requirement.

The part most agentic models get wrong

Here is where most models fall apart in real agent deployments. The model reasons through step one, calls a tool, gets a result, and moves to step two. But the reasoning from step one is gone. The model sees the tool result but not the thought process that led to calling that tool in the first place. By step five or six, it is essentially starting fresh with accumulated outputs but no memory of its own decisions.

Trinity keeps its thinking tokens in context across the entire loop. Every reasoning trace wrapped in those think blocks stays in the message history. When the model reaches step six it knows not just what happened but why it made each decision along the way. That is a meaningful architectural difference not a marketing claim.

There is a practical implication here. If you are building a multi-turn agent and you strip the thinking blocks out of the history to save context, you break the model. Trinity’s documentation is explicit about this. Preserve the think blocks. If you need to truncate history, remove entire older turns rather than stripping reasoning from recent ones.

That single constraint tells you something about how seriously the reasoning integration was designed.

What the benchmarks say

Trinity does not beat Opus 4.6 across the board. That is worth saying upfront. On general reasoning, GPQA-Diamond, MMLU-Pro, and SWE-bench, Opus 4.6 is ahead. That is expected. Opus 4.6 is a frontier closed model from one of the best AI labs in the world.

Where Trinity wins is specific and intentional. On Tau2-Airline, which tests multi-step agentic task completion in real booking scenarios, Trinity scored 88.0 against Opus 4.6’s 82.0. On Tau2-Telecom it scored 94.7 against 92.1. On LiveCodeBench, a coding benchmark that tests real programming tasks rather than small problems, Trinity scored 98.2.

These are not cherry picked easy wins. Tau2 benchmarks are designed to test whether a model can complete realistic multi-step tasks without breaking down mid-loop. Beating a frontier closed model on those specific benchmarks as an open source release is a real result.

PinchBench, which measures real world agent task performance, came in at 91.9. AIME25, a hard math reasoning benchmark, scored 96.3.

All numbers are from the model card. They come from Arcee’s own evaluations so treat them as directional rather than definitive until independent benchmarks catch up.

Not for consumer grade GPUs

Let’s be straight. Trinity-Large-Thinking is not a model you spin up on a consumer GPU. 398 billion total parameters means serious infrastructure even with only 13 billion active during inference. If you were hoping to run this locally the way you might run a Gemma or a Mistral, this is not that.

The easiest way to use it today is OpenRouter. No setup, no hardware, full reasoning and tool calling support via API out of the box.

If you are running your own infrastructure, vLLM 0.11.1 or higher is the recommended path. One thing worth knowing if you are building agent loops on top of this. Do not strip the think blocks from your message history. Trinity’s reasoning is load bearing. Remove it and you degrade the model’s ability to track its own decisions across steps. If you need to trim context, remove entire older turns instead.

Who should actually use this

If you are building production agent systems and you need an open source model at the core, Trinity is the most serious option available right now. The reasoning architecture is not a feature, it is the foundation. For teams running OpenClaw or Hermes Agent it works as a drop-in backbone. For custom agent loops it is straightforward to integrate via OpenRouter today.

If you are a solo developer experimenting with agents on a budget, OpenRouter makes it accessible without infrastructure overhead. You will not feel the 398B weight at all through the API.

If you are looking for a general purpose model for everyday tasks, coding assistance, or anything that does not involve multi-step agentic workflows, Trinity is not the right tool. Qwen 3.5 and even smaller open source models will serve you better at lower cost.

Built specifically for Agents

Arcee did not try to build another general-purpose model. They built something specific and built it well. An open source model that beats Anthropic’s best Opus 4.6 on agentic benchmarks in a fair comparison is not a small thing. It will not replace frontier models for general work. But for the narrow and increasingly important job of powering AI agents, Trinity-Large-Thinking is worth taking seriously right now.

LEAVE A REPLY

Please enter your comment!
Please enter your name here

YOU MAY ALSO LIKE
Google's Next AI Bet Isn't on Chatbots. It's on Agents That Do the Work

Google’s Next AI Bet Isn’t on Chatbots. It’s on Agents That Do the Work.

0
For the last three years, Google has been playing catch-up in the chatbot race. ChatGPT arrived, Gemini followed, and the conversation quickly became about which AI could answer questions better, faster, and more accurately. Google I/O this week suggested the company is done competing on chat alone. Gemini 3.5 Flash launched Tuesday, and Google barely framed it as a conversational product. Instead, the company focused on coding pipelines, autonomous research, multi-agent coordination, and one demo that stood out across the industry: building an operating system from scratch with minimal human input. The model can reportedly operate autonomously for hours. Google says it’s up to 4× faster than other frontier models, with an optimized version reaching 12× faster speeds at similar quality.
Andrej Karpathy Is Joining Anthropic. What It Says About Where AI Is Heading

Andrej Karpathy Joined Anthropic. What It Says About Where AI Is Heading.

0
Andrej Karpathy doesn't make random career moves. He co-founded OpenAI in 2015, left to build Tesla's self-driving program, came back to OpenAI for a year, then left again in 2024 to start an AI education company. Every transition has been deliberate and every one of them has turned out to be worth paying attention to. On Tuesday he posted on X that he's joined Anthropic. "I think the next few years at the frontier of LLMs will be especially formative," he wrote. "I am very excited to join the team here and get back to R&D." The "get back to R&D" part is the signal. Karpathy has spent the last several years teaching, building, and explaining. Now he's going back to the frontier. And the specific place he's going says something about where the most important work in AI actually is right now.
Elon Musk Lost His OpenAI Lawsuit. The Jury Never Actually Decided If He Was Right

Elon Musk Lost His OpenAI Lawsuit. The Bigger Question Was Never Put to the...

0
Elon Musk spent months in a California courtroom trying to prove that Sam Altman stole a charity. He got nine jurors, weeks of testimony from some of the biggest names in Silicon Valley, and a front row seat to the most revealing airing of OpenAI's founding history ever put on public record. Then the jury came back in under two hours and told him he'd filed too late. Not that he was wrong. Not that Altman and Brockman acted properly. Just that whatever happened between them and Musk, the legal clock had already run out before he decided to do something about it. The question of whether OpenAI actually betrayed its founding mission, the question that made this case worth following in the first place never got answered.

Don’t miss any Tech Story

Subscribe To Firethering NewsLetter

You Can Unsubscribe Anytime! Read more in our privacy policy