
Trinity-Large-Thinking: the open source brain your AI agents have been missing


Most open source models that claim agentic capability are really just instruction-tuned models with tool calling bolted on. They can call a function. They cannot think across ten steps, remember what they decided three tool calls ago, or course-correct when something breaks mid-task.

This is where Trinity-Large-Thinking comes into the picture. Arcee AI released it this week with 398 billion total parameters, but only 13 billion active during inference. That MoE architecture means it runs closer to a 13B model in practice while carrying the knowledge of something nearly 30 times larger. And unlike most models, where reasoning stops between steps, Trinity keeps its thinking tokens alive across the entire agent loop. Every decision it makes is informed by everything it has reasoned through so far.

398B parameters, 13B doing the work

Mixture-of-Experts models are not new. The idea is simple. Instead of activating the entire network for every token, the model routes each token through a small subset of specialized experts. Trinity has 256 experts total. For any given token, only 4 are active plus one shared expert. That keeps inference fast and memory requirements manageable despite the massive total parameter count.
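The routing described above can be sketched in a few lines. This is a toy illustration of top-k expert selection plus a shared expert, not Trinity's actual router; the shared-expert index and the random logits are made-up stand-ins for a learned routing layer.

```python
# Toy sketch of MoE routing: 256 experts, 4 routed per token,
# plus one always-active shared expert. Illustrative only.
import math
import random

NUM_EXPERTS = 256
TOP_K = 4
SHARED_EXPERT = 0  # hypothetical index for the shared expert

def route_token(router_logits):
    """Pick the top-k experts by logit, then add the shared expert."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:TOP_K]
    # Softmax over only the chosen experts' logits gives mixing weights.
    exps = [math.exp(router_logits[i]) for i in chosen]
    total = sum(exps)
    weights = {i: e / total for i, e in zip(chosen, exps)}
    active = set(chosen) | {SHARED_EXPERT}
    return active, weights

random.seed(0)
logits = [random.gauss(0, 1) for _ in range(NUM_EXPERTS)]
active, weights = route_token(logits)
print(sorted(active))
```

Only the 5 or so active experts' weights touch the forward pass for this token, which is why inference cost tracks the 13B active count rather than the 398B total.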

That means you are not running a 398B model. You are running something closer to a 13B model that has access to knowledge distributed across a network nearly 30 times larger. The speed is closer to a 13B. The capability is not.

Trinity was pretrained on 17 trillion tokens, then post-trained specifically on tool-calling trajectories, multi-step agent tasks, and reasoning chains. Most models learn reasoning as a general skill and then get applied to agentic tasks. Trinity was trained on agentic tasks directly. The reasoning and the tool use were developed together, not bolted together afterward.

The context window sits at 512K tokens. For long agentic loops with deep reasoning chains, that is not a footnote; it is a requirement.

The part most agentic models get wrong

Here is where most models fall apart in real agent deployments. The model reasons through step one, calls a tool, gets a result, and moves to step two. But the reasoning from step one is gone. The model sees the tool result but not the thought process that led to calling that tool in the first place. By step five or six, it is essentially starting fresh with accumulated outputs but no memory of its own decisions.

Trinity keeps its thinking tokens in context across the entire loop. Every reasoning trace wrapped in those think blocks stays in the message history. When the model reaches step six it knows not just what happened but why it made each decision along the way. That is a meaningful architectural difference, not a marketing claim.

There is a practical implication here. If you are building a multi-turn agent and you strip the thinking blocks out of the history to save context, you break the model. Trinity’s documentation is explicit about this. Preserve the think blocks. If you need to truncate history, remove entire older turns rather than stripping reasoning from recent ones.
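In code, that truncation rule might look like the following minimal sketch. The message format and the crude character budget are assumptions for illustration, not Trinity's API; the point is the policy: drop whole old turns, never edit surviving content.

```python
# Sketch of the truncation rule above: when context runs low, drop the
# oldest complete turns instead of stripping <think> blocks from recent
# messages. Message shape and the length budget are assumptions.
def trim_history(messages, max_chars):
    """Drop whole oldest turns (after the system prompt) until the
    history fits the budget. Never rewrite message content."""
    def size(msgs):
        return sum(len(m["content"]) for m in msgs)

    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    while rest and size(system + rest) > max_chars:
        # Remove the oldest user message together with everything up
        # to (but not including) the next user message.
        rest.pop(0)
        while rest and rest[0]["role"] != "user":
            rest.pop(0)
    return system + rest
```

Recent assistant messages keep their think blocks intact, so the model's reasoning trail survives truncation.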

That single constraint tells you something about how seriously the reasoning integration was designed.

What the benchmarks say

Trinity does not beat Opus 4.6 across the board. That is worth saying upfront. On general reasoning and coding benchmarks (GPQA-Diamond, MMLU-Pro, and SWE-bench), Opus 4.6 is ahead. That is expected. Opus 4.6 is a frontier closed model from one of the best AI labs in the world.

Where Trinity wins is specific and intentional. On Tau2-Airline, which tests multi-step agentic task completion in real booking scenarios, Trinity scored 88.0 against Opus 4.6’s 82.0. On Tau2-Telecom it scored 94.7 against 92.1. On LiveCodeBench, a coding benchmark that tests real programming tasks rather than small problems, Trinity scored 98.2.

These are not cherry-picked easy wins. Tau2 benchmarks are designed to test whether a model can complete realistic multi-step tasks without breaking down mid-loop. Beating a frontier closed model on those specific benchmarks as an open source release is a real result.

PinchBench, which measures real-world agent task performance, came in at 91.9. On AIME25, a hard math reasoning benchmark, Trinity scored 96.3.

All numbers are from the model card. They come from Arcee’s own evaluations so treat them as directional rather than definitive until independent benchmarks catch up.

Not for consumer-grade GPUs

Let’s be straight. Trinity-Large-Thinking is not a model you spin up on a consumer GPU. 398 billion total parameters means serious infrastructure even with only 13 billion active during inference. If you were hoping to run this locally the way you might run a Gemma or a Mistral, this is not that.

The easiest way to use it today is OpenRouter. No setup, no hardware, full reasoning and tool calling support via API out of the box.
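A minimal sketch of what that looks like, using OpenRouter's OpenAI-compatible chat completions endpoint. The model slug `arcee-ai/trinity-large-thinking` is an assumption here; check OpenRouter's model listing for the exact ID before using it.

```python
# Build a chat completion request against OpenRouter's
# OpenAI-compatible API. The model slug is an assumed placeholder.
import json
import urllib.request

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def build_request(api_key, messages, tools=None):
    payload = {
        "model": "arcee-ai/trinity-large-thinking",  # assumed slug
        "messages": messages,
    }
    if tools:
        payload["tools"] = tools  # OpenAI-style tool schemas
    return urllib.request.Request(
        OPENROUTER_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )

# To send it:
# req = build_request(api_key, [{"role": "user", "content": "..."}])
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```

The same request shape works for the tool-calling loop: append the model's reply, including its think block, to `messages` before the next call.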

If you are running your own infrastructure, vLLM 0.11.1 or higher is the recommended path. One thing worth knowing if you are building agent loops on top of this: do not strip the think blocks from your message history. Trinity's reasoning is load-bearing. Remove it and you degrade the model's ability to track its own decisions across steps. If you need to trim context, remove entire older turns instead.

Who should actually use this

If you are building production agent systems and you need an open source model at the core, Trinity is the most serious option available right now. The reasoning architecture is not a feature, it is the foundation. For teams running OpenClaw or Hermes Agent it works as a drop-in backbone. For custom agent loops it is straightforward to integrate via OpenRouter today.

If you are a solo developer experimenting with agents on a budget, OpenRouter makes it accessible without infrastructure overhead. You will not feel the 398B weight at all through the API.

If you are looking for a general purpose model for everyday tasks, coding assistance, or anything that does not involve multi-step agentic workflows, Trinity is not the right tool. Qwen 3.5 and even smaller open source models will serve you better at lower cost.

Built specifically for agents

Arcee did not try to build another general-purpose model. They built something specific and built it well. An open source model that beats Anthropic's Opus 4.6 on agentic benchmarks in a fair comparison is not a small thing. It will not replace frontier models for general work. But for the narrow and increasingly important job of powering AI agents, Trinity-Large-Thinking is worth taking seriously right now.
