Most open source models that claim agentic capability are really just instruction-tuned models with tool calling bolted on. They can call a function. They cannot think across ten steps, remember what they decided three tool calls ago, and course correct when something breaks mid-task.
This is where Trinity-Large-Thinking comes into the picture. Arcee AI released it this week with 398 billion total parameters, but only 13 billion active during inference. That MoE architecture means it runs closer to a 13B model in practice while carrying the knowledge of something nearly 30 times larger. And unlike most models, where reasoning stops between steps, Trinity keeps its thinking tokens alive across the entire agent loop. Every decision it makes is informed by everything it reasoned through before.
398B parameters, 13B doing the work
Mixture-of-Experts models are not new. The idea is simple. Instead of activating the entire network for every token, the model routes each token through a small subset of specialized experts. Trinity has 256 experts total. For any given token, only 4 are active plus one shared expert. That keeps inference fast and memory requirements manageable despite the massive total parameter count.
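The routing described above can be sketched in a few lines. This is an illustrative top-k router, not Trinity's actual implementation; the 256-expert count and top-4 selection come from the paragraph, and everything else is a toy stand-in.

```python
import numpy as np

NUM_EXPERTS = 256  # routed experts, per the model description
TOP_K = 4          # experts activated per token (plus one shared expert)

def route_token(router_logits: np.ndarray, top_k: int = TOP_K) -> list[int]:
    """Return the indices of the experts this token is routed to."""
    # Highest-scoring experts win; every other expert stays inactive,
    # which is why inference cost tracks the active parameters only.
    return np.argsort(router_logits)[-top_k:][::-1].tolist()

rng = np.random.default_rng(0)
logits = rng.standard_normal(NUM_EXPERTS)  # one token's router scores
active = route_token(logits)

# The shared expert always runs, so 5 experts process this token in total.
print(len(active) + 1)
```

The point of the sketch is the asymmetry: the router scores all 256 experts, but only the top 4 (plus the shared one) actually execute, so compute per token stays near the 13B-active budget.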
That means you are not running a 398B model. You are running something closer to a 13B model with access to knowledge distributed across a network nearly 30 times larger. The speed is closer to a 13B. The capability is not.
Trinity was pretrained on 17 trillion tokens, then post-trained specifically on tool-calling trajectories, multi-step agent tasks, and reasoning chains. Most models learn reasoning as a general skill and then get applied to agentic tasks. Trinity was trained on agentic tasks directly. The reasoning and the tool use were developed together, not bolted together afterward.
Context window sits at 512k tokens. For long agentic loops with deep reasoning chains that is not a footnote, it is a requirement.
The part most agentic models get wrong
Here is where most models fall apart in real agent deployments. The model reasons through step one, calls a tool, gets a result, and moves to step two. But the reasoning from step one is gone. The model sees the tool result but not the thought process that led to calling that tool in the first place. By step five or six, it is essentially starting fresh with accumulated outputs but no memory of its own decisions.
Trinity keeps its thinking tokens in context across the entire loop. Every reasoning trace wrapped in those think blocks stays in the message history. When the model reaches step six it knows not just what happened but why it made each decision along the way. That is a meaningful architectural difference not a marketing claim.
There is a practical implication here. If you are building a multi-turn agent and you strip the thinking blocks out of the history to save context, you break the model. Trinity’s documentation is explicit about this. Preserve the think blocks. If you need to truncate history, remove entire older turns rather than stripping reasoning from recent ones.
That single constraint tells you something about how seriously the reasoning integration was designed.
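In practice that constraint means your history trimmer should drop whole turns from the front rather than editing recent messages. A minimal sketch, assuming OpenAI-style message dicts; `trim_history` is a hypothetical helper, not part of any Trinity SDK.

```python
def trim_history(messages: list[dict], max_messages: int) -> list[dict]:
    """Shrink context by dropping the OLDEST non-system turns whole.

    Never strips <think> blocks out of surviving messages -- per the
    model's documentation, recent reasoning traces must stay intact.
    """
    if len(messages) <= max_messages:
        return messages
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    keep = max_messages - len(system)
    return system + rest[-keep:]
```

The key design choice is granularity: truncation happens at turn boundaries, so whatever reasoning survives is complete, and the model is never shown a tool call whose motivating thought was deleted.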
What the benchmarks say
Trinity does not beat Opus 4.6 across the board. That is worth saying upfront. On general reasoning benchmarks such as GPQA-Diamond and MMLU-Pro, and on SWE-bench, Opus 4.6 is ahead. That is expected. Opus 4.6 is a frontier closed model from one of the best AI labs in the world.
Where Trinity wins is specific and intentional. On Tau2-Airline, which tests multi-step agentic task completion in real booking scenarios, Trinity scored 88.0 against Opus 4.6’s 82.0. On Tau2-Telecom it scored 94.7 against 92.1. On LiveCodeBench, a coding benchmark that tests real programming tasks rather than small problems, Trinity scored 98.2.
These are not cherry-picked easy wins. Tau2 benchmarks are designed to test whether a model can complete realistic multi-step tasks without breaking down mid-loop. Beating a frontier closed model on those specific benchmarks as an open source release is a real result.
PinchBench, which measures real-world agent task performance, came in at 91.9. On AIME25, a hard math reasoning benchmark, Trinity scored 96.3.
All numbers are from the model card. They come from Arcee’s own evaluations so treat them as directional rather than definitive until independent benchmarks catch up.
Not for consumer grade GPUs
Let’s be straight. Trinity-Large-Thinking is not a model you spin up on a consumer GPU. 398 billion total parameters means serious infrastructure even with only 13 billion active during inference. If you were hoping to run this locally the way you might run a Gemma or a Mistral, this is not that.
The easiest way to use it today is OpenRouter. No setup, no hardware, full reasoning and tool calling support via API out of the box.
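A call through OpenRouter's OpenAI-compatible chat completions endpoint might look like the sketch below, using only the standard library. The model slug is an assumption; check OpenRouter's catalog for the exact ID before using it.

```python
import json
import urllib.request

MODEL = "arcee-ai/trinity-large-thinking"  # assumed slug -- verify on OpenRouter
API_URL = "https://openrouter.ai/api/v1/chat/completions"

def build_request(prompt: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a chat completion request."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )

# To send it:
# resp = urllib.request.urlopen(build_request("Plan a rebooking", "sk-or-..."))
# print(json.load(resp)["choices"][0]["message"]["content"])
```

Because the endpoint speaks the OpenAI chat format, any existing agent loop built against that schema can point at OpenRouter by swapping the base URL and model name.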
If you are running your own infrastructure, vLLM 0.11.1 or higher is the recommended path. One thing worth knowing if you are building agent loops on top of this. Do not strip the think blocks from your message history. Trinity’s reasoning is load bearing. Remove it and you degrade the model’s ability to track its own decisions across steps. If you need to trim context, remove entire older turns instead.
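For self-hosting, the launch looks roughly like this. The Hugging Face repo ID and the tensor-parallel degree are assumptions; adjust both to the actual checkpoint name on Arcee's model card and to your hardware.

```shell
# Repo ID and parallelism are assumptions -- check Arcee's model card.
pip install "vllm>=0.11.1"
vllm serve arcee-ai/Trinity-Large-Thinking \
  --tensor-parallel-size 8
```

Once up, vLLM exposes an OpenAI-compatible endpoint on localhost, so the same client code used against OpenRouter works against your own deployment.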
Who should actually use this
If you are building production agent systems and you need an open source model at the core, Trinity is the most serious option available right now. The reasoning architecture is not a feature, it is the foundation. For teams running OpenClaw or Hermes Agent it works as a drop-in backbone. For custom agent loops it is straightforward to integrate via OpenRouter today.
If you are a solo developer experimenting with agents on a budget, OpenRouter makes it accessible without infrastructure overhead. You will not feel the 398B weight at all through the API.
If you are looking for a general purpose model for everyday tasks, coding assistance, or anything that does not involve multi-step agentic workflows, Trinity is not the right tool. Qwen 3.5 and even smaller open source models will serve you better at lower cost.
Built specifically for agents
Arcee did not try to build another general-purpose model. They built something specific and built it well. An open source model that beats Anthropic's Opus 4.6 on agentic benchmarks in a fair comparison is not a small thing. It will not replace frontier models for general work. But for the narrow and increasingly important job of powering AI agents, Trinity-Large-Thinking is worth taking seriously right now.