Ant Group doesn’t get the coverage it deserves. While the open source AI conversation in the West circles around DeepSeek and Qwen, Ant Group has been quietly building a model family that competes directly with the models everyone is talking about.
Ling 2.6 is the latest. Two variants: a trillion-parameter flagship and a lean 104B flash model with 7.4B active parameters. Both MIT licensed. Both free to try on OpenRouter right now.
Most people haven’t heard of it. The benchmarks suggest they should have.
Two Models. One for Max Performance, One for Real Work.
The 1T is Ant Group’s statement model. A trillion parameters, designed for complex reasoning, long context, and the kind of multi-step agentic tasks where most models start making mistakes around step four. It’s not made for consumer hardware. This is for teams with serious infrastructure who need serious performance.
The flash is the opposite. 104B total parameters but only 7.4B active at inference time. It hits 340 tokens per second on a 4x H20 setup. It was specifically trained to reach answers faster with fewer tokens by learning when verbose chain-of-thought reasoning is actually unnecessary. Ant Group calls it “fast thinking,” and the idea is that most agent tasks don’t need the model to reason out loud for three paragraphs before doing something. The flash skips that overhead when it doesn’t need it.
Both are MIT licensed. Both are on OpenRouter with a free tier.
The agentic numbers
On BFCL-V4, which tests how reliably a model calls functions correctly across different scenarios, the flash scores 66.81. GPT-OSS-120B sits at 43.30. That’s not a close race. The 1T pushes it further to 70.64, leading that entire benchmark chart.
TAU2-Bench measures real multi-step agent execution across retail, airline, and telecom scenarios. The flash scores 76.36 overall. The 1T hits 78.36. Both lead their respective comparison charts. The flash’s telecom score alone is 94.96. Yes, really.
SWE-bench Verified, which tests whether a model can resolve actual GitHub issues in real codebases, lands at 61.20 for the flash and 72.20 for the 1T. That 72.20 leads GPT-5.4’s 69.20 in non-reasoning mode, a caveat we’ll address honestly when we get to the full benchmarks.
The pattern across all of these is consistent. Where execution matters more than raw knowledge, Ling 2.6 competes with models that get significantly more attention.
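If you want to see what those tool-calling benchmarks actually exercise at the API level, here is a minimal sketch of a function-calling request against OpenRouter’s OpenAI-compatible endpoint. The model slug and the example function are placeholders for illustration, not anything taken from the benchmark suites:

```python
# Minimal tool-calling sketch. Assumptions: the model slug below is a placeholder
# (check OpenRouter for the exact ID), and get_order_status is an invented tool
# used purely for illustration.
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_KEY",  # placeholder
)

tools = [{
    "type": "function",
    "function": {
        "name": "get_order_status",  # hypothetical tool
        "description": "Look up the status of a customer order by ID.",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
}]

resp = client.chat.completions.create(
    model="inclusionai/ling-flash",  # placeholder slug
    messages=[{"role": "user", "content": "Where is order 88231?"}],
    tools=tools,
)

# What BFCL-V4 and TAU2-Bench are effectively scoring: does the model return a
# well-formed tool call with the right arguments instead of a plain-text guess?
print(resp.choices[0].message.tool_calls)
```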
Long Context Is Finally Useful Here
Most models quietly fall apart as context gets longer. You ask them to reason across a 200K token document and they start hallucinating connections that aren’t there or losing track of information from earlier in the context. It’s a known problem and most benchmark tables politely avoid showing it.
The MRCR benchmark tests exactly this, long range context retrieval across 16K to 256K tokens. The 1T scores 80.37. DeepSeek-V3.2 scores 30.50. Kimi-K2.5 scores 63.22. GPT-5.4 scores 68.43.
That 80.37 against DeepSeek-V3.2’s 30.50 is the single most surprising number here. DeepSeek V4 Pro does score 83.5 on MRCR, slightly ahead, but the gap between Ling 2.6 1T and one of the most praised long-context models right now is small. For anyone building on long documents, large codebases, or extended agent workflows where context consistency matters across hundreds of thousands of tokens, that gap is not academic. It shows up in real work.
The flash also holds up on long context. MRCR at 75.93 against GPT-OSS-120B’s 22.56 and Nemotron Super’s 39.04. Across both variants, long context handling is clearly something Ant Group invested in specifically.
Where it’s honest about being weaker
Ling 2.6 flash is actually weak in math.
AIME26 at 73.85 sounds reasonable until you see Nemotron 3 Super at 88.59 on the same benchmark. HMMT-Feb26 at 49.29 against Nemotron’s 76.23. IMO-AnswerBench at 54.28 against Nemotron’s 79.53. These aren’t close margins.
The flash was built for agent execution and token efficiency, not competition math. That tradeoff is intentional and Ant Group is upfront about it. But if your use case involves heavy mathematical reasoning, the flash is not the right tool. The 1T handles math considerably better; its 87.40 on AIME26 puts it right at the top of its comparison chart, but then you’re back to needing serious infrastructure.
LiveCodeBench for the flash lands at 62.28, competitive but not leading. The 1T isn’t directly compared on LiveCodeBench in the data we have.
Knowing where a model breaks down is as useful as knowing where it excels. The flash is an agent execution model that happens to be good at coding. It is not a math model.
Fast thinking
Most reasoning models think out loud. They generate long chains of thought before answering, reasoning through every step explicitly. For complex problems that’s genuinely useful. For a tool call that retrieves a customer record it’s just expensive.
Ant Group trained the flash specifically to suppress that verbosity when it isn’t needed. They call it Contextual Process Redundancy Suppression which is a mouthful but the idea is simple. The model learns to recognize when fast direct answers are appropriate and when deliberate step-by-step reasoning is actually required. It doesn’t always think out loud. It thinks out loud when it should.
The result shows up in the numbers. The flash uses 15 million tokens across the full Artificial Analysis evaluation suite while delivering competitive performance. Other models in its class use significantly more. For high-frequency agent workflows where the model is making dozens of tool calls per session, that token efficiency compounds quickly into cost savings.
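To make the compounding concrete, here is a back-of-envelope sketch. Everything in it except the 15M figure is a made-up placeholder; the point is how linearly the overhead scales with extra reasoning tokens:

```python
# Back-of-envelope token cost comparison. The price and the 60M figure are
# hypothetical placeholders; only the 15M figure comes from the article above.
PRICE_PER_M_TOKENS = 0.50            # hypothetical $ per 1M output tokens
FLASH_TOKENS = 15_000_000            # reported usage across the eval suite
VERBOSE_MODEL_TOKENS = 60_000_000    # hypothetical model that reasons out loud ~4x as much

def run_cost(tokens: int, price_per_m: float) -> float:
    """Dollar cost of generating `tokens` output tokens at `price_per_m` per million."""
    return tokens / 1_000_000 * price_per_m

print(f"fast-thinking model: ${run_cost(FLASH_TOKENS, PRICE_PER_M_TOKENS):,.2f}")
print(f"verbose reasoner:    ${run_cost(VERBOSE_MODEL_TOKENS, PRICE_PER_M_TOKENS):,.2f}")
# The absolute dollars are invented; the ratio is what matters, and it applies
# to every tool call in every session your agents run.
```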
The benchmarks
The 1T and flash comparison tables use different sets of competitor models. We’ve presented them separately to avoid mixing comparisons that weren’t designed to be read together.
Ling 2.6 Flash
| Benchmark | What it tests | Ling 2.6 Flash | GPT-OSS-120B | GLM-4.5-Air |
|---|---|---|---|---|
| BFCL-V4 | Tool calling | 66.81 | 43.30 | 61.73 |
| TAU2-Bench | Multi-step agent execution | 76.36 | 23.48 | 60.92 |
| SWE-bench Verified | Real GitHub issue resolution | 61.20 | — | 57.20 |
| LiveCodeBench | Coding | 62.28 | 61.51 | 43.01 |
| MRCR 16K-256K | Long context retrieval | 75.93 | 22.56 | 30.02 |
| AIME26 | Competition math | 73.85 | 60.10 | 45.16 |
| IFBench | Instruction following | 57.40 | 58.30 | 33.60 |
Ling 2.6 1T
| Benchmark | What it tests | Ling 2.6 1T | GPT-5.4 Non-Reasoning | DeepSeek-V3.2 |
|---|---|---|---|---|
| BFCL-V4 | Tool calling | 70.64 | 58.09 | 60.05 |
| TAU2-Bench | Multi-step agent execution | 78.36 | 69.53 | 75.63 |
| SWE-bench Verified | Real GitHub issue resolution | 72.20 | 69.20 | 66.40 |
| MRCR 16K-256K | Long context retrieval | 80.37 | 68.43 | 30.50 |
| AIME26 | Competition math | 87.40 | 72.92 | 66.47 |
| IFBench | Instruction following | 57.00 | 48.40 | 49.00 |
| PinchBench Avg | Agentic coding | 87.40 | 73.40 | 89.38 |
The GPT-5.4 comparisons in the 1T table are against non-reasoning mode specifically. With reasoning enabled those margins would likely look different.
How to try it
The easiest path is OpenRouter. Both variants are available there with a free tier, so you don’t need a local setup or a GPU. Just an API key and you can start testing against your actual use cases immediately.
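A bare-bones call looks like this. This is a sketch, not official quickstart code: the model slug is a placeholder, so check OpenRouter’s model list for the exact ID.

```python
# Bare-bones request against OpenRouter's chat completions endpoint. No SDK needed.
import requests

resp = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers={"Authorization": "Bearer YOUR_OPENROUTER_KEY"},  # placeholder key
    json={
        "model": "inclusionai/ling-flash",  # placeholder slug
        "messages": [{"role": "user", "content": "Plan the tool calls needed to triage a bug report."}],
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
```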
For the flash, local deployment is realistic if you have the right hardware. It runs on 4x H20 with SGLang or vLLM. For most developers that’s still server territory, but it’s far more accessible than the 1T.
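If you go the local route, here is a minimal sketch using vLLM’s offline Python API, assuming a placeholder Hugging Face repo id and the four-GPU setup mentioned above:

```python
# Minimal vLLM offline-inference sketch. The model id is a placeholder;
# tensor_parallel_size=4 matches the 4x H20 setup described above.
from vllm import LLM, SamplingParams

llm = LLM(
    model="inclusionAI/Ling-flash",   # placeholder HF repo id
    tensor_parallel_size=4,           # shard weights across 4 GPUs
    trust_remote_code=True,           # newer MoE architectures often need this
)

params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(["List the steps to migrate a Postgres schema safely."], params)
print(outputs[0].outputs[0].text)
```

For an OpenAI-compatible server rather than offline batch inference, vLLM also ships a serving entrypoint, but the offline API is the quickest way to sanity-check that the weights load on your hardware.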
A trillion parameters needs serious infrastructure. OpenRouter is the practical answer for most people who want to evaluate it before committing to anything.
Both models are MIT licensed.
Who is this for
If you’re building agent workflows (multi-step tool calling, complex instruction following, long-running tasks), both variants are worth evaluating seriously. The agentic numbers are where Ling 2.6 genuinely earns attention and they hold up across multiple benchmarks.
If long context is a hard requirement, the 1T’s MRCR numbers put it in the same conversation as models that get significantly more coverage. That’s not a small thing if your use case involves large codebases, long documents, or extended reasoning sessions.
If you need fast, token-efficient agent execution at scale, the flash’s 340 tokens per second and 15M token efficiency profile is a serious deployment consideration that most coverage completely ignores.
If your primary need is competition math or pure reasoning benchmarks, there are better options. The flash especially is honest about that tradeoff.
Ant Group built two models for two specific problems. They didn’t try to win every benchmark. That kind of focused engineering tends to show up in production in ways that broad generalist models sometimes don’t.