
Ant Group’s Ling 2.6 Came Out of Nowhere and It’s Competing With GPT-5.4 on Agentic Tasks


Ant Group doesn’t get the coverage it deserves. While the open source AI conversation in the West circles around DeepSeek and Qwen, Ant Group has been quietly building a model family that competes directly with the models everyone is talking about.

Ling 2.6 is the latest. Two variants: a trillion-parameter flagship and a lean 104B flash model with 7.4B active parameters. Both MIT licensed. Both free to try on OpenRouter right now.

Most people haven’t heard of it. The benchmarks suggest they should have.

Two Models. One for Max Performance, One for Real Work.

The 1T is Ant Group’s statement model. A trillion parameters, designed for complex reasoning, long context, and the kind of multi-step agentic tasks where most models start making mistakes around step four. It’s not made for consumer hardware. This is for teams with serious infrastructure who need serious performance.

The flash is the opposite. 104B total parameters but only 7.4B active at inference time. It hits 340 tokens per second on a 4x H20 setup. It was specifically trained to reach answers faster with fewer tokens by learning when verbose chain-of-thought reasoning is actually unnecessary. Ant Group calls it “fast thinking,” and the idea is that most agent tasks don’t need the model to reason out loud for three paragraphs before doing something. The flash skips that overhead when it doesn’t need it.

Both are MIT licensed. Both are on OpenRouter with a free tier.

The agentic numbers

On BFCL-V4, which tests how reliably a model calls functions correctly across different scenarios, the flash scores 66.81. GPT-OSS-120B sits at 43.30. That’s not a close race. The 1T pushes it further to 70.64, leading that entire benchmark chart.
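The function calling BFCL-V4 measures is the OpenAI-style tools interface: the request exposes JSON-schema tool definitions and the model decides whether and how to call them. Here is a minimal sketch of such a request built for OpenRouter's chat completions endpoint. The model slug and the `get_order_status` tool are illustrative assumptions, not anything from Ant Group's docs.

```python
import json

# Hypothetical OpenRouter slug; check openrouter.ai's model list
# for the actual Ling 2.6 identifiers.
MODEL = "inclusionai/ling-2.6-flash"

def build_tool_call_request(user_message: str) -> dict:
    """Build a chat request exposing one callable tool, in the
    OpenAI-compatible format BFCL-style scenarios exercise."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": user_message}],
        "tools": [
            {
                "type": "function",
                "function": {
                    "name": "get_order_status",  # illustrative tool
                    "description": "Look up the status of a retail order.",
                    "parameters": {
                        "type": "object",
                        "properties": {"order_id": {"type": "string"}},
                        "required": ["order_id"],
                    },
                },
            }
        ],
        "tool_choice": "auto",  # let the model decide when to call
    }

request = build_tool_call_request("Where is order A-1042?")
print(json.dumps(request, indent=2))
```

A benchmark like BFCL-V4 then scores whether the model's `tool_calls` response names the right function with schema-valid arguments, across many such scenarios.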

TAU2-Bench measures real multi-step agent execution across retail, airline, and telecom scenarios. The flash scores 76.36 overall. The 1T hits 78.36. Both lead their respective comparison charts. The flash’s telecom score is the standout: 94.96.

SWE-bench Verified, which tests whether a model can resolve actual GitHub issues in real codebases, lands at 61.20 for the flash and 72.20 for the 1T. The 1T’s 72.20 leads GPT-5.4’s 69.20 in non-reasoning mode, which we’ll address honestly when we get to the full benchmarks.

The pattern across all of these is consistent. Where execution matters more than raw knowledge, Ling 2.6 competes with models that get significantly more attention.

Long Context Is Finally Useful Here

Most models quietly fall apart as context gets longer. You ask them to reason across a 200K token document and they start hallucinating connections that aren’t there or losing track of information from earlier in the context. It’s a known problem and most benchmark tables politely avoid showing it.

The MRCR benchmark tests exactly this, long range context retrieval across 16K to 256K tokens. The 1T scores 80.37. DeepSeek-V3.2 scores 30.50. Kimi-K2.5 scores 63.22. GPT-5.4 scores 68.43.

That 80.37 against DeepSeek-V3.2’s 30.50 is the single most surprising number here. DeepSeek V4 Pro does score 83.5 on MRCR, slightly ahead, but the gap between Ling 2.6 1T and one of the most praised long context models right now is narrow. For anyone building on long documents, large codebases, or extended agent workflows where context consistency matters across hundreds of thousands of tokens, that gap is not academic. It shows up in real work.

The flash also holds up on long context. MRCR at 75.93 against GPT-OSS-120B’s 22.56 and Nemotron Super’s 39.04. Across both variants, long context handling is clearly something Ant Group invested in specifically.
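A quick way to sanity-check long-context retrieval on your own workload, in the spirit of MRCR, is to plant a few key-value facts at random depths inside filler text and then ask the model to recall each one. This is a home-grown probe sketch, not the MRCR benchmark itself; the filler and fact format are arbitrary choices.

```python
import random

def build_long_context_probe(filler_sentences: int = 2000,
                             n_facts: int = 4, seed: int = 0):
    """Plant key-value facts at random depths in filler text and
    return (context, facts, questions) for a retrieval check."""
    rng = random.Random(seed)
    words = ["The quick brown fox jumps over the lazy dog."] * filler_sentences
    facts = {f"key-{i}": f"value-{rng.randint(1000, 9999)}"
             for i in range(n_facts)}
    for key, val in facts.items():
        pos = rng.randrange(len(words))  # random depth in the context
        words.insert(pos, f"Remember: {key} = {val}.")
    context = " ".join(words)
    questions = [f"What value was stored under {key}?" for key in facts]
    return context, facts, questions

context, facts, questions = build_long_context_probe()
print(len(context), "chars,", len(questions), "questions")
```

Feed `context` plus each question to the model and check the answer against `facts`; scaling `filler_sentences` up lets you see roughly where retrieval starts degrading.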

Where it’s honest about being weaker

The Ling 2.6 flash is genuinely weak at math.

AIME26 at 73.85 sounds reasonable until you see Nemotron 3 Super at 88.59 on the same benchmark. HMMT-Feb26 at 49.29 against Nemotron’s 76.23. IMO-AnswerBench at 54.28 against Nemotron’s 79.53. These aren’t close margins.

The flash was built for agent execution and token efficiency, not competition math. That tradeoff is intentional, and Ant Group is upfront about it. But if your use case involves heavy mathematical reasoning, the flash is not the right tool. The 1T handles math considerably better; its 87.40 on AIME26 puts it right at the top of its comparison chart, but then you’re back to needing serious infrastructure.

LiveCodeBench for the flash lands at 62.28, competitive but not leading. The 1T isn’t directly compared on LiveCodeBench in the data we have.

Knowing where a model breaks down is as useful as knowing where it excels. The flash is an agent execution model that happens to be good at coding. It is not a math model.


Fast thinking

Most reasoning models think out loud. They generate long chains of thought before answering, reasoning through every step explicitly. For complex problems that’s genuinely useful. For a tool call that retrieves a customer record it’s just expensive.

Ant Group trained the flash specifically to suppress that verbosity when it isn’t needed. They call it Contextual Process Redundancy Suppression which is a mouthful but the idea is simple. The model learns to recognize when fast direct answers are appropriate and when deliberate step-by-step reasoning is actually required. It doesn’t always think out loud. It thinks out loud when it should.

The result shows up in the numbers. The flash uses 15 million tokens across the full Artificial Analysis evaluation suite while delivering competitive performance. Other models in its class use significantly more. For high-frequency agent workflows where the model is making dozens of tool calls per session, that token efficiency compounds quickly into cost savings.
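The compounding is easy to put numbers on. The per-token prices and workload figures below are illustrative assumptions (only the article's 15M-token suite figure comes from the source); the point is that output-token volume scales linearly with how verbosely the model reasons.

```python
# Back-of-envelope output-token cost for an agent workload.
# Prices and workload sizes are illustrative assumptions.

def monthly_token_cost(tokens_per_call: int, calls_per_session: int,
                       sessions_per_day: int, price_per_mtok: float) -> float:
    """Rough monthly output-token cost, assuming a 30-day month."""
    tokens_per_day = tokens_per_call * calls_per_session * sessions_per_day
    return tokens_per_day * 30 / 1_000_000 * price_per_mtok

# A fast-thinking model answering a tool call in ~150 tokens versus a
# verbose reasoner emitting ~600 tokens of chain-of-thought first,
# at 24 tool calls per session, 500 sessions per day, $0.50/Mtok out.
lean = monthly_token_cost(150, 24, 500, price_per_mtok=0.50)
verbose = monthly_token_cost(600, 24, 500, price_per_mtok=0.50)
print(f"lean: ${lean:.2f}/mo, verbose: ${verbose:.2f}/mo")  # $27 vs $108
```

At these assumed rates the verbose reasoner costs 4x as much for the same work, and the multiple tracks the token ratio directly, whatever the actual prices are.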

The benchmarks

The 1T and flash comparison tables use different sets of competitor models. We’ve presented them separately to avoid mixing comparisons that weren’t designed to be read together.

Ling 2.6 Flash

| Benchmark | What it tests | Ling 2.6 Flash | GPT-OSS-120B | GLM-4.5-Air |
|---|---|---|---|---|
| BFCL-V4 | Tool calling | 66.81 | 43.30 | 61.73 |
| TAU2-Bench | Multi-step agent execution | 76.36 | 23.48 | 60.92 |
| SWE-bench Verified | Real GitHub issue resolution | 61.20 | 57.20 | |
| LiveCodeBench | Coding | 62.28 | 61.51 | 43.01 |
| MRCR 16K-256K | Long context retrieval | 75.93 | 22.56 | 30.02 |
| AIME26 | Competition math | 73.85 | 60.10 | 45.16 |
| IFBench | Instruction following | 57.40 | 58.30 | 33.60 |

Ling 2.6 1T

| Benchmark | What it tests | Ling 2.6 1T | GPT-5.4 Non-Reasoning | DeepSeek-V3.2 |
|---|---|---|---|---|
| BFCL-V4 | Tool calling | 70.64 | 58.09 | 60.05 |
| TAU2-Bench | Multi-step agent execution | 78.36 | 69.53 | 75.63 |
| SWE-bench Verified | Real GitHub issue resolution | 72.20 | 69.20 | 66.40 |
| MRCR 16K-256K | Long context retrieval | 80.37 | 68.43 | 30.50 |
| AIME26 | Competition math | 87.40 | 72.92 | 66.47 |
| IFBench | Instruction following | 57.00 | 48.40 | 49.00 |
| PinchBench Avg | Agentic coding | 87.40 | 73.40 | 89.38 |

The GPT-5.4 comparisons in the 1T table are against non-reasoning mode specifically. With reasoning enabled those margins would likely look different.


How to try it

The easiest path is OpenRouter. Both variants are available there with a free tier, so you don’t need a local setup or a GPU. Just an API key and you can start testing against your actual use cases immediately.
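A minimal request looks like the sketch below, using only the standard library. OpenRouter exposes an OpenAI-compatible chat completions endpoint; the model slug here is an assumption, so check OpenRouter's model list for the actual Ling 2.6 identifiers. The network call only fires when `OPENROUTER_API_KEY` is set.

```python
import json
import os
import urllib.request

MODEL = "inclusionai/ling-2.6-flash"  # hypothetical slug

def build_request(prompt: str) -> urllib.request.Request:
    """Build a chat completions request against OpenRouter's
    OpenAI-compatible endpoint."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        "https://openrouter.ai/api/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {os.environ.get('OPENROUTER_API_KEY', '')}",
            "Content-Type": "application/json",
        },
    )

if os.environ.get("OPENROUTER_API_KEY"):
    # Only hit the network when a key is configured.
    with urllib.request.urlopen(build_request("Say hello in one word.")) as resp:
        print(json.load(resp)["choices"][0]["message"]["content"])
```

Because the endpoint is OpenAI-compatible, any existing OpenAI SDK client should also work by pointing its base URL at `https://openrouter.ai/api/v1`.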

For the flash, local deployment is realistic if you have the right hardware. It runs on 4x H20 with SGLang or vLLM. For most developers that’s still server territory, but it’s far more accessible than the 1T.
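For the vLLM route, serving boils down to one CLI invocation with tensor parallelism spread across the four cards. The sketch below just composes that command; the Hugging Face repo id is an assumption, so substitute the real one from Ant Group's release.

```python
import shlex

MODEL_REPO = "inclusionAI/Ling-2.6-flash"  # hypothetical repo id

def vllm_serve_command(model: str, tp_size: int = 4, port: int = 8000) -> str:
    """Compose a `vllm serve` invocation for tensor-parallel serving."""
    args = [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tp_size),  # one shard per H20
        "--port", str(port),
    ]
    return shlex.join(args)

print(vllm_serve_command(MODEL_REPO))
```

The resulting server speaks the same OpenAI-compatible API as OpenRouter, so client code written against the hosted tier can be pointed at `http://localhost:8000/v1` unchanged.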

A trillion parameters needs serious infrastructure. OpenRouter is the practical answer for most people who want to evaluate it before committing to anything.

Both models are MIT licensed.

Who is this for

If you’re building agent workflows (multi-step tool calling, complex instruction following, long-running tasks), both variants are worth evaluating seriously. The agentic numbers are where Ling 2.6 genuinely earns attention, and they hold up across multiple benchmarks.

If long context is a hard requirement, the 1T’s MRCR numbers put it in the same conversation as models that get significantly more coverage. That’s not a small thing if your use case involves large codebases, long documents, or extended reasoning sessions.

If you need fast, token-efficient agent execution at scale, the flash’s 340 tokens per second and 15M token efficiency profile is a serious deployment consideration that most coverage completely ignores.

If your primary need is competition math or pure reasoning benchmarks, there are better options. The flash especially is honest about that tradeoff.

Ant Group built two models for two specific problems. They didn’t try to win every benchmark. That kind of focused engineering tends to show up in production in ways that broad generalist models sometimes don’t.
