
Ant Group’s Ling 2.6 Came Out of Nowhere and It’s Competing With GPT-5.4 on Agentic Tasks


Ant Group doesn’t get the coverage it deserves. While the open source AI conversation in the West circles around DeepSeek and Qwen, Ant Group has been quietly building a model family that competes directly with the models everyone is talking about.

Ling 2.6 is the latest. Two variants, a trillion parameter flagship and a lean 104B flash model with 7.4B active parameters. Both MIT licensed. Both free to try on OpenRouter right now.

Most people haven’t heard of it. The benchmarks suggest they should have.

Two Models. One for Max Performance, One for Real Work.

The 1T is Ant Group’s statement model. A trillion parameters, designed for complex reasoning, long context, and the kind of multi-step agentic tasks where most models start making mistakes around step four. It’s not made for consumer hardware. This is for teams with serious infrastructure who need serious performance.

The flash is the opposite. 104B total parameters but only 7.4B active at inference time. It hits 340 tokens per second on a 4x H20 setup, and it was specifically trained to reach answers with fewer tokens by learning when verbose chain-of-thought reasoning is actually unnecessary. Ant Group calls it “fast thinking,” and the idea is that most agent tasks don’t need the model to reason out loud for three paragraphs before doing something. The flash skips that overhead when it doesn’t need to.

Both are MIT licensed. Both are on OpenRouter with a free tier.

The agentic numbers

On BFCL-V4, which tests how reliably a model calls functions correctly across different scenarios, the flash scores 66.81. GPT-OSS-120B sits at 43.30. That’s not a close race. The 1T pushes it further to 70.64, leading that entire benchmark chart.
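
For a sense of what that benchmark is actually grading, here is a minimal sketch of a single function-calling task in the standard OpenAI-style tools format. The weather function, the prompt, and the arguments are made up for illustration; BFCL’s real test cases differ.

```python
import json

# A tool definition in the JSON-schema format most agent stacks use.
tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}

user_message = "Is it warm enough for a t-shirt in Madrid today?"

# What a reliable model is expected to emit: a call to the right function
# with well-formed arguments, rather than prose or hallucinated parameters.
expected_call = {
    "name": "get_weather",
    "arguments": json.dumps({"city": "Madrid", "unit": "celsius"}),
}

print(expected_call)
```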

TAU2-Bench measures real multi-step agent execution across retail, airline, and telecom scenarios. The flash scores 76.36 overall. The 1T hits 78.36. Both lead their respective comparison charts, and the flash’s telecom score in particular reaches 94.96.

SWE-bench Verified, which tests whether a model can resolve actual GitHub issues in real codebases, lands at 61.20 for the flash and 72.20 for the 1T. The 1T’s 72.20 leads GPT-5.4’s 69.20 in non-reasoning mode, which we’ll address honestly when we get to the full benchmarks.

The pattern across all of these is consistent. Where execution matters more than raw knowledge, Ling 2.6 competes with models that get significantly more attention.

Long Context Is Finally Useful Here

Most models quietly fall apart as context gets longer. You ask them to reason across a 200K token document and they start hallucinating connections that aren’t there or losing track of information from earlier in the context. It’s a known problem and most benchmark tables politely avoid showing it.

The MRCR benchmark tests exactly this, long range context retrieval across 16K to 256K tokens. The 1T scores 80.37. DeepSeek-V3.2 scores 30.50. Kimi-K2.5 scores 63.22. GPT-5.4 scores 68.43.

That 80.37 against DeepSeek-V3.2’s 30.50 is the single most surprising number in the set. DeepSeek V4 Pro does score 83.5 on MRCR, slightly ahead, but the gap between Ling 2.6 1T and one of the most praised long-context models right now is small. For anyone building on long documents, large codebases, or extended agent workflows where context consistency matters across hundreds of thousands of tokens, that gap is not academic. It shows up in real work.

The flash also holds up on long context. MRCR at 75.93 against GPT-OSS-120B’s 22.56 and Nemotron Super’s 39.04. Across both variants, long context handling is clearly something Ant Group invested in specifically.

Where it’s honest about being weaker

The Ling 2.6 flash is genuinely weak at math.

AIME26 at 73.85 sounds reasonable until you see Nemotron 3 Super at 88.59 on the same benchmark. HMMT-Feb26 at 49.29 against Nemotron’s 76.23. IMO-AnswerBench at 54.28 against Nemotron’s 79.53. These aren’t close margins.

The flash was built for agent execution and token efficiency, not competition math. That tradeoff is intentional and Ant Group is upfront about it. But if your use case involves heavy mathematical reasoning, the flash is not the right tool. The 1T handles math considerably better; its 87.40 on AIME26 puts it right at the top of its comparison chart, but then you’re back to needing serious infrastructure.

LiveCodeBench for the flash lands at 62.28, competitive but not leading. The 1T isn’t directly compared on LiveCodeBench in the data we have.

Knowing where a model breaks down is as useful as knowing where it excels. The flash is an agent execution model that happens to be good at coding. It is not a math model.


Fast thinking

Most reasoning models think out loud. They generate long chains of thought before answering, reasoning through every step explicitly. For complex problems that’s genuinely useful. For a tool call that retrieves a customer record it’s just expensive.

Ant Group trained the flash specifically to suppress that verbosity when it isn’t needed. They call it Contextual Process Redundancy Suppression, which is a mouthful, but the idea is simple. The model learns to recognize when fast, direct answers are appropriate and when deliberate step-by-step reasoning is actually required. It doesn’t always think out loud. It thinks out loud when it should.

The result shows up in the numbers. The flash uses 15 million tokens across the full Artificial Analysis evaluation suite while delivering competitive performance. Other models in its class use significantly more. For high-frequency agent workflows where the model is making dozens of tool calls per session, that token efficiency compounds quickly into cost savings.
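
As a rough illustration of how that compounds, here is a back-of-envelope sketch. The per-token price, the comparison model’s token count, and the per-turn figures are assumptions chosen only to show the shape of the math, not published numbers.

```python
# Assumed figures for illustration only: $1.00 per million output tokens, and
# a hypothetical comparison model that spends 3x the tokens on the same suite.
price_per_million = 1.00
flash_tokens = 15_000_000        # reported usage across the eval suite
comparison_tokens = 45_000_000   # assumed, for comparison

flash_cost = flash_tokens / 1_000_000 * price_per_million
comparison_cost = comparison_tokens / 1_000_000 * price_per_million
print(f"eval suite cost: ${flash_cost:.2f} vs ${comparison_cost:.2f}")

# The same ratio applied to an agent session of 50 tool-call turns, at an
# assumed 2,000 output tokens per turn for the verbose model:
verbose_per_session = 50 * 2_000
efficient_per_session = verbose_per_session / 3
print(f"tokens per session: {verbose_per_session} vs {efficient_per_session:.0f}")
```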

The benchmarks

The 1T and flash comparison tables use different sets of competitor models. We’ve presented them separately to avoid mixing comparisons that weren’t designed to be read together.

Ling 2.6 Flash

| Benchmark | What it tests | Ling 2.6 Flash | GPT-OSS-120B | GLM-4.5-Air |
|---|---|---|---|---|
| BFCL-V4 | Tool calling | 66.81 | 43.30 | 61.73 |
| TAU2-Bench | Multi-step agent execution | 76.36 | 23.48 | 60.92 |
| SWE-bench Verified | Real GitHub issue resolution | 61.20 | 57.20 | |
| LiveCodeBench | Coding | 62.28 | 61.51 | 43.01 |
| MRCR 16K-256K | Long context retrieval | 75.93 | 22.56 | 30.02 |
| AIME26 | Competition math | 73.85 | 60.10 | 45.16 |
| IFBench | Instruction following | 57.40 | 58.30 | 33.60 |

Ling 2.6 1T

| Benchmark | What it tests | Ling 2.6 1T | GPT-5.4 Non-Reasoning | DeepSeek-V3.2 |
|---|---|---|---|---|
| BFCL-V4 | Tool calling | 70.64 | 58.09 | 60.05 |
| TAU2-Bench | Multi-step agent execution | 78.36 | 69.53 | 75.63 |
| SWE-bench Verified | Real GitHub issue resolution | 72.20 | 69.20 | 66.40 |
| MRCR 16K-256K | Long context retrieval | 80.37 | 68.43 | 30.50 |
| AIME26 | Competition math | 87.40 | 72.92 | 66.47 |
| IFBench | Instruction following | 57.00 | 48.40 | 49.00 |
| PinchBench Avg | Agentic coding | 87.40 | 73.40 | 89.38 |

The GPT-5.4 comparisons in the 1T table are against non-reasoning mode specifically. With reasoning enabled those margins would likely look different.


How to try it

The easiest path is OpenRouter. Both variants are available there with a free tier, so you don’t need a local setup or a GPU. Just an API key and you can start testing against your actual use cases immediately.
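
If you want a concrete starting point, a minimal sketch looks like this. OpenRouter’s endpoint is OpenAI-compatible; the model slug below is a guess at how the flash variant might be listed, so check it against OpenRouter’s model page before using it.

```python
import os
from openai import OpenAI  # pip install openai

# OpenRouter exposes an OpenAI-compatible API; only the base_url changes.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

response = client.chat.completions.create(
    model="inclusionai/ling-flash-2.6",  # placeholder slug, verify on OpenRouter
    messages=[
        {"role": "user", "content": "Summarize the open issues in this repo and propose a fix order."},
    ],
)

print(response.choices[0].message.content)
```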

For the flash, local deployment is realistic if you have the right hardware. It runs on 4x H20 with SGLang or vLLM. For most developers that’s still server territory, but it’s far more accessible than the 1T.
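
For reference, a local run with vLLM’s offline API might look like the sketch below. The model path, parallelism setting, and flags are assumptions based on the 4x H20 figure above, not a verified deployment recipe; check the official model card for the exact configuration.

```python
from vllm import LLM, SamplingParams  # pip install vllm

# Assumed model path and settings; tensor_parallel_size=4 mirrors the
# 4x H20 setup mentioned above. Verify against the official model card.
llm = LLM(
    model="inclusionAI/Ling-flash-2.6",
    tensor_parallel_size=4,
    trust_remote_code=True,
)

params = SamplingParams(temperature=0.2, max_tokens=512)
outputs = llm.generate(
    ["List the steps to reproduce a flaky test, then propose a fix."],
    params,
)
print(outputs[0].outputs[0].text)
```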

A trillion parameters needs serious infrastructure. OpenRouter is the practical answer for most people who want to evaluate it before committing to anything.

Both models are MIT licensed.

Who is this for

If you’re building agent workflows with multi-step tool calling, complex instruction following, or long-running tasks, both variants are worth evaluating seriously. The agentic numbers are where Ling 2.6 genuinely earns attention and they hold up across multiple benchmarks.

If long context is a hard requirement, the 1T’s MRCR numbers put it in the same conversation as models that get significantly more coverage. That’s not a small thing if your use case involves large codebases, long documents, or extended reasoning sessions.

If you need fast, token-efficient agent execution at scale, the flash’s 340 tokens per second and 15M-token evaluation footprint make it a serious deployment option that most coverage completely ignores.

If your primary need is competition math or pure reasoning benchmarks, there are better options. The flash especially is honest about that tradeoff.

Ant Group built two models for two specific problems. They didn’t try to win every benchmark. That kind of focused engineering tends to show up in production in ways that broad generalist models sometimes don’t.
