
MiMo-V2.5-Pro: A Coding Model Taking On Claude Opus 4.6 and GPT-5.4


Peking University gives its computer science students a compiler project every semester: build a complete SysY compiler in Rust, including a lexer, parser, abstract syntax tree, IR code generation, an assembly backend, and performance optimization. The whole thing. Students typically need several weeks.

MiMo-V2.5-Pro finished it in 4.3 hours. Perfect score. 233 out of 233 tests passed on a hidden test suite it had never seen. That’s a real university project and a model that scored higher than most students who spent weeks on it. Xiaomi built this, which is still a sentence that takes a moment to process.

V2.5-Pro is the next step up from MiMo-V2-Flash. It's closed source for now, but Xiaomi has confirmed open source is coming for the V2.5 series. What V2.5-Pro adds over Flash is meaningful: better long-horizon coherence, stronger agentic capabilities, and the ability to sustain complex tasks across more than a thousand tool calls without losing the thread.

That’s not a benchmark row. That’s a story. And it’s the most honest way to explain what Xiaomi thinks it has built here.

Three things it built while nobody was watching

The compiler story is the most dramatic but it’s not alone.

After the compiler, Xiaomi gave it a vaguer prompt: build a video editor. No detailed spec, nothing specific. What came back after 11.5 hours and 1,868 tool calls was a working desktop application with a multi-track timeline, clip trimming, crossfades, audio mixing, and an export pipeline. The final codebase was 8,192 lines: a working product built start to finish while the humans presumably went home.

The third test went somewhere most coding benchmarks don't touch: a graduate-level analog circuit design task, specifically a flipped-voltage-follower low-dropout regulator in a TSMC 180nm process. This is the kind of work that takes trained analog engineers several days. MiMo-V2.5-Pro was wired into an ngspice simulation loop: it called the simulator, read the waveforms, adjusted parameters, and iterated. About an hour later every target metric was met. Line regulation improved 22-fold over its own initial attempt; load regulation improved 17-fold.
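The loop described above (run the simulator, read the metrics, adjust parameters, repeat) can be sketched in miniature. Everything below is illustrative: `run_sim` is a mock quadratic model standing in for an ngspice call, and the greedy search is a toy stand-in for whatever strategy the model actually used.

```python
def run_sim(bias):
    """Mock 'simulator': returns line regulation (mV/V), best near bias=1.2."""
    return 0.5 + 8.0 * (bias - 1.2) ** 2

def tune(target=0.6, bias=0.0, step=0.4, max_iters=50):
    """Greedy search: nudge the parameter, keep improvements, shrink the step."""
    best = run_sim(bias)
    for _ in range(max_iters):
        if best <= target:
            break  # target metric met, stop iterating
        for candidate in (bias + step, bias - step):
            metric = run_sim(candidate)
            if metric < best:
                bias, best = candidate, metric
                break
        else:
            step /= 2  # no improvement in either direction: refine the step

    return bias, best

bias, metric = tune()
print(f"bias={bias:.3f}, line regulation={metric:.3f} mV/V")
```

The point of the sketch is the shape of the workflow, not the numbers: each iteration is one simulator call followed by a decision, which is exactly the kind of closed loop the agent ran against ngspice.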

What connects all three isn’t just capability. It’s discipline. The compiler run hit a regression at turn 512: a refactoring pass broke two tests. The model caught it, diagnosed the failure, and recovered without being told to. That kind of self-correction across hundreds of tool calls is what separates a model that can code from one that can actually finish something.

What the benchmarks say

| Benchmark | MiMo-V2.5-Pro | Claude Opus 4.6 | GPT-5.4 | Gemini 3.1 Pro |
| --- | --- | --- | --- | --- |
| SWE-Bench Pro | 57.2 | 57.3 | 57.7 | 54.2 |
| MiMo Coding Bench | 73.7 | 77.1 | 75.1 | 67.8 |
| Claw-Eval pass^3 | 63.8 | 70.4 | 60.3 | 57.8 |
| HLE | 48.0 | 53.0 | 58.7 | 51.4 |

These are self-reported numbers from Xiaomi, but the pattern they show is coherent. On SWE-Bench Pro it sits within half a point of Claude Opus 4.6 and GPT-5.4. On Claw-Eval it beats GPT-5.4 and Gemini 3.1 Pro. Where it trails is HLE, the knowledge-heavy reasoning test where the frontier closed models still have a clear lead. This is a coding-first model, and the benchmarks reflect that accurately.

One thing worth noting: MiMo Coding Bench is Xiaomi’s own internal evaluation suite, so read those numbers with that in mind.


The token efficiency nobody is talking about

Benchmark scores get all the attention. Token costs are what actually determine whether developers adopt a model in production.

On Claw-Eval, MiMo-V2.5-Pro hits roughly 64% pass^3 using roughly 70,000 tokens per trajectory. Xiaomi claims that’s 40-60% fewer tokens than Claude Opus 4.6, Gemini 3.1 Pro, and GPT-5.4 need to reach comparable scores. These are self-reported figures, so treat them directionally, but if accurate the cost difference at production scale is significant.

That gap compounds fast. If you’re running hundreds of agentic tasks a day, the difference between 70K tokens and 120K tokens per trajectory isn’t marginal. It’s the difference between a workflow that’s economically viable and one that isn’t.
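To make that concrete, here is the back-of-envelope arithmetic. The per-token price and the daily volume below are placeholders for illustration, not published rates or measured workloads.

```python
# Back-of-envelope cost comparison for the trajectory sizes mentioned above.
PRICE_PER_M = 5.00  # hypothetical blended $ per 1M tokens, NOT a real price

def daily_cost(tokens_per_trajectory, trajectories_per_day=500):
    """Dollars per day at a given trajectory size and volume."""
    return tokens_per_trajectory * trajectories_per_day * PRICE_PER_M / 1_000_000

efficient = daily_cost(70_000)   # ~70K tokens per trajectory
baseline = daily_cost(120_000)   # ~120K tokens per trajectory
print(f"${efficient:.2f}/day vs ${baseline:.2f}/day "
      f"({100 * (1 - efficient / baseline):.0f}% saved)")
```

Whatever the real prices turn out to be, the saving scales linearly with both token count and volume, which is why the gap matters more at production scale than in a one-off demo.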

This is the part of MiMo-V2.5-Pro that doesn’t show up in the headline comparisons but matters most to anyone building on top of it.

How to try it and what open source means

MiMo-V2.5-Pro is in public beta on Xiaomi’s API Platform and AI Studio right now. No waitlist. Switch the model tag to mimo-v2.5-pro and you’re using it. Pricing hasn’t changed from V2-Pro.
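If the platform follows the common OpenAI-style chat-completions convention, which this article does not confirm, switching models really is just a matter of the tag. A minimal payload sketch, with the prompt content purely illustrative:

```python
# Hypothetical request payload, assuming an OpenAI-style chat-completions
# API. Endpoint and auth are whatever Xiaomi's platform actually documents.
import json

payload = {
    "model": "mimo-v2.5-pro",  # the beta model tag mentioned above
    "messages": [
        {"role": "user", "content": "Write a minimal SysY lexer in Rust."}
    ],
}
body = json.dumps(payload)
print(body)
```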

It works with agentic scaffolds: Claude Code, OpenCode, and Kilo are all supported. For complex long-horizon tasks Xiaomi recommends pairing it with a proper harness, which the compiler and video editor demos made obvious.

On open source, Xiaomi confirmed the V2.5 series will be released publicly. No specific date. We’ll flag it when weights drop. Until then the API is the only way in, and honestly for most use cases that’s the practical route anyway.

Who should care

Developers running agentic coding workflows at any volume. The token efficiency story alone makes it worth evaluating against what you’re currently using.

Anyone building complex software autonomously: the compiler and video editor demos are evidence of something real about long-horizon coherence that most models still struggle with.

And if you’ve been waiting for a frontier-competitive open source coding model, watch this one. The weights aren’t here yet but Xiaomi has a track record of following through. When they drop, this will be one of the more capable open models available for coding work.
