Peking University gives its computer science students a compiler project every semester: build a complete SysY compiler in Rust, including lexer, parser, abstract syntax tree, IR code generation, assembly backend, and performance optimization. The whole thing. Students typically need several weeks.
MiMo-V2.5-Pro finished it in 4.3 hours. Perfect score. 233 out of 233 tests passed on a hidden test suite it had never seen. That’s a real university project and a model that scored higher than most students who spent weeks on it. Xiaomi built this, which is still a sentence that takes a moment to process.
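To give a sense of what the first stage of such a project involves, here is a toy lexer sketch. SysY is a C subset; this is in Python for brevity (the actual project is in Rust) and handles only a few token kinds, so it is an illustration, not a piece of the real assignment.

```python
import re

# Ordered token patterns: earlier alternatives win on ties.
TOKEN_SPEC = [
    ("INT",   r"\d+"),            # integer literals
    ("IDENT", r"[A-Za-z_]\w*"),   # identifiers (keywords not split out here)
    ("OP",    r"[+\-*/=;(){}]"),  # single-char operators and punctuation
    ("SKIP",  r"\s+"),            # whitespace, discarded
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def lex(src):
    """Yield (kind, text) pairs, skipping whitespace."""
    for m in TOKEN_RE.finditer(src):
        if m.lastgroup != "SKIP":
            yield (m.lastgroup, m.group())

tokens = list(lex("int x = 42;"))
```

The real project layers a parser, AST, IR generation, and a register-allocating backend on top of this stage, which is why it normally takes weeks.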
V2.5-Pro is the next step up from MiMo-V2-Flash. It's closed source for now, but Xiaomi has confirmed open source is coming for the V2.5 series. What V2.5-Pro adds over Flash is meaningful: better long-horizon coherence, stronger agentic capabilities, and the ability to sustain complex tasks across more than a thousand tool calls without losing the thread.
That’s not a benchmark row. That’s a story. And it’s the most honest way to explain what Xiaomi thinks it has built here.
Three things it built while nobody was watching
The compiler story is the most dramatic but it’s not alone.
After the compiler, Xiaomi gave it a vaguer prompt: build a video editor. No detailed spec, nothing specific. What came back after 11.5 hours and 1,868 tool calls was a working desktop application with a multi-track timeline, clip trimming, crossfades, audio mixing, and an export pipeline. The final codebase was 8,192 lines. A working product built start to finish while the humans presumably went home.
The third test went somewhere most coding benchmarks don't touch: a graduate-level analog circuit design task, specifically a Flipped-Voltage-Follower low-dropout regulator in a TSMC 180nm process. This is the kind of work that takes trained analog engineers several days. MiMo-V2.5-Pro was wired into an ngspice simulation loop: it called the simulator, read the waveforms, adjusted parameters, and iterated. About an hour later every target metric was met. Line regulation improved 22 times over its own initial attempt; load regulation improved 17 times.
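The simulate-evaluate-adjust loop described above can be sketched in miniature. A toy quadratic objective stands in for ngspice here, and the finite-difference update rule is purely illustrative; the article doesn't say how the model actually chose its parameter adjustments.

```python
def simulate(params):
    """Stand-in for an ngspice run: returns an error score to minimize.
    The targets (0.8, 12.0) are made-up stand-ins for circuit specs."""
    bias, width = params
    return (bias - 0.8) ** 2 + (width - 12.0) ** 2

def tune(params, steps=200, lr=0.1, eps=1e-4):
    """Closed-loop tuning: simulate, check targets, nudge parameters."""
    for _ in range(steps):
        score = simulate(params)
        if score < 1e-6:  # all target metrics met, stop iterating
            break
        # Estimate the gradient per parameter by finite differences
        grads = []
        for i in range(len(params)):
            bumped = list(params)
            bumped[i] += eps
            grads.append((simulate(bumped) - score) / eps)
        params = [p - lr * g for p, g in zip(params, grads)]
    return params

best = tune([0.0, 0.0])
```

The point of the loop is the same as in the demo: the simulator is the only source of truth, and the agent converges by reading its output rather than by knowing the answer in advance.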
What connects all three isn't just capability. It's discipline. The compiler run hit a regression at turn 512, when a refactoring pass broke two tests. The model caught it, diagnosed the failure, and recovered without being told to. That kind of self-correction across hundreds of tool calls is what separates a model that can code from one that can actually finish something.
What the benchmarks say
| Benchmark | MiMo-V2.5-Pro | Claude Opus 4.6 | GPT-5.4 | Gemini 3.1 Pro |
|---|---|---|---|---|
| SWE-Bench Pro | 57.2 | 57.3 | 57.7 | 54.2 |
| MiMo Coding Bench | 73.7 | 77.1 | 75.1 | 67.8 |
| Claw-Eval pass^3 | 63.8 | 70.4 | 60.3 | 57.8 |
| HLE | 48.0 | 53.0 | 58.7 | 51.4 |
These are self-reported numbers from Xiaomi, but the pattern is honest. On SWE-Bench Pro it sits within half a point of Claude Opus 4.6 and GPT-5.4. On Claw-Eval it beats GPT-5.4 and Gemini 3.1 Pro. Where it trails is on HLE, the knowledge-heavy reasoning test where the frontier closed models still have a clear win. This is a coding-first model and the benchmarks reflect that accurately.
One thing worth noting. MiMo Coding Bench is Xiaomi’s own internal evaluation suite. The numbers there should be read with that in mind.
The token efficiency nobody is talking about
Benchmark scores get all the attention. Token costs are what actually determine whether developers adopt a model in production.
On Claw-Eval, MiMo-V2.5-Pro hits 64% Pass@3 using roughly 70,000 tokens per trajectory. Xiaomi claims that's 40-60% fewer tokens than Claude Opus 4.6, Gemini 3.1 Pro, and GPT-5.4 need to reach comparable scores. These are self-reported figures so treat them directionally, but if accurate the cost difference at production scale is significant.
That gap compounds fast. If you’re running hundreds of agentic tasks a day, the difference between 70K tokens and 120K tokens per trajectory isn’t marginal. It’s the difference between a workflow that’s economically viable and one that isn’t.
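A back-of-envelope calculation makes the compounding concrete. The per-trajectory token counts come from the article; the task volume and the per-million-token price are illustrative assumptions, not published pricing.

```python
def daily_token_cost(tasks_per_day, tokens_per_trajectory, usd_per_million_tokens):
    """Rough daily spend for an agentic workload at a flat token price."""
    return tasks_per_day * tokens_per_trajectory * usd_per_million_tokens / 1_000_000

# 500 tasks/day and $3 per million tokens are assumed numbers.
efficient = daily_token_cost(500, 70_000, 3.0)    # ~70K tokens per trajectory
baseline = daily_token_cost(500, 120_000, 3.0)    # ~120K tokens per trajectory
print(f"efficient: ${efficient:.2f}/day, baseline: ${baseline:.2f}/day")
```

At these assumed rates that's $105 versus $180 a day, and the ratio holds at any price: a 40% token reduction is a 40% cost reduction, every day, on every task.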
This is the part of MiMo-V2.5-Pro that doesn’t show up in the headline comparisons but matters most to anyone building on top of it.
How to try it and what open source means
MiMo-V2.5-Pro is in public beta on Xiaomi's API Platform and AI Studio right now. No waitlist. Switch the model tag to `mimo-v2.5-pro` and you're using it. Pricing hasn't changed from V2-Pro.
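If the platform exposes an OpenAI-compatible chat endpoint (an assumption; check Xiaomi's platform docs), switching models really is just the tag. The URL below is a placeholder, not a documented address; only the `mimo-v2.5-pro` tag comes from the beta announcement.

```python
import json
import urllib.request

API_URL = "https://example.invalid/v1/chat/completions"  # placeholder endpoint
API_KEY = "YOUR_API_KEY"

def build_request(prompt: str) -> urllib.request.Request:
    """Assemble a chat-completions request; the model tag is the only switch."""
    payload = {
        "model": "mimo-v2.5-pro",  # change this tag to change models
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
    )

req = build_request("Write a Rust lexer for SysY tokens.")
# urllib.request.urlopen(req) would send it; omitted here.
```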
It works with agentic scaffolds: Claude Code, OpenCode, and Kilo are all supported. For complex long-horizon tasks Xiaomi recommends pairing it with a proper harness, which the compiler and video editor demos made obvious.
On open source, Xiaomi confirmed the V2.5 series will be released publicly. No specific date. We’ll flag it when weights drop. Until then the API is the only way in, and honestly for most use cases that’s the practical route anyway.
Who should care
Developers running agentic coding workflows at any volume. The token efficiency story alone makes it worth evaluating against what you’re currently using.
Anyone building complex software autonomously: the compiler and video editor demos are evidence of something real about long-horizon coherence that most models still struggle with.
And if you’ve been waiting for a frontier-competitive open source coding model, watch this one. The weights aren’t here yet but Xiaomi has a track record of following through. When they drop, this will be one of the more capable open models available for coding work.