Most models quit around submission 30 because they stop finding improvement and exit on their own. That’s what happened when MiniMax ran a CUDA kernel optimization task against a field of frontier models. Every model except two called it done within the first 30 submissions.
M3’s best result came on submission 145. After 24 hours. After multiple plateaus where the numbers stopped moving and a reasonable model would have concluded there was nothing left to find.
That’s the thing MiniMax released yesterday. An AI model with a 1M token context window, native multimodality, and apparently a problem with knowing when to stop.
What M3 is
M3 is an open-weight model with a 1 million token context window, native multimodal support for images and video, and what MiniMax describes as frontier-level coding and agentic performance. The weights aren’t out yet. MiniMax says they’re coming in about 10 days, but the API is live, and MiniMax Code, their agent product built specifically around M3, is available now.
Closed models with these capabilities exist. What hasn’t existed until now is one you can actually run, inspect, and build on. MiniMax is explicit that M3 is the first open-weight model combining all three of these capabilities together: long context at this scale, native multimodality from step one of training, and agentic performance that competes with the frontier closed models.
The architecture behind the context window
A 1M token context window is only useful if the model can actually reason across it without the whole thing becoming unwieldy. Most long-context models struggle here. The attention mechanism that makes transformers work gets quadratically more expensive as context grows, and at 1M tokens that cost becomes either prohibitive or a hidden quality tradeoff.
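To make the quadratic growth concrete, here is a minimal sketch that counts query-key score computations for dense causal attention, per head and per layer. The function name is illustrative, and the count ignores everything except the attention score matrix.

```python
def dense_attention_pairs(n_tokens: int) -> int:
    """Number of query-key scores for dense causal self-attention over n tokens."""
    # Token i attends to itself and every earlier token: 1 + 2 + ... + n.
    return n_tokens * (n_tokens + 1) // 2


if __name__ == "__main__":
    for n in (8_000, 128_000, 1_000_000):
        print(f"{n:>9,} tokens -> {dense_attention_pairs(n):,} score computations")
```

Going from 8K to 1M tokens is 125 times more context but roughly 15,600 times more score computations, which is why dense attention stops being practical at this scale.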
MiniMax built a new attention architecture for M3 called MSA, MiniMax Sparse Attention. The short version: instead of every token attending to every other token, MSA partitions the context into blocks and routes attention more precisely. At 1M tokens, per-token compute drops to 1/20th of what their previous model needed. Prefilling runs more than 9x faster, decoding more than 15x faster.
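MiniMax has not published MSA's internals yet, so the following is only a generic block-sparse attention sketch, not MSA itself. It assumes one common routing scheme: score each block by its mean key vector, keep the `top_k` best blocks, and run ordinary softmax attention over just those positions. The function names, the mean-key routing, and the `top_k` parameter are all assumptions made for illustration. Causal masking is omitted for brevity.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def block_sparse_attend(query, keys, values, block_size, top_k):
    """Attend one query to only the top_k most relevant key blocks.

    Illustrative only: the routing rule is an assumption, not MSA's design.
    """
    # Partition key positions into contiguous blocks.
    blocks = [range(start, min(start + block_size, len(keys)))
              for start in range(0, len(keys), block_size)]

    # Cheap routing step: score each block by its mean key vector.
    # (A real system would cache these summaries instead of recomputing them.)
    def block_score(block):
        mean_key = [sum(keys[i][d] for i in block) / len(block)
                    for d in range(len(query))]
        return dot(query, mean_key)

    chosen = sorted(blocks, key=block_score, reverse=True)[:top_k]
    positions = sorted(i for block in chosen for i in block)

    # Ordinary scaled softmax attention, restricted to the chosen positions.
    scale = 1.0 / math.sqrt(len(query))
    scores = [dot(query, keys[i]) * scale for i in positions]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    output = [sum(w * values[i][d] for w, i in zip(weights, positions)) / total
              for d in range(len(values[0]))]
    return output, positions


if __name__ == "__main__":
    # Two blocks of four keys; the query points toward the second block.
    keys = [[1.0, 0.0]] * 4 + [[0.0, 1.0]] * 4
    values = [[float(i)] for i in range(8)]
    out, used = block_sparse_attend([0.0, 5.0], keys, values, block_size=4, top_k=1)
    print(used, out)
```

In this toy run the query scores only 4 of the 8 keys. With a fixed `top_k`, the number of keys scored per query stays constant as the context grows, which is the general mechanism behind per-token compute savings of the kind MiniMax reports.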
The reason this matters for the agentic story specifically is that long-horizon tasks generate dense, structured context fast. Every tool call, every result, every iteration adds to the pile. A model that degrades as that pile grows isn’t actually useful for 24-hour tasks regardless of what the benchmarks say. MSA is MiniMax’s answer to that specific problem, and the CUDA kernel run is arguably the best stress test they could have picked to demonstrate it.
The benchmarks
| Benchmark | MiniMax M3 | Claude Opus 4.7 | GPT-5.5 | Gemini 3.1 Pro | Top Performer |
| SWE Bench Pro | 59.0 | 64.3 | 58.6 | 54.2 | Claude Opus 4.7 |
| Terminal Bench 2.1 | 66.0 | 66.1 | 78.2 | 70.0 | GPT-5.5 |
| VIBE V2 | 50.1 | 55.8 | 50.5 | 28.0 | Claude Opus 4.7 |
| SVG-Bench | 63.7 | 62.3 | 58.2 | 59.2 | MiniMax M3 |
| KernelBench Hard | 28.8 | 30.7 | 20.9 | 18.6 | Claude Opus 4.7 |
| BrowseComp | 83.5 | 79.3 | 84.4 | 85.9 | Gemini 3.1 Pro |
| GDPval rubrics | 74.7 | 79.8 | 80.6 | 57.8 | GPT-5.5 |
| BankerToolBench | 76.1 | 81.3 | 70.0 | 67.0 | Claude Opus 4.7 |
| MCP Atlas | 74.2 | 77.0 | 75.3 | 69.2 | Claude Opus 4.7 |
| OSWorld-verified | 70.0 | 82.8 | 78.7 | 76.2 | Claude Opus 4.7 |
All numbers in this section are self-reported by MiniMax.
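As a quick sanity check on the table above, this short sketch recomputes the top performer for each row and M3's gap to the leader. The scores are copied from the table. The dictionary layout and function names are just one convenient way to hold them.

```python
MODELS = ("MiniMax M3", "Claude Opus 4.7", "GPT-5.5", "Gemini 3.1 Pro")

# Scores copied from the table, in the same column order as MODELS.
TABLE = {
    "SWE Bench Pro":      (59.0, 64.3, 58.6, 54.2),
    "Terminal Bench 2.1": (66.0, 66.1, 78.2, 70.0),
    "VIBE V2":            (50.1, 55.8, 50.5, 28.0),
    "SVG-Bench":          (63.7, 62.3, 58.2, 59.2),
    "KernelBench Hard":   (28.8, 30.7, 20.9, 18.6),
    "BrowseComp":         (83.5, 79.3, 84.4, 85.9),
    "GDPval rubrics":     (74.7, 79.8, 80.6, 57.8),
    "BankerToolBench":    (76.1, 81.3, 70.0, 67.0),
    "MCP Atlas":          (74.2, 77.0, 75.3, 69.2),
    "OSWorld-verified":   (70.0, 82.8, 78.7, 76.2),
}


def top_performer(benchmark: str) -> str:
    """Return the model with the highest score on a benchmark."""
    row = dict(zip(MODELS, TABLE[benchmark]))
    return max(row, key=row.get)


def gap_to_leader(benchmark: str, model: str = "MiniMax M3") -> float:
    """Points between the leader and the given model (0.0 if it leads)."""
    row = dict(zip(MODELS, TABLE[benchmark]))
    return round(max(row.values()) - row[model], 1)


if __name__ == "__main__":
    for name in TABLE:
        print(f"{name:<20} leader: {top_performer(name):<16} M3 gap: {gap_to_leader(name)}")
```

Sorting the rows by that gap is a quick way to see where M3 is within a point or two of the leader and where it trails by double digits.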
On Claw-Eval, which tests end-to-end autonomous agent performance, M3 scores 74.5 against Claude Opus 4.7’s 71.6 and Gemini 3.1 Pro’s 57.8. On SVG-Bench it leads the entire comparison at 63.7, ahead of Opus 4.7 at 62.3 and GPT-5.5 at 58.2. KernelBench Hard, which tests the kind of low-level optimization work the CUDA task exemplifies, has M3 at 28.8, just behind Opus 4.7’s 30.7 and well ahead of GPT-5.5’s 20.9 and Gemini 3.1 Pro’s 18.6. SpreadsheetBench puts it at 89.35, competitive with every closed model in the comparison.
The pattern across these isn’t “M3 beats everything.” It’s more specific than that. The benchmarks where M3 leads tend to be the ones that reward persistence, structured output, and long-context coherence. The ones where it lags (SWE-fficiency, Apex-Agents, OSWorld) tend to favor precise single-step execution or GUI interaction. That’s a consistent profile, not a scattered one, and it matches what the CUDA story already suggested.
Limitations
OSWorld, which tests a model’s ability to operate a real desktop GUI, has M3 at 70.06 against Opus 4.7’s 82.8 and GPT-5.5’s 78.7. That’s not close. SWE-fficiency, which measures how efficiently a model solves software engineering tasks rather than just whether it solves them, has M3 at 34.8 against Opus 4.7’s 42.2. M3 trails on Apex-Agents too, at 27.7 against GPT-5.5’s 41.7.
M3 is strong when the task rewards persistence and long-context coherence. It’s weaker when the task demands accurate single-step execution, especially anything involving GUI interaction or strict instruction following across many steps. MiniMax doesn’t hide this. The model card flags the agentic gaps directly.
How to try it
The API is live now on MiniMax’s platform. Pricing splits at 512K tokens: a standard rate below that, and a higher rate above it for long-document and full-repository work. Thinking mode can be toggled per request.
The weights aren’t available yet. MiniMax says they’re coming within 10 days, along with the technical report. For now, MiniMax Code, their agent product built specifically around M3, is available as a desktop app and runs on token-based subscription plans starting at $20 a month.
Submission 145
Every other model in that CUDA test stopped making progress and exited. M3 kept going and found its best result 115 submissions later. The paper reproduction task is the same story: 12 hours, 18 commits, and no human in the loop.
Once the weights are released on Hugging Face, the community may also produce quantized versions of the model that run on consumer hardware.




