MiniMax handed an internal version of M2.7 a programming scaffold and let it run unsupervised. Over 100 rounds it analyzed its own failures, modified its own code, ran evaluations, and decided what to keep and what to revert. The result was a 30% performance improvement with nobody directing each step. That is not a benchmark result. That is a different way of thinking about how AI models get built.
M2.7 is now available on HuggingFace with weights you can download and deploy. NVIDIA is offering free API access if you want to try it without the hardware overhead. The license has a commercial limitation worth knowing about; we will get to that.
What self-evolution actually means here
MiniMax used M2.7 during its own development to update memory, build skills for reinforcement learning experiments, and improve its own learning process based on experiment results. The model was a participant in its own training pipeline.
The clearest demonstration is the MLE Bench Lite result. MiniMax gave M2.7 access to 22 machine learning competitions, each runnable on a single A30 GPU, and let it run three 24-hour trials with a simple harness built around short-term memory, self-feedback, and self-optimization. After each round the model generated a memory file, criticized its own results, and fed those observations into the next round.
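The loop MiniMax describes is simple enough to sketch. Below is a toy, hedged version of a keep-or-revert self-improvement harness. The structure (per-round memory, evaluate, keep or revert) follows the description above, but every name here is illustrative, and the stand-in `evaluate` and `propose_change` functions replace the real ML competitions and model-driven edits.

```python
# Toy sketch of a keep-or-revert self-improvement loop, loosely modeled on
# the harness described above. `propose_change` and `evaluate` are stand-ins;
# a real setup would have the model edit its own scaffold and run real tasks.
import random

def evaluate(params):
    # Stand-in objective: higher is better, peaks at params == 0.5.
    return 1.0 - abs(params - 0.5)

def propose_change(params, memory):
    # A real harness would ask the model to propose an edit, conditioned on
    # the memory of past rounds. Here: a random perturbation.
    return params + random.uniform(-0.1, 0.1)

def self_evolve(rounds=100, seed=0):
    random.seed(seed)
    params, best_score = 0.0, evaluate(0.0)
    memory = []  # per-round notes: what was tried, whether it was kept
    for i in range(rounds):
        candidate = propose_change(params, memory)
        score = evaluate(candidate)
        kept = score > best_score
        if kept:           # keep the change...
            params, best_score = candidate, score
        # ...otherwise revert (simply don't adopt the candidate)
        memory.append({"round": i, "score": score, "kept": kept})
    return params, best_score, memory

if __name__ == "__main__":
    params, score, memory = self_evolve()
    print(f"final score {score:.3f} after {len(memory)} rounds")
```

The interesting property, and the one the MLE Bench Lite result points at, is that the memory file persists across rounds, so later proposals can be conditioned on everything that already failed.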
The best run achieved 9 gold medals, 5 silver medals, and 1 bronze across those 22 competitions. The average medal rate across all three trials was 66.6%, second only to Opus 4.6 at 75.7% and GPT-5.4 at 71.2%.
What makes this interesting is not the medal count. It is that the improvement was continuous across all three 24-hour windows. The model kept finding better approaches the longer it ran, which connects directly to the long-horizon behavior that makes agentic models actually useful in production.
What M2.7 can do
The benchmark that matters most for developers is SWE-Pro, which tests real software engineering across multiple programming languages. M2.7 scores 56.22%, matching GPT-5.3-Codex. On SWE Multilingual it scores 76.5 and on Multi SWE Bench 52.7, both of which test closer to real world engineering scenarios.
It can correlate monitoring metrics with deployment timelines, run statistical analysis on trace data, connect to databases to verify root causes, and make SRE level decisions about how to stop the bleeding before submitting a fix. MiniMax claims it has reduced live production incident recovery time to under three minutes on multiple occasions using M2.7.
On VIBE-Pro, which tests end-to-end full project delivery across web, Android, iOS, and simulation tasks, M2.7 scores 55.6%, close to Opus 4.6. That means you can hand it a complete project requirement and expect something usable back.
Native Agent Teams support is the other practical capability. The model can maintain stable role identity across multi-agent setups, make autonomous decisions within complex state machines, and challenge other agents on logical gaps. That is not prompt engineering; it is internalized behavior.
The office and productivity angle
Software engineering gets most of the attention with agentic models, but M2.7 has a serious productivity story. On GDPval-AA, which measures professional task delivery across real office scenarios, M2.7 scores an Elo of 1495. That is the highest among open source models and sits above GPT-5.3, though Opus 4.6, Sonnet 4.6, and GPT-5.4 still lead it.
The practical capability is in document work. M2.7 handles Word, Excel, and PPT with multi-round, high-fidelity editing, meaning you can give it an existing file, ask for revisions across multiple interactions, and get back something editable. MiniMax demonstrated this with a TSMC financial analysis task where the model read annual reports, cross-referenced research, built a revenue forecast model, and produced a finished PPT and Word report. Their own finance practitioners called the output usable as a first draft.
On Toolathon it scores 46.3%, which puts it in the global top tier for tool use accuracy. It maintains 97% skill compliance across 40 complex skills on MM Claw, each skill exceeding 2,000 tokens. That last number matters for anyone building agent workflows with large, complex skill libraries.
Related: GLM 5.1: The open source model that gets better the longer you run it
License: what open source actually means here
This is the part to read carefully before building anything on M2.7.
The license looks like MIT at first glance, but it is not MIT. Non-commercial use is free with no restrictions. Commercial use requires prior written authorization from MiniMax. You need to contact [email protected] and get approval before shipping any product that uses M2.7 or charges users for access to it.
There is also a display requirement. Any commercial use must prominently show “Built with MiniMax M2.7” on a related website, interface, or documentation.
For researchers, students, hobbyists, and anyone experimenting locally, none of this affects you. For developers building commercial products, get in touch with MiniMax before you ship. The weights are available and the model is genuinely capable; just go in with clear eyes about what the license actually permits.
How to try it today
The fastest way is NVIDIA’s free API access. There is no local setup and no hardware requirement: just an API key, and you are talking to M2.7 immediately. If you want to evaluate the model before committing to anything, start here.
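If NVIDIA exposes M2.7 the way it exposes its other hosted models, the endpoint is OpenAI-compatible and a first call looks roughly like the sketch below. The base URL, the model identifier, and the `NVIDIA_API_KEY` environment variable name are all assumptions to verify against NVIDIA's own documentation before use.

```python
# Hedged sketch: calling M2.7 through an assumed OpenAI-compatible endpoint.
# ASSUMPTIONS: base URL, model id, and env var name; check NVIDIA's docs.
import json
import os
import urllib.request

BASE_URL = "https://integrate.api.nvidia.com/v1"  # assumed NVIDIA endpoint
MODEL = "minimaxai/minimax-m2.7"                  # assumed model identifier

def build_request(prompt, model=MODEL, temperature=0.7):
    """Build the JSON body for a chat-completions call."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }

def chat(prompt):
    body = json.dumps(build_request(prompt)).encode()
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={
            "Authorization": f"Bearer {os.environ['NVIDIA_API_KEY']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(chat("Refactor this function and explain your changes."))
```

Because the interface follows the chat-completions convention, the same client code should point at a local SGLang or vLLM server later just by changing `BASE_URL`.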
For local deployment the weights are on HuggingFace. SGLang is the recommended inference framework, with vLLM and Transformers also supported. Be honest with yourself about the hardware requirements before going this route; this is a large model, and local deployment needs serious infrastructure.
MiniMax Agent at agent.minimax.io gives you a hosted interface if you want to test the agentic capabilities without any setup at all. The API platform at platform.minimax.io is the developer path for anyone building on top of it within the license terms.
Top tier AI in your hands
M2.7 is one of the more capable agentic models available with public weights right now. The self-evolution story is not just interesting backstory; it shows up in the benchmark results and in the kind of sustained improvement over long-running tasks that most models cannot maintain.
The software engineering numbers are competitive with the best closed models. The office productivity angle is genuinely useful for teams doing real document work. The 66.6% medal rate on MLE Bench Lite, achieved autonomously over 24 hour windows, tells you something real about how this model behaves when you give it a hard problem and step back.
The ceiling on what you can do with M2.7 is genuinely high. The question is whether your use case fits within the terms.