back to top
HomeTechMistral Just Replaced Three of Its Own Models With One. Meet Medium...

Mistral Just Replaced Three of Its Own Models With One. Meet Medium 3.5

- Advertisement -

Mistral has been shipping specialized models for a while now. One for coding. One for reasoning. One for chat. Each one doing its thing separately and requiring a different deployment decision.

Medium 3.5 ends that confusion. One 128B dense model, one set of weights, handling instruction following, reasoning, and coding together. Mistral didn’t just release a new model, they retired three existing ones to make room for it. Devstral 2, Magistral and even Medium 3.1 is gone. Medium 3.5 is what replaced all of them.

That’s either a sign of real confidence or a very expensive consolidation bet. Looking at the benchmarks, it’s starting to look like the former.

What they actually built

128B parameters, dense architecture, no mixture of experts routing. That’s a deliberate choice in an era where everyone is going sparse. Dense means predictable, same compute per token every time, no routing overhead, no expert load balancing headaches in production.

The context window sits at 256K tokens. Vision is built in. Mistral trained the encoder from scratch to handle variable image sizes and aspect ratios rather than bolting on an existing one. Reasoning effort is configurable per request, meaning the same model handles a quick chat reply and a complex multi-step agentic run without switching between deployments.

But before anything else: the license is a modified MIT, not standard MIT. Commercial use is allowed but companies above a certain revenue threshold hit exceptions. Mistral has been loose with the word open source here and that’s worth knowing before you build on it. More on that later.

The three models it replaced and why

Devstral 2 was Mistral’s dedicated coding agent. It lived inside their Vibe CLI and handled agentic coding tasks specifically. Medium 3.5 now powers Vibe instead and according to Mistral’s own benchmarks, it outperforms Devstral 2 across every agentic benchmark they ran.

Magistral was their reasoning model. The one you reached for when you needed the model to think carefully through a problem. Medium 3.5 replaces it in Le Chat with configurable reasoning effort that toggles between fast response mode and deep reasoning mode per request. Same deployment, two behaviors.

Medium 3.1 was the previous generation of this exact tier. The internal comparison tells that cleanly. TAU3 Telecom went from 60.5 on Medium 1.2 to 91.4 on Medium 3.5. TAU3 Airline from 53.5 to 72.0. TAU3 Retail from 70.2 to 76.1. These are the kind of numbers that make you question why you’d keep the old version around at all.

Mistral retire those models because Medium 3.5 made them redundant.

Related: Mistral Small 4: The Open Source Model Replacing Three of Mistral’s Own AI Models

The agentic numbers

Agentic Benchmarks of Mistral Medium 3.5 with competing models

SWE-bench Verified at 77.6 is the headline coding result. For context Claude Sonnet 4.6 sits at 79.6, Claude Sonnet 4.5 at 77.2, and Kimi K2.5 at 76.8. Medium 3.5 is in that conversation, not leading it but genuinely competitive with models that get significantly more attention.

Where it pulls ahead is TAU3 Telecom. 91.4 against GLM-5.1’s 98.7 and Qwen3.5’s 97.8 those two lead, Medium 3.5 is third. On TAU3 Airline it scores 72.0, matching Claude Sonnet 4.5 exactly. TAU3 Retail at 76.1 sits between Sonnet 4.5 at 72.4 and Sonnet 4.6 at 75.9.

BrowseComp is where it gets more honest. 48.6 against Kimi K2.5’s 74.7 and Claude Sonnet 4.5’s 43.9. Ahead of Sonnet 4.5, well behind Kimi. TAU3 Banking at 13.4 is the weakest number. Claude Sonnet 4.5 at 22.4 and Kimi at 14.9 both tell you this category is hard across the board, but 13.4 is still the bottom of that chart.

Medium 3.5 competes in agentic coding and multi-step execution. It doesn’t dominate. It earns its place in the conversation without being the clear winner of it.

Benchmarks

BenchmarkWhat it testsMistral Medium 3.5Claude Sonnet 4.6Kimi K2.5GLM-5.1
SWE-bench VerifiedReal GitHub issue resolution77.679.676.880.2
TAU3 TelecomMulti-step agent execution91.470.486.898.7
TAU3 AirlineMulti-step agent execution72.083.076.579.5
TAU3 RetailMulti-step agent execution76.175.972.876.3
TAU3 BankingMulti-step agent execution13.428.414.916.2
BrowseCompWeb browsing and research48.643.974.774.9

All comparisons are against Mistral’s own published benchmarks. Self-reported results, standard condition applies.

One thing to note that TAU3 Banking at 13.4 is low across every model on this chart. That’s less a Mistral problem and more a signal that banking domain agent tasks are genuinely hard for current models regardless of who built them.

You May Like: SenseNova-U1: Open Source AI That Understands and Generates Images in One Model

The license question

The actual license is a modified MIT with revenue exceptions for companies above a certain size. That’s not standard MIT and it’s not Apache 2.0.

For individual developers, small teams, and most startups this won’t matter day to day. Commercial use is allowed within the license terms. But if you’re at a company with significant revenue and you’re planning to build production systems on top of this, read the full modified MIT license before committing. The exceptions are real and the definition of what triggers them matters.

Mistral has been consistently loose with the “open source” label across their releases. This one follows that pattern. Open weight with a permissive custom license is the accurate description. Worth knowing before you architect around it.

How to try it

Available on Ollama for the quickest start. GGUFs from Unsloth work through llama.cpp and LM Studio support is listed as work in progress. For production serving vLLM and SGLang both have day-one support. EAGLE speculative decoding is available for both to speed up local inference if latency matters.

For most developers the Mistral API is the practical starting point before committing to local infrastructure. The model powers Le Chat now so you can get a feel for it there without any setup.

One important note if you’re using GGUFs: there was a bug in the original Transformers config that caused long-context performance degradation. Make sure any GGUF you download was generated after the fix was merged. Older files will give you subpar results especially on long sessions.

Who should care

If you were already using Devstral 2 for agentic coding this is your direct upgrade. Same workflow, better numbers, one fewer deployment to maintain.

If you were using Magistral for reasoning tasks, Medium 3.5 covers that with configurable effort. The convenience argument alone is worth evaluating.

If you’re building production agentic systems and need a model that handles coding, reasoning, and instruction following in one deployment without managing multiple specialized models, this is the most direct answer Mistral has given to that problem.

128B dense is not consumer hardware. You’re looking at serious GPU infrastructure or the Mistral API for anything production grade. If that’s not your situation, the model is still accessible via Ollama for evaluation but local inference at full precision needs decent hardware.

Also if revenue thresholds matter to your legal team, get the modified MIT terms reviewed before building on this.

Don’t miss any Tech Story

Subscribe To Firethering NewsLetter

You Can Unsubscribe Anytime! Read more in our privacy policy

LEAVE A REPLY

Please enter your comment!
Please enter your name here

YOU MAY ALSO LIKE
glm 5.2 ai open weights

GLM-5.2 Is the Closest an Open Model Has Come to Claude

0
What does it take for an open-weight model to stop chasing Claude and actually beat it? Every open-weight release for two years has told some version of the same story: closer, but not quite. The chart shrinks, the wording softens to "competitive with," and the conversation moves on until the next model repeats the cycle. GLM-5.2 breaks that pattern. The model is built to survive long, messy coding work, the kind that runs for hours without losing the thread. That's the pitch its maker is leading with. But scroll down their own benchmark table and something else is sitting there quietly: on a couple of standard math evals, this open model isn't approaching Claude Opus 4.8, GPT-5.5, or Gemini 3.1 Pro. It's beating all three, on the same table. It loses plenty of ground elsewhere, and that part matters just as much as the wins. But a model anyone can download under an MIT license, with no usage restrictions attached, coming out ahead of the lab everyone else measures themselves against, is worth pausing on before getting to what the rest of the numbers actually say.
Open-Source AI Tools Worth Trying Right Now

5 Open-Source AI Tools You Probably Haven’t Tried Yet

0
Every week brings another open source AI release, and most of them require setting up a Python environment. Find out the model card lied about VRAM requirements. By the time something actually runs, the appeal has mostly worn off. The five tools below skip most of that. One turns image and video generation into something closer to a desktop app. One gives DeepSeek an actual workspace instead of a browser tab. One builds UI prototypes using coding agents you probably already have installed. One quietly builds a memory system out of your own apps. And one is, literally, a desktop pet.
Claude Mythos 5 and Claude Fable 5

Claude Mythos 5 Was Too Powerful to Ship. Anthropic Released Fable 5 Instead.

0
Anthropic gave stripe early access to Fable 5 and set it loose on a 50 million line Ruby codebase. The migration that would have taken a full engineering team over two months got done in a day. That's a real company's real codebase and a task with real consequences if it goes wrong. Anthropic leads with it because it's the kind of result that's hard to argue with & because it sets up everything else they need to tell you about why this launch looks the way it does. Because here's the thing. The model Anthropic actually built Claude Mythos 5, isn't what most people are getting today. What's going live for general use is Claude Fable 5. Same underlying model. Different version. The parts Anthropic decided were too dangerous for public release got a separate wrapper, a separate name, and a separate approval process controlled in part by the US government.