
Mistral Just Replaced Three of Its Own Models With One. Meet Medium 3.5


Mistral has been shipping specialized models for a while now. One for coding. One for reasoning. One for chat. Each one doing its thing separately and requiring a different deployment decision.

Medium 3.5 ends that confusion. One 128B dense model, one set of weights, handling instruction following, reasoning, and coding together. Mistral didn't just release a new model; it retired three existing ones to make room for it. Devstral 2, Magistral, and even Medium 3.1 are gone. Medium 3.5 replaced all of them.

That’s either a sign of real confidence or a very expensive consolidation bet. Looking at the benchmarks, it’s starting to look like the former.

What they actually built

128B parameters, dense architecture, no mixture-of-experts routing. That's a deliberate choice in an era when everyone is going sparse. Dense means predictable: the same compute per token every time, no routing overhead, no expert load-balancing headaches in production.

The context window sits at 256K tokens. Vision is built in. Mistral trained the encoder from scratch to handle variable image sizes and aspect ratios rather than bolting on an existing one. Reasoning effort is configurable per request, meaning the same model handles a quick chat reply and a complex multi-step agentic run without switching between deployments.

But before anything else: the license is a modified MIT, not standard MIT. Commercial use is allowed but companies above a certain revenue threshold hit exceptions. Mistral has been loose with the word open source here and that’s worth knowing before you build on it. More on that later.

The three models it replaced and why

Devstral 2 was Mistral’s dedicated coding agent. It lived inside their Vibe CLI and handled agentic coding tasks specifically. Medium 3.5 now powers Vibe instead and according to Mistral’s own benchmarks, it outperforms Devstral 2 across every agentic benchmark they ran.

Magistral was their reasoning model. The one you reached for when you needed the model to think carefully through a problem. Medium 3.5 replaces it in Le Chat with configurable reasoning effort that toggles between fast response mode and deep reasoning mode per request. Same deployment, two behaviors.
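As a sketch of what "same deployment, two behaviors" looks like in practice, here is how a per-request effort setting might be passed to a chat-completions style API. The payload shape follows the common chat API convention; the `reasoning_effort` field name and the `mistral-medium-3.5` model identifier are assumptions, not documented parameters, so check Mistral's API reference before relying on them.

```python
# Sketch: one deployment, two behaviors via a per-request reasoning knob.
# NOTE: "reasoning_effort" and the model identifier are hypothetical here,
# chosen for illustration -- verify the real parameter names in the API docs.

def build_chat_request(prompt: str, effort: str = "low") -> dict:
    """Build a chat-completions payload with a per-request effort setting."""
    return {
        "model": "mistral-medium-3.5",   # hypothetical model identifier
        "messages": [{"role": "user", "content": prompt}],
        "reasoning_effort": effort,      # hypothetical: "low" or "high"
    }

quick = build_chat_request("Summarize this ticket.", effort="low")
deep = build_chat_request("Plan a multi-step refactor.", effort="high")
```

The point is operational: both requests hit the same weights and the same endpoint, so there is nothing to route between, only a knob to turn.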

Medium 3.1 was the previous generation of this exact tier. The internal comparison tells the story cleanly: TAU3 Telecom went from 60.5 on Medium 3.1 to 91.4 on Medium 3.5, TAU3 Airline from 53.5 to 72.0, and TAU3 Retail from 70.2 to 76.1. These are the kind of numbers that make you question why you'd keep the old version around at all.

Mistral retired those models because Medium 3.5 made them redundant.


The agentic numbers

Agentic Benchmarks of Mistral Medium 3.5 with competing models

SWE-bench Verified at 77.6 is the headline coding result. For context, Claude Sonnet 4.6 sits at 79.6, Claude Sonnet 4.5 at 77.2, and Kimi K2.5 at 76.8. Medium 3.5 is in that conversation, not leading it but genuinely competitive with models that get significantly more attention.

TAU3 Telecom is where it pulls ahead of most of the field: 91.4, behind only GLM-5.1 at 98.7 and Qwen3.5 at 97.8, which puts Medium 3.5 third. On TAU3 Airline it scores 72.0, matching Claude Sonnet 4.5 exactly. TAU3 Retail at 76.1 lands just above Sonnet 4.6 at 75.9 and well clear of Sonnet 4.5 at 72.4.

BrowseComp is where it gets more honest: 48.6 against Kimi K2.5's 74.7 and Claude Sonnet 4.5's 43.9. Ahead of Sonnet 4.5, well behind Kimi. TAU3 Banking at 13.4 is the weakest number. Claude Sonnet 4.5 at 22.4 and Kimi at 14.9 show the category is hard across the board, but 13.4 is still the bottom of that chart.

Medium 3.5 competes in agentic coding and multi-step execution. It doesn’t dominate. It earns its place in the conversation without being the clear winner of it.

Benchmarks

| Benchmark | What it tests | Mistral Medium 3.5 | Claude Sonnet 4.6 | Kimi K2.5 | GLM-5.1 |
|---|---|---|---|---|---|
| SWE-bench Verified | Real GitHub issue resolution | 77.6 | 79.6 | 76.8 | 80.2 |
| TAU3 Telecom | Multi-step agent execution | 91.4 | 70.4 | 86.8 | 98.7 |
| TAU3 Airline | Multi-step agent execution | 72.0 | 83.0 | 76.5 | 79.5 |
| TAU3 Retail | Multi-step agent execution | 76.1 | 75.9 | 72.8 | 76.3 |
| TAU3 Banking | Multi-step agent execution | 13.4 | 28.4 | 14.9 | 16.2 |
| BrowseComp | Web browsing and research | 48.6 | 43.9 | 74.7 | 74.9 |

All comparisons use Mistral's own published benchmarks. These are self-reported results; the standard caveats apply.

One thing to note: TAU3 Banking scores are low for every model on this chart. That's less a Mistral problem and more a signal that banking-domain agent tasks are genuinely hard for current models, regardless of who built them.


The license question

The actual license is a modified MIT with revenue exceptions for companies above a certain size. That’s not standard MIT and it’s not Apache 2.0.

For individual developers, small teams, and most startups this won’t matter day to day. Commercial use is allowed within the license terms. But if you’re at a company with significant revenue and you’re planning to build production systems on top of this, read the full modified MIT license before committing. The exceptions are real and the definition of what triggers them matters.

Mistral has been consistently loose with the “open source” label across their releases. This one follows that pattern. Open weight with a permissive custom license is the accurate description. Worth knowing before you architect around it.

How to try it

Available on Ollama for the quickest start. GGUFs from Unsloth work through llama.cpp and LM Studio support is listed as work in progress. For production serving vLLM and SGLang both have day-one support. EAGLE speculative decoding is available for both to speed up local inference if latency matters.
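For a quick local sanity check, Ollama exposes an HTTP API on localhost:11434 by default, so a short script can hit a pulled model directly. The model tag `mistral-medium` below is a placeholder, not a confirmed tag; use whatever name `ollama pull` actually publishes for Medium 3.5.

```python
# Minimal local check against Ollama's /api/chat endpoint.
# The model tag "mistral-medium" is a placeholder -- substitute the
# real tag Ollama lists for Medium 3.5.
import json
import urllib.request

def chat_body(prompt: str, model: str = "mistral-medium") -> dict:
    """Payload for Ollama's /api/chat endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # return one JSON object instead of a token stream
    }

def ollama_chat(prompt: str, model: str = "mistral-medium") -> str:
    req = urllib.request.Request(
        "http://localhost:11434/api/chat",
        data=json.dumps(chat_body(prompt, model)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["message"]["content"]

# ollama_chat("Write a binary search in Python.")  # requires Ollama running
```

Setting `stream` to `False` keeps the example simple; production clients usually stream tokens instead.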

For most developers the Mistral API is the practical starting point before committing to local infrastructure. The model powers Le Chat now so you can get a feel for it there without any setup.

One important note if you're using GGUFs: there was a bug in the original Transformers config that caused long-context performance degradation. Make sure any GGUF you download was generated after the fix was merged. Older files will give you subpar results, especially in long sessions.

Who should care

If you were already using Devstral 2 for agentic coding this is your direct upgrade. Same workflow, better numbers, one fewer deployment to maintain.

If you were using Magistral for reasoning tasks, Medium 3.5 covers that with configurable effort. The convenience argument alone is worth evaluating.

If you’re building production agentic systems and need a model that handles coding, reasoning, and instruction following in one deployment without managing multiple specialized models, this is the most direct answer Mistral has given to that problem.

128B dense is not consumer hardware. You’re looking at serious GPU infrastructure or the Mistral API for anything production grade. If that’s not your situation, the model is still accessible via Ollama for evaluation but local inference at full precision needs decent hardware.
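The arithmetic behind that hardware claim is simple enough to sketch. This counts weights only; KV cache and activations add more on top, and at 256K context the KV cache alone is substantial.

```python
# Back-of-the-envelope VRAM math for a 128B-parameter dense model,
# weights only (KV cache and activations are extra).
PARAMS = 128e9

def weights_gb(bits_per_param: int) -> float:
    """Memory for the weights alone, in gigabytes."""
    return PARAMS * bits_per_param / 8 / 1e9

for label, bits in [("fp16/bf16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: ~{weights_gb(bits):.0f} GB")
# fp16/bf16: ~256 GB, 8-bit: ~128 GB, 4-bit: ~64 GB
```

Even the 4-bit figure is beyond a single consumer GPU, which is why the API or multi-GPU serving is the realistic path for most teams.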

Also if revenue thresholds matter to your legal team, get the modified MIT terms reviewed before building on this.
