back to top
HomeTechGLM 5.1: The open source model that gets better the longer you...

GLM 5.1: The open source model that gets better the longer you run it

- Advertisement -

Give an AI agent a hard problem and it usually figures out the easy wins fast. After that, more time does not help. It just sits there, trying the same things.

ZhipuAI ran GLM-5.1 on a vector database optimization problem and let it go for 600 iterations. It did not run out of ideas. At iteration 50 it was sitting at roughly the same performance as the best single-session result any model had achieved. By iteration 600 it had reached 21,500 queries per second. The previous best was 3,547.

That gap is not incremental improvement. It is a different category of result. GLM-5.1 is open source, MIT licensed, and the weights are on HuggingFace right now. It works with Claude Code, vLLM, and SGLang. If you are building anything that runs agents over long tasks, this one is worth understanding.

What changed from GLM-5

GLM-5.1 is ZhipuAI’s latest open source model, released yesterday under the MIT license. It is a 744 billion parameter mixture of experts architecture with 40 billion parameters active at any time.

It is built specifically for agentic engineering, meaning long running tasks where the model needs to write code, run tests, read results, and decide what to try next without someone holding its hand. ZhipuAI describes it as a direct upgrade to GLM-5, with the main improvement being how it handles tasks that take hundreds of turns rather than just a few dozen.

The MIT license is worth pausing on. No commercial restrictions, no usage caveats, no community license that sounds open but has carve outs buried in the terms. You can use it, build on it, and ship with it.

When 50 turns is just the beginning

ZhipuAI set up a coding challenge where the model had to build a high performance vector database in Rust, optimizing for queries per second while keeping recall above 95%. The previous best result from any model in a standard 50 turn session was 3,547 QPS. Claude Opus 4.6 held that record.

GLM-5.1 did not just beat it. It kept going. After 600 iterations and over 6,000 tool calls it reached 21,500 QPS. Six times the previous best.

What is interesting is how it got there. It did not just grind the same approach harder. Around iteration 90 it switched from scanning the full corpus to cluster based search with compressed vectors, jumping to 6,400 QPS. Around iteration 240 it introduced a two stage pipeline and hit 13,400. Six distinct strategic shifts over the full run, each one identified by the model after reading its own benchmark logs.

That last part matters. It was not following instructions to try something new. It looked at its own results, decided the current approach had run out of road, and changed direction. Twice the recall dropped below the 95% threshold while it was exploring a new approach, then it corrected and pushed further.

No other model in the comparison sustained that kind of improvement over that many turns. That is the actual claim here, and it comes from ZhipuAI’s own testing so treat it as a strong signal rather than settled fact until someone runs independent evals.

How it stacks up against the big names

GLM-5.1 is not the best model on every benchmark and we are not going to pretend it is. But where it competes is interesting.

On SWE-Bench Pro, which tests real software engineering tasks, it scores 58.4. GPT-5.4 scores 57.7. Gemini 3.1 Pro scores 54.2. For an open source MIT licensed model to sit above both of those on a coding benchmark is not a small thing.

CyberGym tells a similar story. GLM-5.1 scores 68.7 against Claude Opus 4.6’s 66.6 and GPT-5.4’s 66.3. These are cybersecurity tasks requiring real technical depth, not pattern matching.

Where it trails is raw reasoning. On HLE, which is essentially an extremely hard general knowledge and reasoning test, Gemini 3.1 Pro scores 45, GPT-5.4 scores 39.8, and GLM-5.1 scores 31. That gap is there. If your use case lives in deep reasoning or graduate level problem solving, the closed models still performs better.

So based on these benchmarks data we can say that For agentic coding tasks GLM-5.1 is genuinely competitive with the best closed models available right now. For broad reasoning it is not. Knowing which category your work falls into tells you whether this model is relevant to you.

How to run it

GLM-5.1 is on HuggingFace right now in both BF16 and FP8. The FP8 version is the practical one for most people, lower memory requirements without meaningful quality loss.

For local deployment it supports vLLM, SGLang, xLLM, and KTransformers. SGLang and vLLM both have Docker images ready to pull

Full deployment instructions are in the official GitHub repository. This is not a casual local install though. 744 billion parameters needs serious hardware, we will get to that in a moment.

If you want to use it inside Claude Code, it works. Set the model name to GLM-5.1 in your settings file and it connects. Same for OpenCode, Roo Code, Cline, and most other popular coding agents.

If self hosting is not what you want then ZhipuAI has it available via their API at api.z.ai. If you are on their GLM Coding Plan it is rolling out now, though during peak hours it bills at 3x quota. Off peak is 1x through end of April as a promotional rate.

You May Like: Open source AI agentic models built for real autonomous work

Who it is for & limitations

If you are building agentic workflows, running long coding tasks, or need a model you can point at a problem and leave running, GLM-5.1 is worth serious attention. The MIT license means you can build commercially without any conversation about terms.

The hardware reality is blunt though. 744B parameters in BF16 needs multiple high end GPUs to run properly. This is not something you spin up on a single consumer card. The FP8 version helps but you are still looking at significant infrastructure. If you are an individual developer without access to a multi GPU setup, the API is your realistic path in.

The long horizon behavior we covered earlier is also self reported. ZhipuAI designed and ran those tests. Until independent researchers reproduce the 600 iteration results under different conditions, treat it as a strong signal rather than a guaranteed property of the model.

Raw reasoning is the other honest gap. If your work depends on deep multi step reasoning rather than coding and agentic tasks, the closed frontier models still lead.

Related: Trinity-Large-Thinking: the open source brain your AI agents have been missing

Not for everyone, but hard to ignore

744 billion parameters with an MIT license that actually means what it says. Competitive with GPT-5.4 and Gemini on coding tasks. An ability to keep improving on problems that would make every other model give up.

The hardware barrier is real. This is not something most developers will self host. But the API is there, Claude Code support is there, and the license means nobody is going to pull the rug on you later.

If you are building agents that need to work hard on complex tasks over long sessions, this is the most interesting open source release in a while because it changes what you expect a model to still be doing two hours into a task.

Run it if you can. Watch it if you cannot.

LEAVE A REPLY

Please enter your comment!
Please enter your name here

YOU MAY ALSO LIKE
AI Content Got Too Real. Now OpenAI and Nvidia Are Using Google’s Watermarking System

AI Content Got Too Real. Now OpenAI and Nvidia Are Using Google’s Watermarking System.

0
Three years ago, Google introduced a watermarking system for AI-generated content called SynthID. Nobody was required to use it. It was just Google's answer to a problem the rest of the industry hadn't fully admitted existed yet. Now OpenAI is using it. So is Nvidia. So are ElevenLabs and Kakao. And Google says SynthID has already been applied to 100 billion images and videos, plus 60,000 years worth of audio. The timing matters. AI-generated images and video have gotten good enough that the old tells, the extra fingers, the smeared text, the wrong shadows, are mostly gone. What replaces them as a detection method isn't human judgment. It's watermarking inserted into the content at the point of generation, before it ever reaches anyone's feed. SynthID is Google's bet on how that works at scale, and a growing number of the industry's biggest names are now betting alongside it.
command a plus ai model

Cohere Open-Sourced Command A+, a 218B MoE Model Built for Enterprise Agents

0
Cohere spent the past year deploying North, its enterprise AI workspace, with actual customers doing actual work. Agentic question answering over company file systems. Data analysis across spreadsheets. Multi-session memory that has to hold up in production. Command A+ is what came out of that, a model shaped by a year of watching enterprise workflows break and figuring out why. The result is a 218B mixture-of-experts model with 25B active parameters at inference time, available today on Hugging Face under Apache 2.0. It replaces five separate models in the Command A family, each of which handled one thing. This one handles all of them, and on most of the tasks those specialist models were built for, it wins.
AI Was Used to Recreate the Voices of Dead Pilots. The NTSB Responded by Locking Down Its Database

AI Was Used to Recreate the Voices of Dead Pilots. The NTSB Responded by...

0
Last year, a UPS cargo plane went down in Louisville, Kentucky. The crew didn't survive. The NTSB opened an investigation, as it does with every major crash, and added the case files to its public docket system, as it also does. Transcripts, data, findings, all of it accessible to anyone who wanted to look. What nobody thought about was the spectrogram. A spectrogram is a visual representation of sound. It takes audio signals, breaks them down into frequencies, and renders them as an image. The NTSB included one in the Flight 2976 docket because federal law prohibits it from releasing actual cockpit voice recordings. The spectrogram felt like a reasonable middle ground, you could see that audio existed without being able to hear it. Then Scott Manley, a YouTuber with a background in physics, pointed out on X that spectrograms encode enough data to work backwards from. The image wasn't just a picture of sound. It contained the sound. People ran with it. Using AI tools, they took the spectrogram and the publicly available transcript and reconstructed approximations of what the cockpit voice recorder actually captured. The voices of two pilots who died in that crash started circulating online. The NTSB shut its entire public docket system down.

Don’t miss any Tech Story

Subscribe To Firethering NewsLetter

You Can Unsubscribe Anytime! Read more in our privacy policy