
GLM 5.1: The open source model that gets better the longer you run it


Give an AI agent a hard problem and it usually figures out the easy wins fast. After that, more time does not help. It just sits there, trying the same things.

ZhipuAI ran GLM-5.1 on a vector database optimization problem and let it go for 600 iterations. It did not run out of ideas. At iteration 50 it was sitting at roughly the same performance as the best single-session result any model had achieved. By iteration 600 it had reached 21,500 queries per second. The previous best was 3,547.

That gap is not incremental improvement. It is a different category of result. GLM-5.1 is open source, MIT licensed, and the weights are on HuggingFace right now. It works with Claude Code, vLLM, and SGLang. If you are building anything that runs agents over long tasks, this one is worth understanding.

What changed from GLM-5

GLM-5.1 is ZhipuAI’s latest open source model, released yesterday under the MIT license. It is a 744 billion parameter mixture of experts architecture with 40 billion parameters active at any time.

It is built specifically for agentic engineering, meaning long running tasks where the model needs to write code, run tests, read results, and decide what to try next without someone holding its hand. ZhipuAI describes it as a direct upgrade to GLM-5, with the main improvement being how it handles tasks that take hundreds of turns rather than just a few dozen.

The MIT license is worth pausing on. No commercial restrictions, no usage caveats, no community license that sounds open but has carve outs buried in the terms. You can use it, build on it, and ship with it.

When 50 turns is just the beginning

ZhipuAI set up a coding challenge where the model had to build a high performance vector database in Rust, optimizing for queries per second while keeping recall above 95%. The previous best result from any model in a standard 50 turn session was 3,547 QPS. Claude Opus 4.6 held that record.

GLM-5.1 did not just beat it. It kept going. After 600 iterations and over 6,000 tool calls it reached 21,500 QPS. Six times the previous best.

What is interesting is how it got there. It did not just grind the same approach harder. Around iteration 90 it switched from scanning the full corpus to cluster based search with compressed vectors, jumping to 6,400 QPS. Around iteration 240 it introduced a two stage pipeline and hit 13,400. Six distinct strategic shifts over the full run, each one identified by the model after reading its own benchmark logs.
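ZhipuAI has not published the code the model wrote, but the cluster based search it switched to around iteration 90 is a classic inverted-file (IVF) pattern: group the corpus into clusters, then at query time probe only the nearest clusters instead of scanning everything. A minimal sketch in Python with NumPy, with all names and parameters hypothetical:

```python
import numpy as np

def build_ivf(corpus, n_clusters=8, n_iters=10, seed=0):
    # Crude k-means over a normalized corpus: returns centroids plus an
    # inverted list mapping each cluster to the ids of its member vectors.
    rng = np.random.default_rng(seed)
    centroids = corpus[rng.choice(len(corpus), n_clusters, replace=False)]
    for _ in range(n_iters):
        assign = np.argmax(corpus @ centroids.T, axis=1)
        for c in range(n_clusters):
            members = corpus[assign == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    assign = np.argmax(corpus @ centroids.T, axis=1)
    lists = {c: np.where(assign == c)[0] for c in range(n_clusters)}
    return centroids, lists

def ivf_search(query, corpus, centroids, lists, n_probe=2, k=5):
    # Probe only the n_probe closest clusters instead of the full corpus.
    probe = np.argsort(query @ centroids.T)[::-1][:n_probe]
    cand = np.concatenate([lists[c] for c in probe])
    scores = corpus[cand] @ query
    return cand[np.argsort(scores)[::-1][:k]]
```

The recall-versus-speed trade lives in n_probe: probe fewer clusters and QPS goes up while recall goes down, which is exactly the constraint the model was balancing. Production systems pair this with compressed vectors (product quantization) to shrink the scan further.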

That last part matters. It was not following instructions to try something new. It looked at its own results, decided the current approach had run out of road, and changed direction. Twice the recall dropped below the 95% threshold while it was exploring a new approach, then it corrected and pushed further.
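That behavior, reading benchmark output, guarding the recall floor, and changing strategy when gains flatten, amounts to a control loop. A hypothetical sketch of such a loop follows; the decision rule, thresholds, and names here are illustrative assumptions, not anything ZhipuAI has published:

```python
RECALL_FLOOR = 0.95  # the benchmark's hard constraint

def agent_step(run_benchmark, history):
    # Read the latest benchmark numbers, then pick one of three actions.
    qps, recall = run_benchmark()
    history.append((qps, recall))
    if recall < RECALL_FLOOR:
        return "repair_recall"      # constraint violated: fix recall first
    recent = [q for q, _ in history[-10:]]
    if len(recent) == 10 and max(recent) - min(recent) < 0.02 * max(recent):
        return "switch_strategy"    # QPS has plateaued: change approach
    return "keep_tuning"            # still improving: stay the course
```

The point of the sketch is the structure, not the thresholds: the model's novelty is that it derived decisions like these from its own logs rather than from a hand-written rule.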

No other model in the comparison sustained that kind of improvement over that many turns. That is the actual claim here, and it comes from ZhipuAI’s own testing so treat it as a strong signal rather than settled fact until someone runs independent evals.

How it stacks up against the big names

GLM-5.1 is not the best model on every benchmark and we are not going to pretend it is. But where it competes is interesting.

On SWE-Bench Pro, which tests real software engineering tasks, it scores 58.4. GPT-5.4 scores 57.7. Gemini 3.1 Pro scores 54.2. For an open source MIT licensed model to sit above both of those on a coding benchmark is not a small thing.

CyberGym tells a similar story. GLM-5.1 scores 68.7 against Claude Opus 4.6’s 66.6 and GPT-5.4’s 66.3. These are cybersecurity tasks requiring real technical depth, not pattern matching.

Where it trails is raw reasoning. On HLE, which is essentially an extremely hard general knowledge and reasoning test, Gemini 3.1 Pro scores 45, GPT-5.4 scores 39.8, and GLM-5.1 scores 31. That gap is real. If your use case lives in deep reasoning or graduate level problem solving, the closed models still perform better.

Based on these benchmarks, for agentic coding tasks GLM-5.1 is genuinely competitive with the best closed models available right now. For broad reasoning it is not. Knowing which category your work falls into tells you whether this model is relevant to you.

How to run it

GLM-5.1 is on HuggingFace right now in both BF16 and FP8. The FP8 version is the practical one for most people: lower memory requirements without meaningful quality loss.
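The memory math behind that recommendation is simple: BF16 stores two bytes per parameter and FP8 one, so at 744 billion parameters the weights alone look like this (back-of-envelope figures, before activations, KV cache, or framework overhead):

```python
params = 744e9  # 744 billion parameters

bf16_gb = params * 2 / 1e9   # two bytes per weight in BF16
fp8_gb = params * 1 / 1e9    # one byte per weight in FP8

print(f"BF16 weights: ~{bf16_gb:.0f} GB, FP8 weights: ~{fp8_gb:.0f} GB")
```

Roughly 1,488 GB in BF16 against 744 GB in FP8. Either way this is multi GPU territory, which is why the FP8 build is the sensible default for self hosting.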

For local deployment it supports vLLM, SGLang, xLLM, and KTransformers. SGLang and vLLM both have Docker images ready to pull.

Full deployment instructions are in the official GitHub repository. This is not a casual local install though. 744 billion parameters needs serious hardware; we will get to that in a moment.

If you want to use it inside Claude Code, it works. Set the model name to GLM-5.1 in your settings file and it connects. Same for OpenCode, Roo Code, Cline, and most other popular coding agents.

If self hosting is not what you want, ZhipuAI has it available via their API at api.z.ai. If you are on their GLM Coding Plan it is rolling out now, though during peak hours it bills at 3x quota. Off peak hours bill at 1x through the end of April as a promotional rate.


Who it is for and limitations

If you are building agentic workflows, running long coding tasks, or need a model you can point at a problem and leave running, GLM-5.1 is worth serious attention. The MIT license means you can build commercially without any conversation about terms.

The hardware reality is blunt though. 744B parameters in BF16 needs multiple high end GPUs to run properly. This is not something you spin up on a single consumer card. The FP8 version helps but you are still looking at significant infrastructure. If you are an individual developer without access to a multi GPU setup, the API is your realistic path in.

The long horizon behavior we covered earlier is also self reported. ZhipuAI designed and ran those tests. Until independent researchers reproduce the 600 iteration results under different conditions, treat it as a strong signal rather than a guaranteed property of the model.

Raw reasoning is the other honest gap. If your work depends on deep multi step reasoning rather than coding and agentic tasks, the closed frontier models still lead.


Not for everyone, but hard to ignore

744 billion parameters with an MIT license that actually means what it says. Competitive with GPT-5.4 and Gemini on coding tasks. An ability to keep improving on problems that would make every other model give up.

The hardware barrier is real. This is not something most developers will self host. But the API is there, Claude Code support is there, and the license means nobody is going to pull the rug on you later.

If you are building agents that need to work hard on complex tasks over long sessions, this is the most interesting open source release in a while because it changes what you expect a model to still be doing two hours into a task.

Run it if you can. Watch it if you cannot.


