Give an AI agent a hard problem and it usually figures out the easy wins fast. After that, more time does not help. It just sits there, trying the same things.
ZhipuAI ran GLM-5.1 on a vector database optimization problem and let it go for 600 iterations. It did not run out of ideas. At iteration 50 it was sitting at roughly the same performance as the best single-session result any model had achieved. By iteration 600 it had reached 21,500 queries per second. The previous best was 3,547.
That gap is not incremental improvement. It is a different category of result. GLM-5.1 is open source, MIT licensed, and the weights are on HuggingFace right now. It works with Claude Code, vLLM, and SGLang. If you are building anything that runs agents over long tasks, this one is worth understanding.
What changed from GLM-5
GLM-5.1 is ZhipuAI’s latest open source model, released yesterday under the MIT license. It is a 744 billion parameter mixture of experts model with 40 billion parameters active per token.
It is built specifically for agentic engineering, meaning long running tasks where the model needs to write code, run tests, read results, and decide what to try next without someone holding its hand. ZhipuAI describes it as a direct upgrade to GLM-5, with the main improvement being how it handles tasks that take hundreds of turns rather than just a few dozen.
The MIT license is worth pausing on. No commercial restrictions, no usage caveats, no community license that sounds open but has carve outs buried in the terms. You can use it, build on it, and ship with it.
When 50 turns is just the beginning
ZhipuAI set up a coding challenge where the model had to build a high performance vector database in Rust, optimizing for queries per second while keeping recall above 95%. The previous best result from any model in a standard 50 turn session was 3,547 QPS. Claude Opus 4.6 held that record.
GLM-5.1 did not just beat it. It kept going. After 600 iterations and over 6,000 tool calls it reached 21,500 QPS. Six times the previous best.
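The recall constraint is worth making concrete: it is the standard recall@k measure, the fraction of the true nearest neighbors that the engine actually returns. Here is a minimal sketch of the metric; the function name and example ids are ours, not from the benchmark.

```python
def recall_at_k(retrieved, ground_truth):
    """Fraction of the true top-k neighbors present in the returned ids."""
    return len(set(retrieved) & set(ground_truth)) / len(ground_truth)

# 9 of the 10 true neighbors found: recall 0.9, which would fail a 95% cutoff
print(recall_at_k([0, 1, 2, 3, 4, 5, 6, 7, 8, 42], list(range(10))))  # 0.9
```

Any speed trick that drops this number below 0.95 does not count, which is what makes the QPS figures comparable across runs.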
What is interesting is how it got there. It did not just grind the same approach harder. Around iteration 90 it switched from scanning the full corpus to cluster based search with compressed vectors, jumping to 6,400 QPS. Around iteration 240 it introduced a two stage pipeline and hit 13,400. Six distinct strategic shifts over the full run, each one identified by the model after reading its own benchmark logs.
That last part matters. It was not following instructions to try something new. It looked at its own results, decided the current approach had run out of road, and changed direction. Twice the recall dropped below the 95% threshold while it was exploring a new approach, then it corrected and pushed further.
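For intuition about what those strategy shifts mean, here is a toy sketch of the two techniques described: cluster based search (probe only the clusters nearest the query, scoring with compressed int8 vectors) followed by a second stage that reranks a shortlist at full precision. This is an illustrative Python/NumPy sketch under our own assumptions, not the model's actual Rust implementation; all names and parameters are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
corpus = rng.standard_normal((2000, 64)).astype(np.float32)

# Build coarse clusters with a few k-means iterations.
k = 16
centroids = corpus[rng.choice(len(corpus), size=k, replace=False)].copy()
for _ in range(5):
    assign = np.argmin(((corpus[:, None, :] - centroids[None]) ** 2).sum(-1), axis=1)
    for c in range(k):
        members = corpus[assign == c]
        if len(members):
            centroids[c] = members.mean(axis=0)
# Final assignment against the settled centroids.
assign = np.argmin(((corpus[:, None, :] - centroids[None]) ** 2).sum(-1), axis=1)

# "Compressed" vectors: crude int8 quantization for the cheap first pass.
scale = np.abs(corpus).max() / 127.0
corpus_q = np.round(corpus / scale).astype(np.int8)

def search(query, n_probe=4, shortlist=50, top_k=10):
    # Stage 1: probe only the nearest clusters, score with int8 vectors.
    probed = np.argsort(((centroids - query) ** 2).sum(-1))[:n_probe]
    cand = np.where(np.isin(assign, probed))[0]
    q_quant = np.round(query / scale).astype(np.float32)
    coarse = ((corpus_q[cand].astype(np.float32) - q_quant) ** 2).sum(-1)
    cand = cand[np.argsort(coarse)[:shortlist]]
    # Stage 2: rerank the shortlist with full-precision distances.
    exact = ((corpus[cand] - query) ** 2).sum(-1)
    return cand[np.argsort(exact)[:top_k]]

print(search(corpus[0], n_probe=1)[0])  # the query's own index: 0
```

The trade the model was navigating lives in `n_probe` and `shortlist`: probe fewer clusters and you scan less data per query (higher QPS), but the true neighbors may sit in a cluster you skipped (lower recall). That is exactly the kind of knob an agent can tune by reading its own benchmark logs.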
No other model in the comparison sustained that kind of improvement over that many turns. That is the actual claim here, and it comes from ZhipuAI’s own testing so treat it as a strong signal rather than settled fact until someone runs independent evals.
How it stacks up against the big names
GLM-5.1 is not the best model on every benchmark and we are not going to pretend it is. But where it competes is interesting.
On SWE-Bench Pro, which tests real software engineering tasks, it scores 58.4. GPT-5.4 scores 57.7. Gemini 3.1 Pro scores 54.2. For an open source MIT licensed model to sit above both of those on a coding benchmark is not a small thing.
CyberGym tells a similar story. GLM-5.1 scores 68.7 against Claude Opus 4.6’s 66.6 and GPT-5.4’s 66.3. These are cybersecurity tasks requiring real technical depth, not pattern matching.
Where it trails is raw reasoning. On HLE, which is essentially an extremely hard general knowledge and reasoning test, Gemini 3.1 Pro scores 45, GPT-5.4 scores 39.8, and GLM-5.1 scores 31. That gap is real. If your use case lives in deep reasoning or graduate level problem solving, the closed models still perform better.
Based on this benchmark data, the split is clear. For agentic coding tasks GLM-5.1 is genuinely competitive with the best closed models available right now. For broad reasoning it is not. Knowing which category your work falls into tells you whether this model is relevant to you.
How to run it
GLM-5.1 is on HuggingFace right now in both BF16 and FP8. The FP8 version is the practical one for most people, lower memory requirements without meaningful quality loss.
For local deployment it supports vLLM, SGLang, xLLM, and KTransformers. SGLang and vLLM both have Docker images ready to pull.
Full deployment instructions are in the official GitHub repository. This is not a casual local install though; 744 billion parameters needs serious hardware, and we will get to that in a moment.
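Assuming the checkpoint publishes under a HuggingFace id like `zai-org/GLM-5.1` (an assumption on our part; check the model card for the real id), serving it through vLLM's offline Python API would look roughly like this. Treat it as a deployment sketch, not a tested recipe; the parallelism and quantization settings are placeholders for whatever the official repo recommends.

```python
# Sketch only: needs a multi-GPU node; model id and settings are assumptions.
from vllm import LLM, SamplingParams

llm = LLM(
    model="zai-org/GLM-5.1",   # assumed HuggingFace repo id
    tensor_parallel_size=8,    # shard the experts across 8 GPUs
    quantization="fp8",        # use the lower-memory FP8 checkpoint
)
outputs = llm.generate(
    ["Write a Rust function that deduplicates a sorted Vec<u64> in place."],
    SamplingParams(max_tokens=256),
)
```

The same model id should work with SGLang's launch command; consult the GitHub repo for the exact flags each backend expects.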
If you want to use it inside Claude Code, it works. Set the model name to GLM-5.1 in your settings file and it connects. Same for OpenCode, Roo Code, Cline, and most other popular coding agents.
If self hosting is not what you want, ZhipuAI has it available via their API at api.z.ai. If you are on their GLM Coding Plan it is rolling out now, though during peak hours it bills at 3x quota. Off peak is 1x through the end of April as a promotional rate.
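Calling the API from Python is a standard OpenAI-style chat request. In the sketch below the endpoint path and the model identifier are assumptions; confirm both against the documentation at api.z.ai before relying on them.

```python
# Builds (but does not send) a chat completion request to ZhipuAI's API.
import json
from urllib import request

API_URL = "https://api.z.ai/api/paas/v4/chat/completions"  # assumed path

def build_request(prompt: str, api_key: str) -> request.Request:
    body = json.dumps({
        "model": "glm-5.1",  # assumed model identifier
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return request.Request(API_URL, data=body, headers={
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    })

req = build_request("Profile this Rust hot loop.", "YOUR_API_KEY")
# request.urlopen(req) would send it; we stop short of the network call here.
print(json.loads(req.data)["model"])  # glm-5.1
```

In practice you would more likely point an existing OpenAI-compatible client at the base URL rather than hand-roll requests, but the payload shape is the same either way.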
Who it is for & limitations
If you are building agentic workflows, running long coding tasks, or need a model you can point at a problem and leave running, GLM-5.1 is worth serious attention. The MIT license means you can build commercially without any conversation about terms.
The hardware reality is blunt though. 744B parameters in BF16 needs multiple high end GPUs to run properly. This is not something you spin up on a single consumer card. The FP8 version helps but you are still looking at significant infrastructure. If you are an individual developer without access to a multi GPU setup, the API is your realistic path in.
The long horizon behavior we covered earlier is also self reported. ZhipuAI designed and ran those tests. Until independent researchers reproduce the 600 iteration results under different conditions, treat it as a strong signal rather than a guaranteed property of the model.
Raw reasoning is the other honest gap. If your work depends on deep multi step reasoning rather than coding and agentic tasks, the closed frontier models still lead.
Not for everyone, but hard to ignore
744 billion parameters with an MIT license that actually means what it says. Competitive with GPT-5.4 and Gemini on coding tasks. An ability to keep improving on problems that would make every other model give up.
The hardware barrier is real. This is not something most developers will self host. But the API is there, Claude Code support is there, and the license means nobody is going to pull the rug on you later.
If you are building agents that need to work hard on complex tasks over long sessions, this is the most interesting open source release in a while because it changes what you expect a model to still be doing two hours into a task.
Run it if you can. Watch it if you cannot.




