
GLM 5.1: The open source model that gets better the longer you run it


Give an AI agent a hard problem and it usually figures out the easy wins fast. After that, more time does not help. It just sits there, trying the same things.

ZhipuAI ran GLM-5.1 on a vector database optimization problem and let it go for 600 iterations. It did not run out of ideas. At iteration 50 it was sitting at roughly the same performance as the best single-session result any model had achieved. By iteration 600 it had reached 21,500 queries per second. The previous best was 3,547.

That gap is not incremental improvement. It is a different category of result. GLM-5.1 is open source, MIT licensed, and the weights are on HuggingFace right now. It works with Claude Code, vLLM, and SGLang. If you are building anything that runs agents over long tasks, this one is worth understanding.

What changed from GLM-5

GLM-5.1 is ZhipuAI’s latest open source model, released yesterday under the MIT license. It is a 744 billion parameter mixture of experts architecture with 40 billion parameters active per token.

It is built specifically for agentic engineering, meaning long running tasks where the model needs to write code, run tests, read results, and decide what to try next without someone holding its hand. ZhipuAI describes it as a direct upgrade to GLM-5, with the main improvement being how it handles tasks that take hundreds of turns rather than just a few dozen.

The MIT license is worth pausing on. No commercial restrictions, no usage caveats, no community license that sounds open but has carve outs buried in the terms. You can use it, build on it, and ship with it.

When 50 turns is just the beginning

ZhipuAI set up a coding challenge where the model had to build a high performance vector database in Rust, optimizing for queries per second while keeping recall above 95%. The previous best result from any model in a standard 50 turn session was 3,547 QPS. Claude Opus 4.6 held that record.

GLM-5.1 did not just beat it. It kept going. After 600 iterations and over 6,000 tool calls it reached 21,500 QPS. Six times the previous best.

What is interesting is how it got there. It did not just grind the same approach harder. Around iteration 90 it switched from scanning the full corpus to cluster based search with compressed vectors, jumping to 6,400 QPS. Around iteration 240 it introduced a two stage pipeline and hit 13,400. Six distinct strategic shifts over the full run, each one identified by the model after reading its own benchmark logs.
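The cluster-based search the model landed on around iteration 90 is essentially the inverted-file (IVF) idea from approximate nearest neighbor search: partition the corpus into clusters up front, then scan only the few clusters closest to the query instead of everything. A minimal pure-Python sketch of the idea (our illustration, not ZhipuAI's code; the actual challenge was in Rust):

```python
import random

def l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(vectors, centroids):
    """Put each vector in the bucket of its nearest centroid."""
    buckets = [[] for _ in centroids]
    for v in vectors:
        nearest = min(range(len(centroids)), key=lambda i: l2(v, centroids[i]))
        buckets[nearest].append(v)
    return buckets

def build_clusters(vectors, k, iters=5):
    """Crude k-means: refine k centroids, return them with final buckets."""
    centroids = random.sample(vectors, k)
    for _ in range(iters):
        buckets = assign(vectors, centroids)
        centroids = [
            [sum(col) / len(b) for col in zip(*b)] if b else centroids[i]
            for i, b in enumerate(buckets)
        ]
    return centroids, assign(vectors, centroids)

def search(query, centroids, buckets, nprobe, topk):
    """Scan only the nprobe closest clusters instead of the full corpus."""
    order = sorted(range(len(centroids)), key=lambda i: l2(query, centroids[i]))
    candidates = [v for i in order[:nprobe] for v in buckets[i]]
    return sorted(candidates, key=lambda v: l2(query, v))[:topk]
```

With nprobe smaller than k, this trades a little recall for a large drop in distance computations per query, which is exactly the QPS-versus-recall tradeoff the benchmark was measuring.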

That last part matters. It was not following instructions to try something new. It looked at its own results, decided the current approach had run out of road, and changed direction. Twice the recall dropped below the 95% threshold while it was exploring a new approach, then it corrected and pushed further.
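The 95% constraint the model kept bumping into is recall against exact search: of the true nearest neighbors, what fraction did the approximate index actually return? A minimal sketch of the metric (our illustration, not the benchmark's own harness):

```python
def recall_at_k(retrieved_ids, true_ids, k):
    """Fraction of the true top-k neighbors present in the retrieved top-k."""
    return len(set(retrieved_ids[:k]) & set(true_ids[:k])) / k

# A result set that misses one of the ten true neighbors:
score = recall_at_k([3, 7, 1, 9, 4, 2, 8, 6, 5, 99], list(range(10)), k=10)
print(score)          # 0.9 -- would fail a 0.95 gate
print(score >= 0.95)  # False
```

Any speedup that pushes this number below the threshold does not count, which is why dipping under 95% mid-exploration and then recovering is the notable behavior.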

No other model in the comparison sustained that kind of improvement over that many turns. That is the actual claim here, and it comes from ZhipuAI’s own testing so treat it as a strong signal rather than settled fact until someone runs independent evals.

How it stacks up against the big names

GLM-5.1 is not the best model on every benchmark and we are not going to pretend it is. But where it competes is interesting.

On SWE-Bench Pro, which tests real software engineering tasks, it scores 58.4. GPT-5.4 scores 57.7. Gemini 3.1 Pro scores 54.2. For an open source MIT licensed model to sit above both of those on a coding benchmark is not a small thing.

CyberGym tells a similar story. GLM-5.1 scores 68.7 against Claude Opus 4.6’s 66.6 and GPT-5.4’s 66.3. These are cybersecurity tasks requiring real technical depth, not pattern matching.

Where it trails is raw reasoning. On HLE, which is essentially an extremely hard general knowledge and reasoning test, Gemini 3.1 Pro scores 45, GPT-5.4 scores 39.8, and GLM-5.1 scores 31. That gap is real. If your use case lives in deep reasoning or graduate level problem solving, the closed models still perform better.

Based on these benchmarks, GLM-5.1 is genuinely competitive with the best closed models available right now for agentic coding tasks. For broad reasoning it is not. Knowing which category your work falls into tells you whether this model is relevant to you.

How to run it

GLM-5.1 is on HuggingFace right now in both BF16 and FP8. The FP8 version is the practical one for most people: lower memory requirements without meaningful quality loss.
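Some back-of-envelope arithmetic shows why FP8 is the practical choice. These numbers cover weights alone and exclude KV cache, activations, and framework overhead:

```python
PARAMS = 744e9  # total parameters; MoE means all experts sit in memory even
                # though only ~40B are active per token

bf16_gb = PARAMS * 2 / 1e9  # BF16: 2 bytes per parameter
fp8_gb = PARAMS * 1 / 1e9   # FP8: 1 byte per parameter

print(f"BF16 weights: ~{bf16_gb:,.0f} GB")  # ~1,488 GB
print(f"FP8 weights:  ~{fp8_gb:,.0f} GB")   # ~744 GB
```

Even at FP8 you are looking at roughly 744 GB of weights before any runtime overhead, which is multi-GPU territory either way.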

For local deployment it supports vLLM, SGLang, xLLM, and KTransformers. SGLang and vLLM both have Docker images ready to pull.

Full deployment instructions are in the official GitHub repository. This is not a casual local install though. A 744 billion parameter model needs serious hardware; we will get to that in a moment.

If you want to use it inside Claude Code, it works. Set the model name to GLM-5.1 in your settings file and it connects. Same for OpenCode, Roo Code, Cline, and most other popular coding agents.
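As a sketch, pointing Claude Code at a third-party backend usually means overriding the base URL, auth token, and model via environment settings. The exact keys and endpoint path below are assumptions drawn from common custom-endpoint setups, so verify them against your Claude Code version and ZhipuAI's documentation:

```json
{
  "env": {
    "ANTHROPIC_BASE_URL": "https://api.z.ai/api/anthropic",
    "ANTHROPIC_AUTH_TOKEN": "your-z-ai-api-key",
    "ANTHROPIC_MODEL": "glm-5.1"
  }
}
```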

If self hosting is not what you want, ZhipuAI has it available via their API at api.z.ai. If you are on their GLM Coding Plan it is rolling out now, though during peak hours it bills at 3x quota. Off peak is 1x through the end of April as a promotional rate.
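If you go the API route, a stdlib-only call might look like the sketch below. The endpoint path, payload shape, and model id are assumptions based on common OpenAI-style chat completion APIs, not confirmed details; check ZhipuAI's docs at api.z.ai for the real values:

```python
import json
import urllib.request

# ASSUMPTION: OpenAI-style path and payload -- verify against ZhipuAI's docs.
API_URL = "https://api.z.ai/v1/chat/completions"

def build_request(prompt, model="glm-5.1", api_key="YOUR_API_KEY"):
    """Build a chat completion HTTP request without sending it."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )

if __name__ == "__main__":
    with urllib.request.urlopen(build_request("Say hello")) as resp:
        print(json.load(resp)["choices"][0]["message"]["content"])
```

Separating request construction from sending keeps the payload easy to inspect before you spend quota on it.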


Who it is for & limitations

If you are building agentic workflows, running long coding tasks, or need a model you can point at a problem and leave running, GLM-5.1 is worth serious attention. The MIT license means you can build commercially without any conversation about terms.

The hardware reality is blunt though. 744B parameters in BF16 needs multiple high end GPUs to run properly. This is not something you spin up on a single consumer card. The FP8 version helps but you are still looking at significant infrastructure. If you are an individual developer without access to a multi GPU setup, the API is your realistic path in.

The long horizon behavior we covered earlier is also self reported. ZhipuAI designed and ran those tests. Until independent researchers reproduce the 600 iteration results under different conditions, treat it as a strong signal rather than a guaranteed property of the model.

Raw reasoning is the other honest gap. If your work depends on deep multi step reasoning rather than coding and agentic tasks, the closed frontier models still lead.


Not for everyone, but hard to ignore

744 billion parameters with an MIT license that actually means what it says. Competitive with GPT-5.4 and Gemini on coding tasks. An ability to keep improving on problems that would make every other model give up.

The hardware barrier is real. This is not something most developers will self host. But the API is there, Claude Code support is there, and the license means nobody is going to pull the rug on you later.

If you are building agents that need to work hard on complex tasks over long sessions, this is the most interesting open source release in a while because it changes what you expect a model to still be doing two hours into a task.

Run it if you can. Watch it if you cannot.
