back to top
HomeTechGranite 4.1: IBM's 8B Model Is Competing With Models Four Times Its...

Granite 4.1: IBM’s 8B Model Is Competing With Models Four Times Its Size

- Advertisement -

IBM just released Granite 4.1, a family of open source language models built specifically for enterprise use. Three sizes, Apache 2.0 licensed, trained on 15 trillion tokens with a level of pipeline obsession that’s worth understanding.

One result in the benchmarks doesn’t make sense until you understand how they built it.

The 8B model. Dense architecture, no MoE tricks, no extended reasoning chains. It matches or beats Granite 4.0-H-Small across basically every benchmark they ran. That older model has 32 billion parameters with 9 billion active. This one has 8 billion.

That’s either very impressive or it means the old model was underbuilt. Probably both.

Here’s how they built it, what the numbers actually say, and whether any of it matters for your use case.

The result that makes you do a double take

On ArenaHard, a benchmark where models are judged by GPT-4 on how well they handle 500 challenging real-world prompts, it’s one of the better proxies for actual chat quality. The 8B instruct scores 69.0 there. The previous generation Granite 4.0-H-Small, a 32B MoE model with 9B active parameters, scored lower. On BFCL V3, the standard tool calling benchmark. The 8B scores 68.3, the 32B MoE scores 64.7. GSM8K is grade-school math reasoning, and the 8B hits 92.5 there too. Across AlpacaEval, MMLU-Pro, BBH, EvalPlus, MBPP. same thing throughout.

A denser, simpler, smaller model is winning. Consistently.

It actually means IBM got significantly better at training between generations. The 4.0-H-Small wasn’t badly built, it was the best they had at the time. The 4.1 8B is what happens when you spend the intervening period obsessing over data quality instead of just scaling parameters. That’s the thread running through everything about how Granite 4.1 was built.

Three sizes, one obsession: how they actually built this

Granite 4.1 comes in 3B, 8B, and 30B. All three use the same decoder-only dense transformer design, the same training pipeline and same data strategy. The only difference between them is size. No MoE routing, sparse layers or extended reasoning chains that inflate token counts. What you send in is what gets processed, predictably, every time.

Models that lean on long reasoning traces are harder to cost-predict and harder to latency-budget. Granite 4.1 skips all of that by design. But the architecture isn’t really the story. The story is the 15 trillion tokens they trained on and how carefully they handled them.

IBM ran five distinct training phases with different data mixtures, different learning rate schedules, and different goals. Phase 1 is broad: CommonCrawl at 59%, code at 20%, math at 7%. By Phase 2, math has jumped to 35% and code to 30%. By Phases 3 and 4, they’re blending in chain-of-thought reasoning trajectories and instruction data alongside the highest-quality web content they have. Phase 5 extends the context window, eventually to 512K tokens for the 8B and 30B.

Most teams pick a data mix and stick with it. IBM changed theirs four times with clear intent each time.

You May Like: Laguna XS.2 Feels Like a Model That Was Never Meant to Be Public. It Now Is.

The filter that rejected bad data before it could do damage

IBM spent enough time on their data quality pipeline that it deserves its own explanation.

After pre-training, they needed to turn the base model into something that actually follows instructions reliably. That requires fine-tuning on examples of good behavior but bad examples in that dataset don’t just get ignored. They get learned. A hallucinated answer, a response that ignores the instruction, a calculation that’s wrong but confident, the model treats all of it as signal.

So IBM built a filtering system before a single fine-tuning sample touched the model. An LLM-as-Judge evaluated every assistant response across six dimensions including instruction following, correctness, completeness, conciseness, naturalness, and calibration. Each response got scored, and samples that fell below threshold got cut. But some things triggered automatic rejection regardless of score, hallucinations, false premises, incorrect computations. No partial credit for those.

The judge wasn’t reading prompts or user inputs in isolation. It was evaluating what the model said given the full context it had access to. In RAG settings, if the response wasn’t grounded in the retrieved documents, that counted as a hallucination. In tool-calling scenarios, outputs were checked against the allowed tools and their parameter schemas.

On top of that, a separate rule-based pipeline checked structure like length, formatting, schema validation, deduplication across the entire dataset. Everything was logged and auditable.

What came out the other side was 4.1 million samples. That sounds like a lot. For context, it’s a deliberately curated 4.1 million.

Four rounds of RL and why they needed all of them

This is the part of the Granite 4.1 paper that I find most interesting, mostly because it’s honest about something going wrong mid-training and how they fixed it.

After fine-tuning, IBM ran reinforcement learning in four sequential stages. The first stage trained the model jointly across nine domains at once including math, science, logical reasoning, instruction following, structured output, text-to-SQL, temporal reasoning, general chat, and in-context learning. The reason for doing all of them together is that joint training prevents the model from forgetting earlier domains as it gets better at later ones. Every gradient update sees the full range of tasks.

Stage two was RLHF training on general chat prompts using a reward model to improve helpfulness. This worked. AlpacaEval scores jumped around 18.9 points on average compared to the fine-tuned checkpoints.

Then something broke. The RLHF stage, while improving chat quality, caused math benchmark scores to drop. GSM8K and DeepMind-Math both regressed.

Stage three was a short identity and knowledge calibration run about 40 training steps to stabilize how the model represents itself and what it knows. Small stage, measurable improvement on self-identification.

Stage four was a dedicated math RL run specifically to recover what RLHF had damaged. It worked, GSM8K recovered and surpassed the fine-tuned baseline by around 3.8 points on average. DeepMind-Math recovered by around 23.5 points on average.

You May Like: Open-Source TTS Models That Can Clone Voices and Actually Sound Human

The benchmarks

Granite 4.1 AI models benchmarks
via: huggingface/ibm-granite/granite-4-1
BenchmarkWhat it tests3B8B30B
IFEvalInstruction following82.187.189.7
BFCL V3Tool calling60.868.373.7
GSM8KMath reasoning87.092.594.2
DeepMind-MathAdvanced math64.680.181.9
EvalPlusCoding67.180.282.7
ArenaHardReal-world chat quality37.869.071.0
MMLU-ProGeneral knowledge49.856.064.1

The 30B sits at the top of IBM’s own BFCL V3 tool calling chart at 73.7, ahead of Gemma-4-31B at 72.7. That’s a legitimate leaderboard result, not a cherry-picked internal comparison. The 8B at 68.3 beats the previous Granite 4.0-H-Small at 64.7, and the 3B at 60.8 still clears Qwen3-8B at 60.2, a model twice its size.

On instruction following via IFEval, Gemma leads at 94.1 and that’s worth saying plainly. But the 8B at 87.1 is essentially tied with Qwen3.5-9B at 87.2, and the 30B at 89.7 beats every Qwen model on the chart regardless of size.

On math, the 8B hits 92.5 on GSM8K and 80.1 on DeepMind-Math. The 30B pushes those to 94.2 and 81.9. On coding, EvalPlus puts the 8B at 80.2 and the 30B at 82.7. MBPP+ scores 70.6 and 71.7 respectively.

The 3B is the quiet story here. 82.1 on IFEval, 87.0 on GSM8K, 60.8 on BFCL V3. For something running at that parameter count, those numbers are hard to ignore if you’re thinking about edge deployment or cost-constrained inference.

One honest caveat across all of this, the comparison charts are IBM’s own, using their own evaluation harness. The absolute numbers are plausible and consistent with what third parties have reported, but benchmark methodology always deserves scrutiny. These are self-reported results.

512K context: how they got there without breaking short-context

Getting a model to handle 512K tokens is one problem. Getting it to handle 512K tokens without forgetting how to handle 4K tokens is a different and harder problem.

IBM solved it with a staged extension approach inside Phase 5 of pre-training. They didn’t jump straight to 512K. They went 32K first, then 128K, then 512K. Each stage used the same data mix as Phase 4 until the final extension, where they switched to 80% books and 20% code repository data for the 8B and 30B models specifically. Books and long code repositories are natural long-context data they have coherent structure across tens of thousands of tokens in a way that web data doesn’t.

After each extension stage, IBM did a model merge. This is the part that protects short-context performance. By merging the long-context checkpoint back with earlier weights rather than just continuing to train, they preserved the behaviors the model had already learned at shorter lengths.

The RULER benchmark which tests whether long-context capability is real or just superficially present shows the 8B base scoring 83.6 at 32K, 79.1 at 64K, and 73.0 at 128K. The 30B holds up better: 85.2, 84.6, and 76.7. There’s degradation as context grows, which is expected and honest, but the scores don’t fall off a cliff.

The 3B only extends to 128K, not 512K. Worth knowing if long context is a hard requirement for your use case.

You may Like: OpenMythos: The Closest Thing to Claude Mythos You Can Run (And It’s Open Source)

How to run it?

The quickest way in is Ollama. Pull whichever size fits your hardware, the 3B runs comfortably on most consumer machines, the 8B needs a bit more headroom, and the 30B is a GPU machine job. All three are on Hugging Face under ibm-granite if you want to go that route instead.

For production use, vLLM and Transformers both support the models out of the box. If you want to evaluate before committing to any local infrastructure, IBM has the models available through their API as well.

The FP8 quantized variants are worth trying if memory is a constraint, roughly half the footprint of the full precision versions with most of the performance intact.

Apache 2.0 across the board, so commercial use is clean.

Who should care

If you’re building something that needs reliable tool calling, predictable latency, and a license that won’t create legal headaches down the line, Granite 4.1 deserves a serious look. The 8B is the sweet spot, genuinely competitive with models that cost more to run, and honest enough in its benchmarks that you’re not walking into surprises at deployment.

The 3B is interesting for anyone thinking about edge use cases or tight inference budgets. The 30B is for when you need the ceiling and have the hardware to match.

What IBM built here is a production-first model family from a team that clearly spent more time fixing problems than announcing them. The four-stage RL pipeline that caught and corrected a mid-training regression is the kind of detail that doesn’t make headlines but absolutely shows up in real-world reliability.

Don’t miss any Tech Story

Subscribe To Firethering NewsLetter

You Can Unsubscribe Anytime! Read more in our privacy policy

YOU MAY ALSO LIKE
OpenAI Built Its First AI Chip. It's Not Trying to Replace NVIDIA

OpenAI Built Its First AI Chip. It’s Not Trying to Replace NVIDIA.

0
When the news broke that OpenAI had built a custom chip, the instinct was to frame it as a NVIDIA story. Another lab trying to cut the cord, reduce dependence on H100s, claw back some margin from the company that's been printing money off the AI boom. That's not quite what's happening here. The chip is called Jalapeño, built with Broadcom, and it doesn't touch training at all. It's an inference chip, meaning it only runs models after they're already built, when a user sends a message and ChatGPT has to respond. The compute-heavy work of actually training those models still runs on NVIDIA hardware. OpenAI isn't replacing NVIDIA. It's going after a different part of the problem entirely, the part that happens millions of times a day, every time someone uses one of their products. That distinction matters because inference is where AI costs actually accumulate at scale. Training happens once per model. Inference never stops.
glm 5.2 ai open weights

GLM-5.2 Is the Closest an Open Model Has Come to Claude

0
What does it take for an open-weight model to stop chasing Claude and actually beat it? Every open-weight release for two years has told some version of the same story: closer, but not quite. The chart shrinks, the wording softens to "competitive with," and the conversation moves on until the next model repeats the cycle. GLM-5.2 breaks that pattern. The model is built to survive long, messy coding work, the kind that runs for hours without losing the thread. That's the pitch its maker is leading with. But scroll down their own benchmark table and something else is sitting there quietly: on a couple of standard math evals, this open model isn't approaching Claude Opus 4.8, GPT-5.5, or Gemini 3.1 Pro. It's beating all three, on the same table. It loses plenty of ground elsewhere, and that part matters just as much as the wins. But a model anyone can download under an MIT license, with no usage restrictions attached, coming out ahead of the lab everyone else measures themselves against, is worth pausing on before getting to what the rest of the numbers actually say.
Open-Source AI Tools Worth Trying Right Now

5 Open-Source AI Tools You Probably Haven’t Tried Yet

0
Every week brings another open source AI release, and most of them require setting up a Python environment. Find out the model card lied about VRAM requirements. By the time something actually runs, the appeal has mostly worn off. The five tools below skip most of that. One turns image and video generation into something closer to a desktop app. One gives DeepSeek an actual workspace instead of a browser tab. One builds UI prototypes using coding agents you probably already have installed. One quietly builds a memory system out of your own apps. And one is, literally, a desktop pet.