back to top
HomeTechMiniCPM5-1B Shows Why the Small-Model Race Isn't Over

MiniCPM5-1B Shows Why the Small-Model Race Isn’t Over

- Advertisement -

A 1B model scoring 40.42 on AIME 2025 should not be possible. AIME is the American Invitational Mathematics Examination, the kind of test that filters out most humans who attempt it. Qwen3-0.6B scores 16.25 on the same benchmark. LFM2.5-1.2B, a larger model, scores 31.88. MiniCPM5-1B, at roughly one billion parameters, beats both.

OpenBMB dropped MiniCPM5-1B, the first model in their MiniCPM5 series, and it’s built specifically for the scenarios like on-device deployment, resource-constrained environments, local inference on consumer hardware.

The AIME score is surprising. The telecom agent benchmark is even more surprising. And then there’s the desktop pet. We’ll get to that.

One checkpoint, two personalities

With most models you get the fast version or the reasoning version. MiniCPM5-1B gives you both from the same weights.

Switch enable_thinking to True and the model enters deliberate reasoning mode, working through problems step by step before answering. Switch it off and you get a fast conversational assistant.

Running two separate models on constrained hardware is often not practical. Having one model that adapts to the task without requiring you to manage multiple downloads and contexts is a real convenience.

The recommended settings reflect the difference in what each mode is doing. Think mode runs at temperature 0.9 with top_p 0.95 to give the reasoning process room to explore. No Think mode drops to 0.7 for more predictable, focused responses. Both are sensible defaults and both can be adjusted through standard inference parameters.

The training recipe that explains the numbers

OpenBMB published the full training stack and it’s worth understanding because it explains why MiniCPM5-1B outperforms models of similar or larger size on specific tasks.

The post-training process runs in three stages: supervised fine-tuning, reinforcement learning, and On-Policy Distillation. The SFT stage used 400B tokens total, split between deep-thinking and hybrid-thinking data, establishing the baseline reasoning and chat capabilities. Then RL trained specialized teachers for math, code, closed-book QA, writing, and instruction following separately. OPD then distilled those teachers back into the single release model.

The result of RL plus OPD combined is a 16 point average score improvement over the SFT-only checkpoint, alongside a 29 percentage point drop in responses that hit the maximum token budget. That second number matters as much as the first. A model that reasons efficiently and stops when it has an answer is more useful in production than one that pads responses until it runs out of context. Overlong responses waste compute, slow inference, and often indicate the model is uncertain rather than thorough.

The training data is also fully released alongside the model as Ultra-FineWeb, Ultra-FineWeb-L3, UltraData-Math, and UltraData-SFT-2605 for anyone who wants to study or reproduce the recipe.

Benchmarks

minicpm5 1b evaluation results
via: MiniCPM5 1B HF

The benchmark that stands out most isn’t AIME. It’s Ď„2-Bench Telecom at 79.53.

Ď„2-Bench tests multi-step agentic task completion in realistic enterprise environments, specifically telecom workflows which are notoriously complex and procedural. Qwen3-0.6B scores 21.10 on the same benchmark. LFM2.5-1.2B scores 19.60. MiniCPM5-1B nearly quadruples both of them. That gap is large enough that it’s worth treating as a genuine capability difference rather than benchmark noise.

Here’s a tight comparison across the benchmarks that tell the story:

BenchmarkMiniCPM5-1B (Thinking)Qwen3-0.6BQwen3.5-0.8BLFM2.5-1.2B
AIME 202540.4216.251.0431.88
AIME 202640.4212.290.2131.67
MATH-50091.6072.6030.4089.00
τ2-Bench Telecom79.5321.1047.7019.60
IFEval80.4159.8959.8984.84
BBH71.8947.8654.5857.32
LCB-v633.5216.005.3321.33

All figures from OpenBMB’s evaluation. Self-reported benchmarks should be read with that caveat.

The coding numbers deserves to be noted as well. LCB-Pro Easy at 22.68 against Qwen3-0.6B’s 4.12 and Qwen3.5-0.8B’s flat zero. OJBench at 7.33 against both Qwen models below 1. These are competitive programming benchmarks and the gap at 1B scale is unusually wide.

GPQA-Diamond at 26.26 trails LFM2.5-1.2B’s 34.85, which is the clearest benchmark where a larger model pulls ahead on difficult scientific reasoning.

Related: MiniCPM-V 4.6: The 1.3B Model Running on Your Phone That Challenges Much Larger Rivals

How to run it, and about that desktop pet

MiniCPM5-1B uses standard LlamaForCausalLM architecture which means mainstream inference engines load it directly without custom kernels or model-code forks. Ollama, vLLM, SGLang, llama.cpp, LM Studio, and MLX for Apple Silicon all work out of the box. For tool calling specifically, SGLang is the recommended backend since it handles MiniCPM5’s XML-style tool calls natively through a built-in parser.

The model is on Hugging Face under Apache 2.0, which is as permissive as licenses get.

Now about the desktop pet. OpenBMB ships MiniCPM-Desk-Pet alongside the model, a local desktop companion driven entirely by MiniCPM5-1B running on your own hardware. It supports Apple Silicon, NVIDIA GPU, and CPU paths, integrates with coding agents like Cursor and Claude Code, and supports LoRA persona switching. It’s genuinely unusual to see a serious reasoning model release bundled with something this playful and it says something about who OpenBMB is building for. Not just researchers and enterprise teams. People who want capable AI running locally in whatever form makes sense for them.

If you’re evaluating small models for local deployment, agentic tool use, or constrained inference scenarios, MiniCPM5-1B belongs on your shortlist.

Don’t miss any Tech Story

Subscribe To Firethering NewsLetter

You Can Unsubscribe Anytime! Read more in our privacy policy

LEAVE A REPLY

Please enter your comment!
Please enter your name here

YOU MAY ALSO LIKE
ideogram 4.0 ai model

Ideogram 4 Topped the Open-Weight Leaderboard. Then We Read the License.

0
Ideogram was founded by former Google Brain researchers who worked on Imagen, Google's own text-to-image system. When that team releases an open-weight model, you pay attention. Ideogram 4 tops the open-weight design leaderboard by a margin that isn't close. Professional designers picked it first in blind typography tests nearly half the time. At 9.3B parameters it beats open models three times its size on text rendering. Then we read the license.
Google Built Gemma 4 12B Without Multimodal Encoders

Google Built Gemma 4 12B Without Multimodal Encoders

0
Every multimodal model you've used has the same basic system. Text goes in one way, images go through a vision encoder first, audio goes through an audio encoder first, and then everything gets handed off to the language model in a form it can work with. The encoders are load-bearing and you don't just remove them.Google actually removed them.Gemma 4 12B takes raw image patches and raw audio waveforms and projects them directly into the same embedding space as text tokens. There is no vision encoder or audio encoder. One decoder handling everything.
MiniMax M3 Shows What Happens When AI Stops Thinking in Turns

MiniMax M3 Shows What Happens When AI Stops Thinking in Turns

0
Most models quit around submission 30 because they stop finding improvement and exit on their own. That's what happened when MiniMax ran a CUDA kernel optimization task against a field of frontier models. Every model except two called it done within the first 30 submissions. M3's best result came on submission 145. After 24 hours. After multiple plateaus where the numbers stopped moving and a reasonable model would have concluded there was nothing left to find. That's the thing MiniMax released yesterday. An AI model with a 1M token context window, native multimodality, and apparently a problem with knowing when to stop.