back to top
HomeTechMiniCPM-V 4.6: The 1.3B Model Running on Your Phone That Challenges Much...

MiniCPM-V 4.6: The 1.3B Model Running on Your Phone That Challenges Much Larger Rivals

- Advertisement -

The assumption around multimodal AI has mostly been the same. if you want serious capability, you need serious hardware. Phones get lighter models and stripped-down features.

MiniCPM-V 4.6 is trying to challenge that idea. It’s a 1.3B parameter multimodal model built to run on phones across iOS, Android, and HarmonyOS, while still handling image understanding, video analysis, OCR, and multi-image reasoning workloads that normally push users toward much larger systems.

The interesting part isn’t just that it runs locally. It’s that the efficiency numbers are unusually strong. MiniCPM-V 4.6 scores 13 on the Artificial Analysis Intelligence Index, ahead of Qwen3.5-0.8B’s score of 10 while using lower token cost.

1.3B parameters doing work that used to need much more

Most small models make the same limitation, shrink the parameters, accept worse performance, hope the use case is simple enough that it doesn’t matter. MiniCPM-V 4.6 is trying to break that limitation.

On vision-language understanding tasks it outperforms Qwen3.5-0.8B across most benchmarks. On OpenCompass, RefCOCO, HallusionBench, MUIRBench, and OCRBench it reaches performance closer to Qwen3.5-2B, a model with significantly more parameters. Surpassing Ministral 3 3B, a model more than twice its size, on the intelligence index is the result that stands out most cleanly.

The efficiency story is equally important. Token throughput runs roughly 1.5x faster than Qwen3.5-0.8B despite stronger benchmark performance. Smaller, faster, and more capable, that combination doesn’t usually happen together and when it does it’s worth paying attention to why.

The compression trick that makes it possible

The visual encoding step is where most multimodal models spend a disproportionate amount of compute. Every image gets converted into tokens the language model can process, and the number of tokens directly affects how expensive each inference is.

MiniCPM-V 4.6 introduces mixed 4x and 16x visual token compression. In 16x mode the model aggressively merges visual tokens, fewer tokens, faster inference, lower memory cost. In 4x mode it keeps more tokens and preserves finer detail for tasks where precision matters, like dense OCR or reading small text in complex images. You switch between them based on what the task actually needs.

The underlying technique comes from LLaVA-UHD v4 and reduces visual encoding computation by more than 50% compared to standard approaches. That reduction is what makes 4GB GPU memory viable and what gives the GGUF variant a 2GB CPU footprint.

It’s a careful engineering decision that compounds, less compute per token, fewer tokens when appropriate, faster throughput as a result.

You May Like: Small But Powerful AI Models You Can Run Locally on Your System

What benchmarks Say

MiniCPM-V 4.6 Benchmarks
via: huggingface.co/MiniCPM-V-4.6

All benchmark comparisons here are against models in the same size class including Qwen3.5-0.8B at 0.9B parameters and Gemma4-E2B at 2.3B plus LFM2.5-VL-1.6B. MiniCPM-V 4.6 runs at 1.3B total.

BenchmarkWhat it testsMiniCPM-V 4.6Qwen3.5-0.8BGemma4-E2B
MMStarGeneral multimodal reasoning68.055.955.7
MathVistaMath with visual context75.558.658.5
OmniDocBenchDocument understanding84.670.647.0
MUIRBenchMulti-image reasoning60.241.846.6
HallusionBenchHallucination resistance58.146.744.1
Video-MMEVideo understanding59.748.956.2

The OmniDocBench gap is the most striking 84.6 against Gemma4-E2B’s 47.0. For anyone working with documents, PDFs, or dense text in images, that’s not a marginal difference.

Hallucination resistance is worth noting separately. HallusionBench at 58.1 with a hallucination rate of 30.6% against Qwen3.5-0.8B’s 41.7% means the model is significantly less likely to make things up about what it sees. For a phone-deployed model handling real world inputs that matters practically.

There’s also a Thinking variant of MiniCPM-V 4.6 aimed at slower, reasoning-heavy tasks. That version pushes benchmarks like MathVista to 75.6, MMMU to 55.3, and HallusionBench to 57.2 while keeping the model at just 1.3B parameters.

The split is simple, standard model is built for fast, efficient multimodal workloads on-device, while the Thinking variant is better suited for tasks involving multi-step reasoning, charts, STEM questions, or complex document analysis.

The limitation, as expected, is latency. You gain reasoning depth, but lose some of the lightweight responsiveness that makes the base model interesting for phones.

Also, these benchmarks are self-reported by the MiniCPM team.

Small Enough to Actually Deploy

Full model needs 4GB GPU memory. The GGUF variant runs on CPU at 2GB. Quantized variants, BNB, AWQ, GPTQ, all sit at 3GB GPU memory.

On the software side MiniCPM-V 4.6 runs natively on iOS, Android, and HarmonyOS with all edge adaptation code open sourced. You can reproduce the on-device experience, customize it, and build on top of it.

For server-side inference it supports vLLM, SGLang, llama.cpp, and Ollama. For fine-tuning on your own domain data, LLaMA-Factory and SWIFT both work out of the box on consumer-grade GPUs.

There’s also a bigger sibling

MiniCPM-o 4.5 is the other model in this release worth knowing about. Its a 9B parameters, end-to-end omnimodal.

It sees, listens, and speaks simultaneously without any of those streams blocking each other. Real-time video input, real-time audio input, text and speech output all running at once. Full-duplex live streaming where the model can interrupt, respond, and proactively comment on what it’s observing.

On vision tasks it approaches Gemini 2.5 Flash. On OCR it surpasses GPT-5 and Gemini 3 Flash on OmniDocBench for end-to-end English document parsing. For voice it supports cloning from a reference audio clip.

It needs 19GB GPU memory at full precision or 10GB via GGUF. A different hardware conversation than V 4.6 but worth knowing the family goes that far.

Limitations

MiniCPM-V 4.6 is impressive for its size, but physics still matters. At 1.3B parameters, there are areas where larger multimodal models will do better.

Complex multi-step reasoning, long-horizon video understanding, deeper agentic workflows, and tasks requiring broad world knowledge are still more comfortable territory for larger systems. The Thinking variant helps narrow that gap especially on math and visual reasoning benchmarks but it doesn’t eliminate it.

There’s also the usual benchmark catch. Most of the results here are self-reported, and while they’re directionally impressive, real-world performance will depend heavily on your workload, latency requirements, and how aggressively you quantize the model.

Still, that almost misses the point. MiniCPM-V 4.6 isn’t trying to beat frontier-scale multimodal systems outright. The important part is how much capability it manages to deliver at this size and on actual devices people own.

Who should actually care

If you’re building mobile applications that need vision-language capability, document scanning, image understanding, visual question answering. MiniCPM-V 4.6 is currently one of the most practical open source options at this size. The edge deployment code being open sourced removes a significant barrier that usually makes on-device AI painful to ship.

If you’re a researcher working on efficient multimodal models, the 4x vs 16x compression switching and the LLaVA-UHD v4 visual encoding technique are worth studying.

If you need something that runs on a server with good throughput, the vLLM and SGLang support makes it straightforward.

Overall this can be your another option for a lightweight open weights AI companion or a model which might align with your usecase.

Don’t miss any Tech Story

Subscribe To Firethering NewsLetter

You Can Unsubscribe Anytime! Read more in our privacy policy

LEAVE A REPLY

Please enter your comment!
Please enter your name here

YOU MAY ALSO LIKE
Google Built Gemma 4 12B Without Multimodal Encoders

Google Built Gemma 4 12B Without Multimodal Encoders

0
Every multimodal model you've used has the same basic system. Text goes in one way, images go through a vision encoder first, audio goes through an audio encoder first, and then everything gets handed off to the language model in a form it can work with. The encoders are load-bearing and you don't just remove them.Google actually removed them.Gemma 4 12B takes raw image patches and raw audio waveforms and projects them directly into the same embedding space as text tokens. There is no vision encoder or audio encoder. One decoder handling everything.
MiniMax M3 Shows What Happens When AI Stops Thinking in Turns

MiniMax M3 Shows What Happens When AI Stops Thinking in Turns

0
Most models quit around submission 30 because they stop finding improvement and exit on their own. That's what happened when MiniMax ran a CUDA kernel optimization task against a field of frontier models. Every model except two called it done within the first 30 submissions. M3's best result came on submission 145. After 24 hours. After multiple plateaus where the numbers stopped moving and a reasonable model would have concluded there was nothing left to find. That's the thing MiniMax released yesterday. An AI model with a 1M token context window, native multimodality, and apparently a problem with knowing when to stop.
Anthropic Files for an IPO. AI Is Entering Its Public Company Era

Anthropic Files for an IPO. AI Is Entering Its Public Company Era.

0
Anthropic has officially taken its first step toward becoming a public company. In a brief announcement on Monday, the company said it had confidentially submitted a draft S-1 registration statement to the U.S. Securities and Exchange Commission for a proposed initial public offering. The filing doesn't reveal a share price, a fundraising target, or even a timeline. For now, it simply gives Anthropic the option to go public once the SEC review process is complete. Just a few years ago, Anthropic was a small group of former OpenAI researchers trying to build an alternative vision for advanced AI. Today, it sits among the handful of companies shaping the industry's future and that's why this filing matters. It's one of the world's most influential AI labs beginning the transition from a privately funded research company to a business that may eventually answer to public shareholders. For most of the AI boom, the biggest bets were made behind closed doors. Venture firms, sovereign wealth funds, and tech giants supplied the capital while the public watched from the outside. Anthropic's filing suggests that era may be starting to change.