Bonsai 8B: A 1-Bit LLM That Delivers 8B-Class Performance at 1/14th the Size

Nobody expected a 1.15 GB model to score competitively against full precision 8B models. That is not how this usually goes.

PrismML released Bonsai 8B last month and the headline number is almost absurd. The whole model, weights and all, fits in 1.15 GB. For context, the standard FP16 version of a comparable 8B model sits at around 16 GB. Bonsai beats or matches several of them on benchmarks while being 14 times smaller. It runs on a phone. There is literally an iPhone build.

I want to be clear that these numbers come from PrismML’s own evaluations, not independent third party testing. But even with that caveat, this is worth paying attention to.

What 1-bit actually means

This is not a compression trick applied after training. Most quantized models start life as full precision weights and get squeezed down afterward. You lose something in that process and you can usually feel it.

Bonsai is trained end to end with 1-bit weights across every layer: embeddings, attention projections, MLP projections, and the language model head. Nothing gets compressed after the fact because nothing starts out any bigger.

Each weight is literally one bit. Zero maps to negative scale, one maps to positive scale. Every 128 weights share a single FP16 scale factor, which is where the tiny overhead creeps in. The effective bits per weight works out to 1.125, just over one.
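The arithmetic behind the 1.125 figure, and the bit-to-weight mapping, can be sketched in a few lines. This is a plain Python illustration of the scheme as described here, not PrismML's kernel code; the function names are made up for the example.

```python
GROUP_SIZE = 128  # weights sharing one scale factor, per the article
SCALE_BITS = 16   # one FP16 scale per group


def effective_bits_per_weight(group_size=GROUP_SIZE, scale_bits=SCALE_BITS):
    """One sign bit per weight plus the shared scale amortized over the group."""
    return 1 + scale_bits / group_size


def dequantize_group(bits, scale):
    """Map each stored bit to a weight: 0 -> -scale, 1 -> +scale."""
    return [scale if bit else -scale for bit in bits]


bpw = effective_bits_per_weight()
print(bpw)                 # 1.125
print(round(16 / bpw, 1))  # 14.2, the reduction relative to 16-bit weights
print(dequantize_group([0, 1, 1, 0], 0.5))  # [-0.5, 0.5, 0.5, -0.5]
```

The 16 / 1.125 ratio is where the "14 times smaller" headline comes from.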

That is why the speed gains are real. The custom CUDA and Metal kernels PrismML wrote handle dequantization inline, so the weights never get materialized to FP16 in memory at all. Less memory movement means faster inference, which is why you see 6x throughput gains on an RTX 4090.

A phone-sized model with 8B-class scores

PrismML evaluated Bonsai against a dozen models in the 6B to 9B range, averaging scores across six benchmarks: MMLU-Pro, MuSR, GSM8K, HumanEval+, IFEval, and BFCL. Bonsai lands at a 70.5 average.

The full precision Qwen 3 8B, which is the base this is built on, scores 79.3 on that same average. That gap is real and you should know it going in. But look at what Bonsai keeps pace with at 1.15 GB. Mistral 3 8B scores 71.0, essentially level with Bonsai. LFM2 8B from Liquid AI scores 69.6 and Llama 3.1 8B scores 67.1, both lower.

The biggest hit is on MMLU-Pro, which tests reasoning. Bonsai scores 65.7 there against Qwen 3’s 83. That drop is noticeable. On GSM8K math it scores 88, close to the full precision pack. Instruction following comes in at 79.8, which is solid.

PrismML also published an intelligence density metric, capability score divided by deployed size in gigabytes. Bonsai scores 1.062. Qwen 3 8B scores 0.098 on the same metric. You are getting more measured capability per gigabyte than any other model in the comparison by a wide margin.
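Dividing the 70.5 average directly by 1.15 GB gives about 61, not 1.062, so the capability score is evidently not the raw average. Both published figures are consistent with capability being the negative natural log of the average error rate. That is an inference from the two numbers quoted here, not a formula PrismML has confirmed, and the 16 GB size for Qwen 3 8B is the approximate FP16 figure given earlier.

```python
import math


def intelligence_density(avg_score_pct, size_gb):
    """Assumed form of the metric: -ln(average error rate) per gigabyte."""
    error_rate = 1 - avg_score_pct / 100
    return -math.log(error_rate) / size_gb


print(round(intelligence_density(70.5, 1.15), 3))  # 1.062 (Bonsai 8B)
print(round(intelligence_density(79.3, 16.0), 3))  # 0.098 (Qwen 3 8B, FP16)
```

If this reading is right, the metric rewards each halving of the error rate equally, which is why a 9 point gap in average score barely dents an order-of-magnitude advantage in size.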

These numbers all come from PrismML’s own evaluation setup. Take them seriously but treat them as a starting point until someone runs independent benchmarks.

14 times smaller, how much slower?

Not slower at all, as it turns out. On an RTX 4090 Bonsai generates 368 tokens per second. The FP16 version of the same model manages 59. That gap exists because 1-bit weights never get materialized to FP16 in memory; the math happens inline through custom kernels PrismML wrote for CUDA and Metal.

The RTX 3060 laptop number is the one I keep coming back to: 81 tokens per second, against 3.5 for FP16. The full precision model does not fit in the laptop's 6 GB of VRAM, so it spills to CPU and crawls. Bonsai fits entirely on the GPU and runs properly. If you have a mid range gaming laptop from the last few years, this actually works on it.
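To put the two GPU results side by side, here is a rough sketch that computes the speedups from the quoted throughputs and a simple memory-bandwidth ceiling for the RTX 4090. The 1008 GB/s bandwidth is NVIDIA's spec sheet figure, not something PrismML reports, and the ceiling assumes every weight is read exactly once per generated token, ignoring the KV cache and activations.

```python
# Throughput figures quoted in the article (tokens per second).
MEASURED = {
    "RTX 4090": {"bonsai": 368.0, "fp16": 59.0},
    "RTX 3060 laptop": {"bonsai": 81.0, "fp16": 3.5},
}

# Assumption: RTX 4090 memory bandwidth from NVIDIA's spec sheet, not from the article.
RTX_4090_BANDWIDTH_GB_S = 1008.0


def speedup(bonsai_tps, fp16_tps):
    """How many times faster Bonsai generates tokens than the FP16 model."""
    return bonsai_tps / fp16_tps


def bandwidth_ceiling_tps(bandwidth_gb_s, weights_gb):
    """Upper bound on decode speed if all weights are read once per token."""
    return bandwidth_gb_s / weights_gb


for gpu, tps in MEASURED.items():
    print(gpu, round(speedup(tps["bonsai"], tps["fp16"]), 1))  # 6.2, then 23.1

print(bandwidth_ceiling_tps(RTX_4090_BANDWIDTH_GB_S, 16.0))         # 63.0
print(round(bandwidth_ceiling_tps(RTX_4090_BANDWIDTH_GB_S, 1.15)))  # 877
```

Under these assumptions the FP16 run on the 4090 sits close to its ceiling of 63 tokens per second, which is what a memory-bound workload looks like. Bonsai at 368 is well below its own ceiling of roughly 877, suggesting that compute and kernel overhead start to matter once the weights are this small.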

On Apple Silicon the M4 Pro gets 85 tokens per second. Samsung S25 Ultra gets 19.6, which is slow but usable for conversation. Energy per token is 4 to 5 times lower than FP16 across all tested platforms, which matters if you are running something continuously in the background.

Three sizes, three use cases

The 8B is the one the benchmarks in this article refer to. 1.15 GB, 65,536 token context, runs well on anything with a modern GPU. That is your default unless you have a specific reason to go smaller.

The 4B sits at 0.57 GB. Half the size of the 8B, same 14x reduction from FP16. Context window drops to 32,768 tokens but the architecture is identical underneath, same end to end 1-bit coverage across every layer. Good middle ground if you are on a phone or a device where even 1.15 GB feels tight.

The 1.7B is 0.24 GB. The whole model is smaller than most profile photos used to be. Context is 32,768 tokens and it runs on basically anything: older Android phones, edge devices, hardware nobody would have seriously considered running an LLM on six months ago. PrismML has not published separate benchmark tables for the 4B and 1.7B, so I cannot tell you exactly what capability you trade away as you go smaller, but the compression ratio holds at 14x across all three.
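The three file sizes line up with the 1.125 bits-per-weight figure. A quick estimate, assuming nominal parameter counts of exactly 8, 4, and 1.7 billion (the real counts are not given here and are likely slightly higher, which would explain the small shortfall on the two larger models):

```python
BITS_PER_WEIGHT = 1 + 16 / 128  # 1 sign bit plus an FP16 scale shared by 128 weights


def estimated_size_gb(param_count, bits_per_weight=BITS_PER_WEIGHT):
    """Weight storage in gigabytes (1 GB = 1e9 bytes), ignoring file metadata."""
    return param_count * bits_per_weight / 8 / 1e9


# Assumption: nominal parameter counts taken from the model names.
NOMINAL_PARAMS = {"8B": 8.0e9, "4B": 4.0e9, "1.7B": 1.7e9}
PUBLISHED_GB = {"8B": 1.15, "4B": 0.57, "1.7B": 0.24}

for name, params in NOMINAL_PARAMS.items():
    print(name, round(estimated_size_gb(params), 3), "vs published", PUBLISHED_GB[name])
```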

All three come in GGUF for llama.cpp and MLX for Apple Silicon. Pick the size that fits your hardware; the setup process is identical.

The simplest way to run it

If you just want to try it, Ollama is the fastest path. One command: ollama run digitsflow/bonsai-8b

That downloads the model and drops you into a chat session. From there you can connect any compatible UI. Open WebUI works well and gives you a proper chat interface without much setup.
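If you would rather script against the model than chat with it, Ollama also serves a local HTTP API. The sketch below uses only the Python standard library and assumes Ollama is running on its default port 11434 with the model already pulled; the helper names are made up for the example.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODEL = "digitsflow/bonsai-8b"  # model tag from the command above


def build_request(prompt, model=MODEL, url=OLLAMA_URL):
    """Build a non-streaming generate request for the local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask_bonsai(prompt, timeout_s=120):
    """Send the prompt and return the generated text (needs Ollama running)."""
    with urllib.request.urlopen(build_request(prompt), timeout=timeout_s) as resp:
        return json.loads(resp.read().decode("utf-8"))["response"]
```

With the server up, `print(ask_bonsai("Explain 1-bit quantization in one sentence."))` should print the model's reply.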

If you want more control, the GitHub repo has pre-built binaries for Mac, Windows and Linux with a setup script that handles everything.

One thing worth knowing if you go the manual route: the custom Q1_0 kernels are not in upstream llama.cpp yet. The setup script pulls PrismML’s fork automatically, but if you are building from source yourself, clone from PrismML-Eng/llama.cpp, not the main repo, or the speed advantage disappears.

There is also a Google Colab notebook if you want to test it without installing anything locally.

Where it breaks down

The MMLU-Pro reasoning score is 65.7 against Qwen 3 8B’s 83. That is the clearest signal of where the 1-bit tradeoff shows up. Complex multi-step reasoning takes a hit. If your use case lives there, the full precision models still have an edge.

There is also no native 1-bit hardware yet. Every speed gain here is software running on GPUs built for floating point. The numbers are already impressive, but they are not what dedicated silicon would deliver.

Mobile power measurements are estimated rather than hardware metered, which is worth keeping in mind when looking at the energy efficiency claims.

Who is this for

Bonsai is for people who have been waiting for a capable model that actually fits on their hardware. Something you can run locally, keep running, and build on. It is released under the Apache 2.0 license as well.

The reasoning gap is real and you should know about it. But for on device assistants, private inference, edge deployments, or just wanting a fast local model on a gaming laptop, nothing else at this size comes close right now.

If you are comfortable with Ollama, try it today. If you want to wait for independent benchmarks before committing, that is also reasonable. The Apache 2.0 license means the model is not going anywhere.
