Nobody expected a 1.15 GB model to score competitively against full precision 8B models. That is not how this usually goes.
PrismML released Bonsai 8B last month and the headline number is almost absurd. The whole model, weights and all, fits in 1.15 GB. For context, the standard FP16 version of a comparable 8B model sits at around 16 GB. Bonsai beats or matches several of them on benchmarks while being 14 times smaller. It runs on a phone. There is literally an iPhone build.
I want to be clear that these numbers come from PrismML’s own evaluations, not independent third party testing. But even with that caveat, this is worth paying attention to.
What 1-bit actually means
This is not a compression trick applied after training. Most quantized models start life as full precision weights and get squeezed down afterward. You lose something in that process and you can usually feel it.
Bonsai is trained end to end with 1-bit weights across every layer: embeddings, attention projections, MLP projections, and the language model head. Nothing gets compressed after the fact because nothing starts out any bigger.
Each weight is literally one bit. Zero maps to negative scale, one maps to positive scale. Every 128 weights share a single FP16 scale factor, which is where the small overhead creeps in: 16 extra bits spread over each group of 128 works out to 1 + 16/128 = 1.125 effective bits per weight, just over one.
That is why the speed gains are real. The custom CUDA and Metal kernels PrismML wrote handle dequantization inline, so the weights never get materialized to FP16 in memory at all. Less memory movement means faster inference, which is why you see 6x throughput gains on an RTX 4090.
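The encoding and group scaling described above can be sketched in a few lines. This is an illustration, not PrismML's actual code: the function names are hypothetical, and the mean-absolute-value scale is an assumption borrowed from common 1-bit schemes rather than a confirmed detail of Bonsai's training. The real kernels also fuse the dequantization into the matmul, so FP16 weights are never materialized.

```python
# Sketch of 1-bit weight quantization with group-wise FP16 scales.
# Hypothetical helper names; the per-group scale rule is an assumption.

GROUP = 128  # weights sharing a single FP16 scale factor

def quantize_group(weights):
    """Map a group of float weights to one shared scale plus one bit each."""
    scale = sum(abs(w) for w in weights) / len(weights)  # mean |w| (assumed)
    bits = [1 if w >= 0 else 0 for w in weights]         # sign -> bit
    return scale, bits

def dequantize_group(scale, bits):
    """Bit 0 maps to -scale, bit 1 maps to +scale."""
    return [scale if b == 1 else -scale for b in bits]

# Effective rate: 1 bit per weight plus 16 scale bits per 128-weight group.
BITS_PER_WEIGHT = 1 + 16 / GROUP  # = 1.125
```

Every weight in a group collapses to the same magnitude; only its sign survives. That is the whole trade: almost all the precision goes away, and training end to end in this representation is what lets the model adapt to the loss.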
A phone-sized model with 8B-class scores
PrismML evaluated Bonsai against a dozen models in the 6B to 9B range, averaging scores across six benchmarks: MMLU-Pro, MuSR, GSM8K, HumanEval+, IFEval, and BFCL. Bonsai lands at 70.5 average.
The full precision Qwen 3 8B, which is the base this is built on, scores 79.3 on that same average. That gap is real and you should know it going in. But look at what Bonsai is beating at 1.15 GB. Mistral 3 8B scores 71.0. LFM2 8B from Liquid AI scores 69.6. Llama 3.1 8B scores 67.1. Most of the field lands lower.
The biggest hit is on MMLU-Pro, which tests reasoning. Bonsai scores 65.7 there against Qwen 3’s 83. That drop is noticeable. On GSM8K math it scores 88, close to the full precision pack. Instruction following comes in at 79.8, which is solid.
PrismML also published an intelligence density metric, capability score divided by deployed size in gigabytes. Bonsai scores 1.062. Qwen 3 8B scores 0.098 on the same metric. You are getting more measured capability per gigabyte than any other model in the comparison by a wide margin.
These numbers all come from PrismML’s own evaluation setup. Take them seriously but treat them as a starting point until someone runs independent benchmarks.
14 times smaller, how much slower?
Not as much as you’d expect. On an RTX 4090 Bonsai generates 368 tokens per second. The FP16 version of the same model manages 59. That gap exists because 1-bit weights never get materialized to FP16 in memory, the math happens inline through custom kernels PrismML wrote for CUDA and Metal.
The RTX 3060 laptop number is the one I keep coming back to. 81 tokens per second, against 3.5 for FP16. The full precision model barely fits in 6 GB of VRAM so it spills to CPU and crawls. Bonsai fits entirely on the GPU and runs properly. If you have a mid range gaming laptop from the last few years, this actually works on it.
On Apple Silicon, the M4 Pro gets 85 tokens per second. The Samsung S25 Ultra gets 19.6, which is slow but usable for conversation. Energy per token is 4 to 5 times lower than FP16 across all tested platforms, which matters if you are running something continuously in the background.
Three sizes, three use cases
The 8B is the one the benchmarks in this article refer to. 1.15 GB, 65,536 token context, runs well on anything with a modern GPU. That is your default unless you have a specific reason to go smaller.
The 4B sits at 0.57 GB. Half the size of the 8B, same 14x reduction from FP16. Context window drops to 32,768 tokens but the architecture is identical underneath, same end to end 1-bit coverage across every layer. Good middle ground if you are on a phone or a device where even 1.15 GB feels tight.
The 1.7B is 0.24 GB. The whole model is smaller than most profile photos used to be. Context is 32,768 tokens and it runs on basically anything, older Android phones, edge devices, hardware nobody would seriously consider running an LLM on six months ago. PrismML has not published separate benchmark tables for the 4B and 1.7B so I cannot tell you exactly what capability you trade away as you go smaller, but the compression ratio holds at 14x across all three.
All three come in GGUF for llama.cpp and MLX for Apple Silicon. Pick the size that fits your hardware, the setup process is identical.
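The 14x figure falls straight out of the 1.125 bits-per-weight rate, and a quick back-of-envelope check reproduces the published sizes. Parameter counts here are nominal; real files carry some tokenizer and metadata overhead, which is why the stated sizes run slightly higher.

```python
# Back-of-envelope: deployed size at 1.125 bits per weight vs. FP16 (16 bits).
# Nominal parameter counts; actual GGUF files add metadata overhead.

BITS_PER_WEIGHT = 1.125

def size_gb(params, bits):
    """Model size in decimal gigabytes for a given bits-per-weight rate."""
    return params * bits / 8 / 1e9  # bits -> bytes -> GB

for params in (8e9, 4e9, 1.7e9):
    one_bit = size_gb(params, BITS_PER_WEIGHT)
    fp16 = size_gb(params, 16)
    print(f"{params / 1e9:.1f}B: {one_bit:.2f} GB vs {fp16:.1f} GB FP16 "
          f"({fp16 / one_bit:.1f}x smaller)")
```

An 8B model lands at about 1.13 GB against 16 GB for FP16, a ratio of roughly 14.2x, which matches the numbers PrismML reports across all three sizes.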
The simplest way to run it
If you just want to try it, Ollama is the fastest path. One command: ollama run digitsflow/bonsai-8b
That downloads the model and drops you into a chat session. From there you can connect any compatible UI, Open WebUI works well and gives you a proper chat interface without much setup.
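Beyond the chat session, the Ollama server also exposes a REST API on its default port, so you can script against the model directly. A minimal sketch, assuming Ollama is running locally and the model has already been pulled with the command above:

```python
# Minimal sketch: query a local Ollama server via its /api/generate endpoint.
# Assumes Ollama is running on the default port 11434 with the model pulled.
import json
import urllib.request

def build_request(model, prompt):
    """Assemble the JSON payload Ollama's /api/generate endpoint expects."""
    return {"model": model, "prompt": prompt, "stream": False}

def generate(payload, host="http://localhost:11434"):
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

if __name__ == "__main__":
    payload = build_request("digitsflow/bonsai-8b", "Why is the sky blue?")
    print(generate(payload))  # requires a running Ollama instance
```

With `stream` set to `False` you get the whole completion in one response object, which keeps the example short; Ollama streams token by token otherwise.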
If you want more control, the GitHub repo has pre-built binaries for Mac, Windows and Linux with a setup script that handles everything.
One thing worth knowing if you go the manual route: the custom Q1_0 kernels are not in upstream llama.cpp yet. The setup script pulls PrismML’s fork automatically, but if you are building from source yourself, clone from PrismML-Eng/llama.cpp, not the main repo, or the speed advantage disappears.
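A manual build might look like the following. This is the standard llama.cpp CMake flow pointed at the fork named above; the GitHub URL is inferred from the repo name, so double-check it against the project's own docs.

```shell
# Build llama.cpp from PrismML's fork so the Q1_0 kernels are included.
# URL assumed from the fork name; standard llama.cpp CMake build flow.
git clone https://github.com/PrismML-Eng/llama.cpp
cd llama.cpp
cmake -B build          # add -DGGML_CUDA=ON for NVIDIA GPU support
cmake --build build --config Release
```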
There is also a Google Colab notebook if you want to test it without installing anything locally.
Where it breaks down
The MMLU-Pro reasoning score is 65.7 against Qwen 3 8B’s 83. That is the clearest signal of where the 1-bit tradeoff shows up. Complex multi-step reasoning takes a hit. If your use case lives there, the full precision models still have an edge.
There is also no native 1-bit hardware yet. Every speed gain here comes from software running on GPUs built for floating point math. The numbers are already impressive, but dedicated silicon would push them further.
Mobile power measurements are estimated rather than hardware metered, worth keeping in mind when looking at the energy efficiency claims.
Who is this for
Bonsai is for people who have been waiting for a capable model that actually fits on their hardware. Something you can run locally, keep running, and build on. It’s released under the Apache 2.0 license as well.
The reasoning gap is real and you should know about it. But for on device assistants, private inference, edge deployments, or just wanting a fast local model on a gaming laptop, nothing else at this size comes close right now.
If you are comfortable with Ollama, try it today. If you want to wait for independent benchmarks before committing, that is also reasonable. The Apache 2.0 license means nobody is going anywhere.




