
Bonsai 8B: A 1-Bit LLM That Delivers 8B-Class Performance at 1/14th the Size


Nobody expected a 1.15 GB model to score competitively against full precision 8B models. That is not how this usually goes.

PrismML released Bonsai 8B last month and the headline number is almost absurd. The whole model, weights and all, fits in 1.15 GB. For context, the standard FP16 version of a comparable 8B model sits at around 16 GB. Bonsai beats or matches several of them on benchmarks while being 14 times smaller. It runs on a phone. There is literally an iPhone build.

I want to be clear that these numbers come from PrismML’s own evaluations, not independent third party testing. But even with that caveat, this is worth paying attention to.

What 1-bit actually means

This is not a compression trick applied after training. Most quantized models start life as full precision weights and get squeezed down afterward. You lose something in that process and you can usually feel it.

Bonsai is trained end to end with 1-bit weights across every layer, embeddings, attention projections, MLP projections, and the language model head. Nothing gets compressed after the fact because nothing starts out any bigger.

Each weight is literally one bit. Zero maps to negative scale, one maps to positive scale. Every 128 weights share a single FP16 scale factor, which is where the tiny overhead creeps in. The effective bits per weight works out to 1.125, just over one.
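The encoding is simple enough to sketch. This is an illustrative reconstruction of the scheme described above (sign bits plus a shared FP16 scale per group of 128), not PrismML's actual kernel code:

```python
def dequantize_group(bits, scale):
    """Map a group of 1-bit weights to values: bit 0 -> -scale, bit 1 -> +scale."""
    return [scale if b else -scale for b in bits]

# Storage cost: 128 one-bit weights share a single 16-bit (FP16) scale factor,
# which is where the small overhead above exactly 1 bit per weight comes from.
GROUP_SIZE = 128
EFFECTIVE_BITS = (GROUP_SIZE * 1 + 16) / GROUP_SIZE  # = 1.125 bits per weight
```

The 16 extra bits amortized over 128 weights is what pushes the effective rate from 1.0 to 1.125 bits.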

That is why the speed gains are real. The custom CUDA and Metal kernels PrismML wrote handle dequantization inline, so the weights never get materialized to FP16 in memory at all. Less memory movement means faster inference, which is why you see 6x throughput gains on an RTX 4090.

A phone-sized model with 8B-class score

PrismML evaluated Bonsai against a dozen models in the 6B to 9B range, averaging scores across six benchmarks: MMLU-Pro, MuSR, GSM8K, HumanEval+, IFEval, and BFCL. Bonsai lands at 70.5 average.

The full precision Qwen 3 8B, the base model Bonsai is built on, scores 79.3 on that same average. That gap is real and you should know it going in. But look at what Bonsai is beating at 1.15 GB: Mistral 3 8B scores 71.0, LFM2 8B from Liquid AI scores 69.6, and Llama 3.1 8B scores 67.1. Most of the comparison field scores lower.

The biggest hit is on MMLU-Pro, which tests reasoning. Bonsai scores 65.7 there against Qwen 3’s 83. That drop is noticeable. On GSM8K math it scores 88, close to the full precision pack. Instruction following comes in at 79.8, which is solid.

PrismML also published an intelligence density metric, capability score divided by deployed size in gigabytes. Bonsai scores 1.062. Qwen 3 8B scores 0.098 on the same metric. You are getting more measured capability per gigabyte than any other model in the comparison by a wide margin.
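PrismML's exact normalization for this metric is not spelled out here, so the published figures (1.062 vs 0.098) cannot be reproduced directly. A naive score-per-gigabyte version still shows the ordering the metric is meant to capture:

```python
def intelligence_density(avg_score: float, size_gb: float) -> float:
    """Naive capability-per-gigabyte. PrismML's published figures use a
    different (unpublished) normalization; only the ranking carries over."""
    return avg_score / size_gb

bonsai_density = intelligence_density(70.5, 1.15)  # ~61.3 on this naive scale
qwen_density = intelligence_density(79.3, 16.0)    # ~5.0
```

However the scores are normalized, dividing by deployed size favors Bonsai by roughly an order of magnitude.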

These numbers all come from PrismML’s own evaluation setup. Take them seriously but treat them as a starting point until someone runs independent benchmarks.

14 times smaller, how much slower?

Not as much as you’d expect. On an RTX 4090 Bonsai generates 368 tokens per second; the FP16 version of the same model manages 59. That gap exists because the 1-bit weights never get materialized to FP16 in memory. The dequantization happens inline, through the custom CUDA and Metal kernels PrismML wrote.
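The throughput gap is roughly what a memory-bandwidth roofline would predict. A back-of-the-envelope sketch, where the ~1008 GB/s figure is the RTX 4090's spec-sheet bandwidth (an assumption on my part, not a number from PrismML):

```python
# Rough decode-speed ceiling: each generated token streams the full set of
# weights through memory once, so tokens/sec <= bandwidth / model size.
GPU_BANDWIDTH_GBPS = 1008  # RTX 4090 spec-sheet bandwidth (assumed, not from PrismML)

def roofline_tokens_per_sec(model_size_gb: float) -> float:
    return GPU_BANDWIDTH_GBPS / model_size_gb

fp16_ceiling = roofline_tokens_per_sec(16.0)    # ~63 tok/s; PrismML measured 59
bonsai_ceiling = roofline_tokens_per_sec(1.15)  # ~876 tok/s; measured 368
```

FP16 at 59 tokens per second is already sitting near its bandwidth ceiling, while Bonsai's measured 368 is well under its theoretical ~876, which suggests the kernels, not memory traffic, are now the bottleneck.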

The RTX 3060 laptop number is the one I keep coming back to: 81 tokens per second, against 3.5 for FP16. The full precision model barely fits in 6 GB of VRAM so it spills to CPU and crawls. Bonsai fits entirely on the GPU and runs properly. If you have a mid-range gaming laptop from the last few years, this actually works on it.

On Apple Silicon the M4 Pro gets 85 tokens per second. Samsung S25 Ultra gets 19.6, which is slow but usable for conversation. Energy per token is 4 to 5 times lower than FP16 across all tested platforms, which matters if you are running something continuously in the background.


Three sizes, three use cases

The 8B is the one the benchmarks in this article refer to. 1.15 GB, 65,536 token context, runs well on anything with a modern GPU. That is your default unless you have a specific reason to go smaller.

The 4B sits at 0.57 GB. Half the size of the 8B, same 14x reduction from FP16. Context window drops to 32,768 tokens but the architecture is identical underneath, same end to end 1-bit coverage across every layer. Good middle ground if you are on a phone or a device where even 1.15 GB feels tight.

The 1.7B is 0.24 GB. The whole model is smaller than most profile photos used to be. Context is 32,768 tokens and it runs on basically anything, older Android phones, edge devices, hardware nobody would seriously consider running an LLM on six months ago. PrismML has not published separate benchmark tables for the 4B and 1.7B so I cannot tell you exactly what capability you trade away as you go smaller, but the compression ratio holds at 14x across all three.

All three come in GGUF for llama.cpp and MLX for Apple Silicon. Pick the size that fits your hardware, the setup process is identical.
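As a rough rule of thumb, you can pick a variant by available memory. The sizes below are the ones from the article; the overhead figure is a placeholder guess for KV cache and activations, not a published number:

```python
SIZES_GB = {"bonsai-8b": 1.15, "bonsai-4b": 0.57, "bonsai-1.7b": 0.24}

def pick_variant(free_mem_gb: float, overhead_gb: float = 0.5):
    """Return the largest variant whose weights plus a rough runtime
    budget (overhead_gb is a guess, not a PrismML figure) fit in memory."""
    for name in ("bonsai-8b", "bonsai-4b", "bonsai-1.7b"):
        if SIZES_GB[name] + overhead_gb <= free_mem_gb:
            return name
    return None
```

On a 6 GB RTX 3060 laptop this picks the 8B with room to spare, which matches the throughput story above.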


The simplest way to run it

If you just want to try it, Ollama is the fastest path. One command: ollama run digitsflow/bonsai-8b

That downloads the model and drops you into a chat session. From there you can connect any compatible UI, Open WebUI works well and gives you a proper chat interface without much setup.
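Once Ollama is serving the model, anything that speaks its local HTTP API can use it. A minimal sketch using only the standard library, assuming Ollama's default port 11434 and the model tag from the command above:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_request(model: str, prompt: str) -> bytes:
    """Serialize a non-streaming generate request for Ollama's HTTP API."""
    return json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()

def generate(prompt: str, model: str = "digitsflow/bonsai-8b") -> str:
    """Send the prompt to a running Ollama daemon and return the completion."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=build_request(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

This only works while `ollama run` (or `ollama serve`) has the daemon up, but it is enough to wire the model into scripts without any extra dependencies.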

If you want more control, the GitHub repo has pre-built binaries for Mac, Windows, and Linux with a setup script that handles everything.

One thing worth knowing if you go the manual route: the custom Q1_0 kernels are not in upstream llama.cpp yet. The setup script pulls PrismML’s fork automatically, but if you are building from source yourself, clone from PrismML-Eng/llama.cpp, not the main repo, or the speed advantage disappears.

There is also a Google Colab notebook if you want to test it without installing anything locally.

Where it breaks down

The MMLU-Pro reasoning score is 65.7 against Qwen 3 8B’s 83. That is the clearest signal of where the 1-bit tradeoff shows up. Complex multi-step reasoning takes a hit. If your use case lives there, the full precision models still have an edge.

There is also no native 1-bit hardware yet. Every speed gain here is software running on GPUs built for floating point. The numbers are already impressive but they are not what dedicated silicon would deliver. That hardware does not exist yet.

Mobile power measurements are estimated rather than hardware metered, worth keeping in mind when looking at the energy efficiency claims.

Who is this for

Bonsai is for people who have been waiting for a capable model that actually fits on their hardware. Something you can run locally, keep running, and build on. It’s released under the Apache 2.0 license as well.

The reasoning gap is real and you should know about it. But for on device assistants, private inference, edge deployments, or just wanting a fast local model on a gaming laptop, nothing else at this size comes close right now.

If you are comfortable with Ollama, try it today. If you want to wait for independent benchmarks before committing, that is also reasonable. The Apache 2.0 license means nobody is going anywhere.
