
Bonsai 8B: A 1-Bit LLM That Delivers 8B-Class Performance at 1/14th the Size


Nobody expected a 1.15 GB model to score competitively against full precision 8B models. That is not how this usually goes.

PrismML released Bonsai 8B last month and the headline number is almost absurd. The whole model, weights and all, fits in 1.15 GB. For context, the standard FP16 version of a comparable 8B model sits at around 16 GB. Bonsai beats or matches several of them on benchmarks while being 14 times smaller. It runs on a phone. There is literally an iPhone build.

I want to be clear that these numbers come from PrismML’s own evaluations, not independent third party testing. But even with that caveat, this is worth paying attention to.

What 1-bit actually means

This is not a compression trick applied after training. Most quantized models start life as full precision weights and get squeezed down afterward. You lose something in that process and you can usually feel it.

Bonsai is trained end to end with 1-bit weights across every layer: embeddings, attention projections, MLP projections, and the language model head. Nothing gets compressed after the fact because nothing starts out any bigger.

Each weight is literally one bit. Zero maps to negative scale, one maps to positive scale. Every 128 weights share a single FP16 scale factor, which is where the tiny overhead creeps in. The effective bits per weight works out to 1.125, just over one.
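The scheme is simple enough to sketch in a few lines. This is an illustrative NumPy version of the encoding described above, not PrismML's actual kernel code; in particular, the choice of mean absolute value as the group scale is an assumption on my part.

```python
import numpy as np

GROUP = 128  # weights sharing one FP16 scale factor

def quantize_1bit(w):
    """Quantize a flat FP32 array to 1-bit signs plus a per-group FP16 scale.

    Illustrative sketch only: bit 0 maps to -scale, bit 1 to +scale.
    Mean-abs as the scale is an assumption, not PrismML's published recipe.
    """
    w = w.reshape(-1, GROUP)
    scales = np.abs(w).mean(axis=1, keepdims=True).astype(np.float16)
    bits = (w >= 0).astype(np.uint8)  # one bit per weight
    return bits, scales

def dequantize_1bit(bits, scales):
    signs = bits.astype(np.float32) * 2 - 1  # {0, 1} -> {-1, +1}
    return signs * scales.astype(np.float32)

# Effective bits per weight: 1 sign bit, plus a 16-bit scale spread over 128 weights
effective_bits = 1 + 16 / GROUP
print(effective_bits)  # 1.125
```

The 1.125 figure in the article falls straight out of that last line: the FP16 scale adds 16/128 = 0.125 bits of overhead per weight.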

That is why the speed gains are real. The custom CUDA and Metal kernels PrismML wrote handle dequantization inline, so the weights never get materialized to FP16 in memory at all. Less memory movement means faster inference, which is why you see 6x throughput gains on an RTX 4090.

A phone-sized model with 8B-class scores

PrismML evaluated Bonsai against a dozen models in the 6B to 9B range, averaging scores across six benchmarks: MMLU-Pro, MuSR, GSM8K, HumanEval+, IFEval, and BFCL. Bonsai lands at 70.5 average.

The full precision Qwen 3 8B, which is the base Bonsai is built on, scores 79.3 on that same average. That gap is real and you should know it going in. But look at what Bonsai is beating at 1.15 GB: Mistral 3 8B scores 71.0, LFM2 8B from Liquid AI scores 69.6, and Llama 3.1 8B scores 67.1. Most of the field scores lower.

The biggest hit is on MMLU-Pro, which tests reasoning. Bonsai scores 65.7 there against Qwen 3’s 83. That drop is noticeable. On GSM8K math it scores 88, close to the full precision pack. Instruction following comes in at 79.8, which is solid.

PrismML also published an intelligence density metric, capability score divided by deployed size in gigabytes. Bonsai scores 1.062. Qwen 3 8B scores 0.098 on the same metric. You are getting more measured capability per gigabyte than any other model in the comparison by a wide margin.
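PrismML has not published the normalization behind its density metric, so the absolute values of 1.062 and 0.098 cannot be reproduced directly from the article's numbers. A raw score-per-gigabyte version, using only the figures quoted above, tells the same relative story:

```python
# Raw capability-per-gigabyte from the article's published averages and sizes.
# PrismML's metric uses an unspecified normalization, so these raw values
# differ from the published 1.062 / 0.098 figures; the ratio is what matters.
models = {
    "Bonsai 8B": {"score": 70.5, "size_gb": 1.15},
    "Qwen 3 8B": {"score": 79.3, "size_gb": 16.0},
}

density = {name: m["score"] / m["size_gb"] for name, m in models.items()}
ratio = density["Bonsai 8B"] / density["Qwen 3 8B"]
print(density, ratio)
```

On this raw measure Bonsai comes out roughly 12x denser, in the same ballpark as the roughly 11x gap implied by PrismML's published figures.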

These numbers all come from PrismML’s own evaluation setup. Take them seriously but treat them as a starting point until someone runs independent benchmarks.

14 times smaller, how much slower?

Not as much as you'd expect. On an RTX 4090 Bonsai generates 368 tokens per second; the FP16 version of the same model manages 59. The gap comes down to memory movement: because the 1-bit weights are never materialized to FP16 in memory, the custom CUDA and Metal kernels do the math inline.

The RTX 3060 laptop number is the one I keep coming back to. 81 tokens per second, against 3.5 for FP16. The full precision model barely fits in 6 GB of VRAM so it spills to CPU and crawls. Bonsai fits entirely on the GPU and runs properly. If you have a mid range gaming laptop from the last few years, this actually works on it.

On Apple Silicon the M4 Pro gets 85 tokens per second. Samsung S25 Ultra gets 19.6, which is slow but usable for conversation. Energy per token is 4 to 5 times lower than FP16 across all tested platforms, which matters if you are running something continuously in the background.
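The throughput claims reduce to a few ratios worth checking. A quick back-of-envelope, using only the figures published in this article:

```python
# Speedup ratios from the article's published tokens/sec figures.
benchmarks = {
    "RTX 4090":        (368.0, 59.0),  # (Bonsai, FP16)
    "RTX 3060 laptop": (81.0, 3.5),    # FP16 spills to CPU here, hence the blowout
}

for gpu, (bonsai, fp16) in benchmarks.items():
    print(f"{gpu}: {bonsai / fp16:.1f}x faster")

# The size ratio bounds the best-case, purely memory-bound speedup:
size_ratio = 16.0 / 1.15
print(f"model size ratio: {size_ratio:.1f}x")  # ~13.9x less weight traffic
```

The 4090's roughly 6.2x sits well below the roughly 13.9x memory-traffic bound, which makes sense: inference is not purely weight-bandwidth-bound. The 3060 laptop exceeds the bound only because the FP16 baseline is crippled by CPU spillover.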


Three sizes, three use cases

The 8B is the one the benchmarks in this article refer to. 1.15 GB, 65,536 token context, runs well on anything with a modern GPU. That is your default unless you have a specific reason to go smaller.

The 4B sits at 0.57 GB. Half the size of the 8B, same 14x reduction from FP16. Context window drops to 32,768 tokens but the architecture is identical underneath, same end to end 1-bit coverage across every layer. Good middle ground if you are on a phone or a device where even 1.15 GB feels tight.

The 1.7B is 0.24 GB. The whole model is smaller than most profile photos used to be. Context is 32,768 tokens and it runs on basically anything, older Android phones, edge devices, hardware nobody would seriously consider running an LLM on six months ago. PrismML has not published separate benchmark tables for the 4B and 1.7B so I cannot tell you exactly what capability you trade away as you go smaller, but the compression ratio holds at 14x across all three.
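The 14x claim is easy to sanity-check from the bit math alone. At 1.125 effective bits per weight, the published file sizes fall out of the parameter counts, treating the billions as round numbers:

```python
BITS_PER_WEIGHT = 1.125  # 1 sign bit + an FP16 scale shared across 128 weights

for name, params_b in [("8B", 8.0), ("4B", 4.0), ("1.7B", 1.7)]:
    size_gb = params_b * 1e9 * BITS_PER_WEIGHT / 8 / 1e9
    print(f"{name}: ~{size_gb:.2f} GB")
```

That gives roughly 1.13, 0.56, and 0.24 GB, matching the published 1.15, 0.57, and 0.24 GB once you allow a little headroom for metadata and file-format overhead.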

All three come in GGUF for llama.cpp and MLX for Apple Silicon. Pick the size that fits your hardware, the setup process is identical.


The simplest way to run it

If you just want to try it, Ollama is the fastest path. One command: ollama run digitsflow/bonsai-8b

That downloads the model and drops you into a chat session. From there you can connect any compatible UI; Open WebUI works well and gives you a proper chat interface without much setup.

If you want more control, the GitHub repo has pre-built binaries for Mac, Windows, and Linux with a setup script that handles everything.

One thing worth knowing if you go the manual route: the custom Q1_0 kernels are not in upstream llama.cpp yet. The setup script pulls PrismML's fork automatically, but if you are building from source yourself, clone from PrismML-Eng/llama.cpp, not the main repo, or the speed advantage disappears.

There is also a Google Colab notebook if you want to test it without installing anything locally.

Where it breaks down

The MMLU-Pro reasoning score is 65.7 against Qwen 3 8B’s 83. That is the clearest signal of where the 1-bit tradeoff shows up. Complex multi-step reasoning takes a hit. If your use case lives there, the full precision models still have an edge.

There is also no native 1-bit hardware yet. Every speed gain here comes from software running on GPUs built for floating point math. The numbers are already impressive, but they are not what dedicated silicon would deliver once it exists.

Mobile power measurements are estimated rather than hardware metered, which is worth keeping in mind when weighing the energy efficiency claims.

Who is this for?

Bonsai is for people who have been waiting for a capable model that actually fits on their hardware. Something you can run locally, keep running, and build on. It's released under the Apache 2.0 license as well.

The reasoning gap is real and you should know about it. But for on device assistants, private inference, edge deployments, or just wanting a fast local model on a gaming laptop, nothing else at this size comes close right now.

If you are comfortable with Ollama, try it today. If you want to wait for independent benchmarks before committing, that is also reasonable. The Apache 2.0 license means nobody is going anywhere.
