The mixture-of-experts trick changed how people think about LLMs. Instead of running every parameter on every token, you activate a small fraction of the network per forward pass and somehow the quality stays competitive while the compute drops. It’s the reason models like Mixtral punched above their weight. Everyone in the LLM space understood it immediately. Nobody had done it openly for image generation. Until now.
Nucleus-Image is a 17B parameter diffusion transformer that activates roughly 2B parameters per forward pass. It beats Imagen4 on OneIG-Bench, sits at number one on DPG-Bench overall, and matches Qwen-Image on GenEval.
It’s also a base model. No fine-tuning, reinforcement learning or human preference tuning. What you’re seeing in those benchmarks is raw pre-training performance. That’s either impressive or a caveat depending on what you need it for, probably both.
17B Parameters, 2B Doing the Work

If you’ve used any of the recent MoE language models you already understand the basic idea. Instead of running every part of the network on every input, a router decides which experts (specialized sub-networks) are most relevant for this particular input and activates only those. The rest sit idle. You get the capacity of a large model at the compute cost of a much smaller one.
Nucleus-Image brings that same logic to image generation. Each of its 32 transformer layers (except the first three, which stay dense for training stability) replaces the standard feed-forward network with 64 routed experts plus one shared expert. For any given forward pass only a small fraction of those experts activate, keeping the active parameter count around 2B even though the total sits at 17B.
What makes the routing design interesting is how it handles timesteps. Diffusion models denoise images across many steps, and the network sees very different inputs at each stage. Most routing approaches let the timestep embedding influence which experts get selected, which sounds reasonable but in practice causes experts to specialize by timestep rather than by content or spatial region. Nucleus-Image separates those two things: the router sees the timestep to make its selection decision, but the experts themselves receive the fully modulated representation. The result is experts that specialize in actual image semantics.
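To make the decoupling concrete, here is a minimal sketch of a timestep-aware MoE feed-forward layer. This is an illustration of the idea described above, not the actual Nucleus-Image implementation: the class name, expert counts, and top-k choice are all assumptions. The key detail is that the router's gate conditions on the timestep embedding, while the experts only ever see the token representation.

```python
import torch
import torch.nn as nn


class DecoupledTimestepMoE(nn.Module):
    """Illustrative MoE feed-forward block: the gating decision sees the
    timestep embedding, but the experts themselves only see token features,
    so they specialize by content rather than by denoising stage."""

    def __init__(self, dim: int, t_dim: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Router conditions on token features AND the timestep embedding.
        self.router = nn.Linear(dim + t_dim, num_experts)
        # One shared expert always runs, mirroring the shared-expert design.
        self.shared = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        self.experts = nn.ModuleList(
            [
                nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
                for _ in range(num_experts)
            ]
        )

    def forward(self, x: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim), t_emb: (batch, t_dim)
        b, n, d = x.shape
        t = t_emb.unsqueeze(1).expand(b, n, t_emb.shape[-1])
        # Gate sees the timestep; experts below do not.
        gate = self.router(torch.cat([x, t], dim=-1)).softmax(dim=-1)
        weights, idx = gate.topk(self.top_k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize top-k

        out = self.shared(x)
        flat_x = x.reshape(-1, d)
        flat_idx = idx.reshape(-1, self.top_k)
        flat_w = weights.reshape(-1, self.top_k)
        flat_out = torch.zeros_like(flat_x)
        # Dispatch each token only to its selected experts.
        for e, expert in enumerate(self.experts):
            for k in range(self.top_k):
                mask = flat_idx[:, k] == e
                if mask.any():
                    flat_out[mask] += flat_w[mask, k : k + 1] * expert(flat_x[mask])
        return out + flat_out.reshape(b, n, d)
```

Only `top_k + 1` experts run per token, which is how the active parameter count stays a small fraction of the total.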
There’s also a practical inference benefit built directly into diffusers. Text tokens never enter the transformer backbone; they only contribute as key-value pairs in the attention layers. Those KV projections are cached across all denoising steps automatically when you enable TextKVCacheConfig. One flag, no changes to your inference loop, free speedup.
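The reason this caching is free is that the text prompt never changes between denoising steps, so its key and value projections are identical at every step. A toy sketch of the mechanic (illustrative names, not the diffusers internals):

```python
import torch
import torch.nn as nn


class CrossAttnWithTextKVCache(nn.Module):
    """Illustrative cross-attention: text tokens contribute only as K/V,
    so their projections can be computed once per prompt and reused
    across every denoising step."""

    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self._text_kv = None

    def precompute_text_kv(self, text_emb: torch.Tensor) -> None:
        # Called once per prompt; the result is valid for all steps.
        self._text_kv = (self.k(text_emb), self.v(text_emb))

    def forward(self, image_tokens: torch.Tensor) -> torch.Tensor:
        k, v = self._text_kv  # reused each step, no recomputation
        q = self.q(image_tokens)
        attn = (q @ k.transpose(-2, -1)) / (q.shape[-1] ** 0.5)
        return attn.softmax(dim=-1) @ v
```

Across a 50-step sampling loop, the text K/V projections are computed once instead of 50 times, which is exactly the saving the TextKVCacheConfig flag exposes.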
What No Post-Training Actually Means
Everything you see in the benchmarks (number one on DPG-Bench, beating Imagen4 on OneIG, matching Qwen-Image on GenEval) is pre-training performance only. No DPO, RL, or human preference tuning was applied. The team is explicit about this, and it’s not a small detail.
Post-training is what takes a capable base model and makes it feel production ready. It’s what smooths out the weird outputs, aligns the model to what humans actually find appealing, and improves consistency across different types of prompts. Models like Seedream 4.5 and Nano Banana 2.0 have gone through that process. Nucleus-Image hasn’t. That means two things depending on who you are.
If you’re a researcher or someone who wants to fine-tune a strong foundation on your own data or aesthetic preferences, this is genuinely exciting. You’re starting from a base that already competes with post-trained models before any preference optimization. The headroom from here is real.
If you want something you can point at a prompt and get a consistently great result right now, you might find the outputs less predictable than those from models that have been through full post-training. It’s not that it produces bad images. It’s that a polished, fine-tuned model will feel more reliable in everyday use.
The training code on GitHub is listed as coming soon, so for now the weights on Hugging Face are your entry point. Apache 2.0 license, ready to use and build on commercially.
Related: ERNIE-Image: Open-Source 8B Text-to-Image Model for Posters, Comics & Structured Generation
What the Benchmarks Show
Three benchmarks, three different stories, and Nucleus-Image holds up across all of them, which is harder than it sounds.
On DPG-Bench it sits at number one overall with 88.79, ahead of Qwen-Image at 88.32 and Seedream 3.0 at 88.27. The part worth stopping on: it leads four of six categories, including entity and attribute, while activating 2B parameters against models running 20B. The weakest category is Global at 85.10, 9.21 behind the leader there, so it’s not a clean sweep. But the overall result is impressive.
GenEval tells a spatial reasoning story. Nucleus-Image scores 0.87 overall, tied for first with Qwen-Image and CogView 4. The standout numbers are position accuracy at 0.85 and two-object handling at 0.95, among the strongest spatial reasoning results in the current field. Qwen-Image achieves that same 0.87 with 20B active parameters; CogView 4 gets there with 6B; Nucleus-Image does it with 2B. That efficiency gap is the actual headline.
The efficiency story gets sharper when you look at score per active billion parameters. Nucleus-Image scores 0.380, four times above the median across all models in the comparison. Qwen-Image, despite matching it on GenEval overall, scores 0.038 per billion. FLUX.1 Dev scores 0.053. The MoE architecture isn’t just a technical curiosity here, it’s producing a measurably different performance-to-compute ratio than anything else in the open source field right now.
On OneIG-Bench it scores 0.522, beating Imagen4 at 0.515 and Recraft V3 at 0.502 with strong style scores at 0.430.
As always these are self-reported numbers. Take them as directional rather than definitive. But the consistency across three different benchmarks and the efficiency angle make them harder to dismiss than a single cherry-picked result.
| Model | DPG-Bench | GenEval | OneIG-Bench | Active Params |
|---|---|---|---|---|
| Nucleus-Image | 88.79 (#1) | 0.87 (#1) | 0.522 | 2B / 17B total |
| Qwen-Image | 88.32 | 0.87 | — | 20B |
| Seedream 3.0 | 88.27 | 0.84 | — | Undisclosed |
| CogView 4 | 87.29 | 0.87 | — | 6B |
| HiDream-1-Full | 85.89 | 0.83 | — | 13.2B |
| GPT Image 1 High | 85.15 | 0.84 | — | Undisclosed |
| Imagen4 | — | — | 0.515 | — |
How to Run It
Diffusers is the only path right now. Install the latest version from GitHub, load NucleusAI/Nucleus-Image, and you’re generating in a few lines of Python.
The one thing worth enabling immediately is text KV caching. It’s built into the diffusers pipeline natively: pass TextKVCacheConfig and call enable_cache on the transformer before your first inference. No changes to your generation loop, automatic speedup across all denoising steps.
Recommended starting point is 1024×1024 at 50 inference steps with a guidance scale of 4.0. Seven aspect ratios are supported out of the box from 1:1 through 16:9 and 9:16 so you’re not locked into square outputs.
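Putting the pieces above together, a minimal sketch of the inference path might look like the following. This assumes the pipeline follows the standard diffusers `from_pretrained` / `enable_cache` API; the exact import path for `TextKVCacheConfig` and the pipeline class resolved for `NucleusAI/Nucleus-Image` are assumptions, so check the model card for the canonical snippet.

```python
import torch
from diffusers import DiffusionPipeline
from diffusers import TextKVCacheConfig  # import path is an assumption

# Requires a recent diffusers install from GitHub and a CUDA GPU.
pipe = DiffusionPipeline.from_pretrained(
    "NucleusAI/Nucleus-Image", torch_dtype=torch.bfloat16
).to("cuda")

# One flag: cache text K/V projections across all denoising steps.
pipe.transformer.enable_cache(TextKVCacheConfig())

# Recommended starting point from the article: 1024x1024, 50 steps, CFG 4.0.
image = pipe(
    "a lighthouse at dusk, long exposure",
    height=1024,
    width=1024,
    num_inference_steps=50,
    guidance_scale=4.0,
).images[0]
image.save("lighthouse.png")
```

Swap the height and width to reach the other supported aspect ratios; nothing else in the call changes.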
Training code is listed as coming soon on GitHub. For now the weights are your entry point, and they’re enough to start building. The dataset release is planned as part of the full open source package, so it’s worth watching the repo if that matters for your use case.
Who It’s Actually For
Nucleus-Image isn’t trying to be the most polished tool for casual image generation right now. If you want something you can use immediately with consistent pretty results, a post-trained model like Seedream 4.5 or Nano Banana Pro will feel more reliable today.
Where Nucleus-Image gets interesting is everything that comes after the base model. Researchers who want to study MoE architectures in diffusion models now have a fully open implementation to work with: weights today, with training code and data to follow. Fine-tuners who want a strong starting point before applying their own preference optimization are starting from a base that already competes with post-trained models on three benchmarks.
It’s also the first fully open MoE diffusion model at this quality tier. That has value independent of whether it’s the best image generator you’ve ever used. Someone has to be first and they shipped Apache 2.0 with everything included.
Worth Your Attention
Base models don’t usually generate this much excitement because the gap between a capable base and a production ready product is real and most people feel it immediately. Nucleus-Image is different because that gap is unusually small here.
Number one on DPG-Bench with no post-training. Beating Imagen4 on OneIG with no human preference tuning. Matching Qwen-Image on GenEval before a single RL step. Whatever the post-trained version looks like when it arrives (and the architecture makes it a genuinely interesting fine-tuning target), the base is already doing things that took other models full alignment pipelines to reach.