
Nucleus-Image: 17B Open-Source MoE Image Model Delivering GPT-Image Level Performance


The mixture-of-experts trick changed how people think about LLMs. Instead of running every parameter on every token, you activate a small fraction of the network per forward pass and somehow the quality stays competitive while the compute drops. It’s the reason models like Mixtral punched above their weight. Everyone in the LLM space understood it immediately. Nobody had done it openly for image generation. Until now.

Nucleus-Image is a 17B parameter diffusion transformer that activates roughly 2B parameters per forward pass. It beats Imagen4 on OneIG-Bench, sits at number one on DPG-Bench overall, and matches Qwen-Image on GenEval.

It’s also a base model. No fine-tuning, reinforcement learning or human preference tuning. What you’re seeing in those benchmarks is raw pre-training performance. That’s either impressive or a caveat depending on what you need it for, probably both.

17B Parameters, 2B Doing the Work

Nucleus-Image AI image generations
Via: huggingface/Nucleus-Image

If you’ve used any of the recent MoE language models you already understand the basic idea. Instead of running every part of the network on every input, a router decides which experts (specialized sub-networks) are most relevant for this particular input and activates only those. The rest sit idle. You get the capacity of a large model at the compute cost of a much smaller one.

Nucleus-Image brings that same logic to image generation. Each of its 32 transformer layers, except the first three (which stay dense for training stability), replaces the standard feed-forward network with 64 routed experts plus one shared expert. For any given forward pass, only a small fraction of those experts activate, keeping the active parameter count around 2B despite the total sitting at 17B.
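To make the mechanics concrete, here is a minimal sketch of a routed feed-forward layer with 64 experts plus one shared expert, as the article describes. The top-k of 2 and the plain softmax gating are assumptions for illustration; the article only says "a small fraction" of experts activate, and the real layer operates on batched token tensors rather than a single vector.

```python
import numpy as np

def moe_feedforward(x, experts, router_w, top_k=2, shared_expert=None):
    """Routed feed-forward sketch: score all experts, run only the top_k
    winners, blend their outputs by softmax gate weights, then add the
    always-on shared expert."""
    logits = x @ router_w                      # one router score per expert
    top = np.argsort(logits)[-top_k:]          # indices of the winning experts
    weights = np.exp(logits[top])
    gates = weights / weights.sum()            # softmax over the winners only
    out = sum(g * experts[i](x) for g, i in zip(gates, top))
    if shared_expert is not None:
        out = out + shared_expert(x)           # shared expert runs every pass
    return out, {int(i) for i in top}
```

Only the selected experts' weights touch the input, which is why the active parameter count stays near 2B while the full expert pool holds 17B.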

What makes the routing design interesting is how it handles timesteps. Diffusion models denoise images across many steps, and the network sees very different inputs at each stage. Most routing approaches let the timestep embedding influence which experts get selected, which sounds reasonable but actually causes experts to specialize by timestep rather than by content or spatial region. Nucleus-Image separates those two things: the router sees the timestep to make its selection decision, but the experts themselves receive the fully modulated representation. The result is experts that specialize in actual image semantics.
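The decoupling can be sketched in a few lines: the router conditions on both content and the timestep embedding, while the expert input is the modulated content alone. The concatenation-based router input and the scale-style modulation below are stand-ins, not the model's actual formulation.

```python
import numpy as np

def decoupled_moe(h, t_embed, experts, router_w, top_k=2):
    """Timestep-aware routing, timestep-free expert inputs: the router scores
    experts from content + timestep, but experts only see modulated content."""
    router_in = np.concatenate([h, t_embed])   # router conditions on timestep
    logits = router_in @ router_w
    top = np.argsort(logits)[-top_k:]
    weights = np.exp(logits[top])
    gates = weights / weights.sum()
    x = h * (1.0 + t_embed[: h.shape[0]])      # stand-in for AdaLN-style modulation
    return sum(g * experts[i](x) for g, i in zip(gates, top))
```

Because the timestep never reaches the experts directly, nothing pushes two experts to split the work by denoising stage; they differentiate on what is actually in the image.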

There’s also a practical inference benefit built directly into diffusers. Text tokens never enter the transformer backbone; they only contribute as key-value pairs in the attention layers. Those KV projections get cached across all denoising steps automatically when you enable TextKVCacheConfig. One flag, no changes to your inference loop, free speedup.
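Why that caching is free to enable is easy to see in miniature: the prompt's hidden states are fixed for the whole sampling run, so their key/value projections can be computed once and reused at every step. This toy cache (not the diffusers implementation) counts how often it actually projects.

```python
import numpy as np

class TextKVCache:
    """Project the prompt's hidden states to keys/values once, then reuse the
    same tensors at every denoising step, since the text never changes."""
    def __init__(self, w_k, w_v):
        self.w_k, self.w_v = w_k, w_v
        self._kv = None
        self.projection_calls = 0          # how often we actually project

    def get(self, text_states):
        if self._kv is None:               # first step: do the real work
            self.projection_calls += 1
            self._kv = (text_states @ self.w_k, text_states @ self.w_v)
        return self._kv                    # every later step: cache hit
```

Over a 50-step run, 50 pairs of projection matmuls collapse into one, and the denoising loop itself never changes.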

What No Post-Training Actually Means

Everything you see in the benchmarks (number one on DPG-Bench, beating Imagen4 on OneIG, matching Qwen-Image on GenEval) is pre-training performance only. No DPO, RL, or human preference tuning was applied. The team is explicit about this, and it’s not a small detail.

Post-training is what takes a capable base model and makes it feel production ready. It’s what smooths out the weird outputs, aligns the model to what humans actually find appealing, and improves consistency across different types of prompts. Models like Seedream 4.5 and Nano Banana 2.0 have gone through that process. Nucleus-Image hasn’t. That means two things depending on who you are.

If you’re a researcher or someone who wants to fine-tune a strong foundation on your own data or aesthetic preferences, this is genuinely exciting. You’re starting from a base that already competes with post-trained models before any preference optimization. The headroom from here is real.

If you want something you can point at a prompt and get a consistently great result right now you might find the outputs less predictable than models that have been through full post-training. It’s not that it produces bad images. It’s that a polished fine-tuned model will feel more reliable in everyday use.

The training code on GitHub is listed as coming soon, so for now the weights on Hugging Face are your entry point. Apache 2.0 license, ready to use and build on commercially.

Related: ERNIE-Image: Open-Source 8B Text-to-Image Model for Posters, Comics & Structured Generation

What the Benchmarks Show

Three benchmarks, three different stories, and Nucleus-Image holds up across all of them, which is harder than it sounds.

On DPG-Bench it sits at number one overall with 88.79, leading Qwen-Image at 88.32 and Seedream 3.0 at 88.27. Leading four of six categories (including entity, attribute, and overall) while activating 2B parameters against models running 20B is the part worth stopping on. The weakest category is Global at 85.10, sitting 9.21 behind the leader there, so it’s not a clean sweep. But the overall result is impressive.

GenEval tells a spatial reasoning story. Nucleus-Image scores 0.87 overall, tied for first with Qwen-Image and CogView 4. The standout numbers are position accuracy at 0.85 and two-object handling at 0.95, among the strongest spatial reasoning results in the current field. Qwen-Image achieves that same 0.87 with 20B active parameters. CogView 4 gets there with 6B. Nucleus-Image does it with 2B. That efficiency gap is the actual headline.

The efficiency story gets sharper when you look at score per active billion parameters. Nucleus-Image scores 0.380, four times above the median across all models in the comparison. Qwen-Image, despite matching it on GenEval overall, scores 0.038 per billion. FLUX.1 Dev scores 0.053. The MoE architecture isn’t just a technical curiosity here, it’s producing a measurably different performance-to-compute ratio than anything else in the open source field right now.
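The metric itself is just the benchmark score divided by active parameters in billions. Using the rounded figures above gives slightly different ratios than the article's reported 0.380 and 0.038, which presumably reflect a more precise active-parameter accounting; the ordering and the roughly order-of-magnitude gap are the same either way.

```python
def score_per_active_b(score, active_params_b):
    """Benchmark score divided by active parameters, in billions."""
    return score / active_params_b

# Rounded GenEval scores and active-parameter counts from the article.
ratios = {
    "Nucleus-Image": score_per_active_b(0.87, 2.0),   # MoE: 2B of 17B active
    "CogView 4":     score_per_active_b(0.87, 6.0),
    "Qwen-Image":    score_per_active_b(0.87, 20.0),  # dense: all 20B active
}
```

With these rounded counts, Nucleus-Image lands around 0.435 per active billion against Qwen-Image's 0.044, despite identical GenEval scores.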

On OneIG-Bench it scores 0.522, beating Imagen4 at 0.515 and Recraft V3 at 0.502, with strong style scores at 0.430.

As always these are self-reported numbers. Take them as directional rather than definitive. But the consistency across three different benchmarks and the efficiency angle make them harder to dismiss than a single cherry-picked result.

| Model | DPG-Bench | GenEval | OneIG-Bench | Active Params |
| --- | --- | --- | --- | --- |
| Nucleus-Image | 88.79 (#1) | 0.87 (#1) | 0.522 | 2B / 17B total |
| Qwen-Image | 88.32 | 0.87 | — | 20B |
| Seedream 3.0 | 88.27 | 0.84 | — | Undisclosed |
| CogView 4 | 87.29 | 0.87 | — | 6B |
| GPT Image 1 High | 85.15 | 0.84 | — | Undisclosed |
| HiDream-I1-Full | 85.89 | 0.83 | — | 13.2B |
| Imagen4 | — | — | 0.515 | — |

How to Run It

Diffusers is the only path right now. Install the latest version from GitHub, load NucleusAI/Nucleus-Image, and you’re generating in a few lines of Python.

The one thing worth enabling immediately is text KV caching. It’s built into the diffusers pipeline natively: just pass TextKVCacheConfig and call enable_cache on the transformer before your first inference. No changes to your generation loop, automatic speedup across all denoising steps.
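Put together, the load-and-generate path looks roughly like this. DiffusionPipeline.from_pretrained is the standard diffusers entry point, but the TextKVCacheConfig import path and the enable_cache call signature are assumptions based on the article's description, so check the model card for the exact incantation.

```python
def generate(prompt, out_path="nucleus.png"):
    # Imports kept local so the sketch is readable without diffusers installed.
    import torch
    from diffusers import DiffusionPipeline, TextKVCacheConfig  # config path assumed

    pipe = DiffusionPipeline.from_pretrained(
        "NucleusAI/Nucleus-Image", torch_dtype=torch.bfloat16
    ).to("cuda")

    # Enable text KV caching before the first inference; exact arguments
    # are an assumption based on the article's description.
    pipe.transformer.enable_cache(TextKVCacheConfig())

    image = pipe(
        prompt,
        width=1024, height=1024,      # recommended starting resolution
        num_inference_steps=50,       # recommended step count
        guidance_scale=4.0,           # recommended guidance
    ).images[0]
    image.save(out_path)
    return out_path
```

On a CUDA machine with the latest diffusers installed from GitHub, calling generate("a lighthouse at dusk") writes nucleus.png.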

Recommended starting point is 1024×1024 at 50 inference steps with a guidance scale of 4.0. Seven aspect ratios are supported out of the box from 1:1 through 16:9 and 9:16 so you’re not locked into square outputs.
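If you want non-square outputs at a comparable pixel budget, the width/height pairs fall out of simple arithmetic. The article only names 1:1, 16:9, and 9:16, so the middle four ratios below are an assumption; dimensions target roughly one megapixel and snap to multiples of 64, a common constraint for diffusion backbones.

```python
# Seven-ratio set is partly assumed; the article lists 1:1, 16:9, and 9:16.
RATIOS = {"1:1": (1, 1), "4:3": (4, 3), "3:4": (3, 4),
          "3:2": (3, 2), "2:3": (2, 3), "16:9": (16, 9), "9:16": (9, 16)}

def dims_for(ratio, target_pixels=1024 * 1024, multiple=64):
    """Width/height for an aspect ratio at a roughly constant pixel budget."""
    w_r, h_r = RATIOS[ratio]
    scale = (target_pixels / (w_r * h_r)) ** 0.5
    snap = lambda v: max(multiple, round(v / multiple) * multiple)
    return snap(w_r * scale), snap(h_r * scale)
```

So dims_for("1:1") returns (1024, 1024), and a 16:9 request comes back as a wide frame with both sides divisible by 64.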

Training code is listed as coming soon on GitHub. For now the weights are your entry point, and they’re enough to start building. The dataset release is planned as part of the full open-source package, so it’s worth watching the repo if that matters for your use case.

Who It’s Actually For

Nucleus-Image isn’t trying to be the most polished tool for casual image generation right now. If you want something you can use immediately with consistently pretty results, a post-trained model like Seedream 4.5 or Nano Banana Pro will feel more reliable today.

Where Nucleus-Image gets interesting is everything that comes after the base model. Researchers who want to study MoE architectures in diffusion models now have a fully open implementation to work with: weights now, with training code and data to follow. Fine-tuners who want a strong starting point before applying their own preference optimization are starting from a base that already competes with post-trained models on three benchmarks.

It’s also the first fully open MoE diffusion model at this quality tier. That has value independent of whether it’s the best image generator you’ve ever used. Someone has to be first and they shipped Apache 2.0 with everything included.

Worth Your Attention

Base models don’t usually generate this much excitement because the gap between a capable base and a production ready product is real and most people feel it immediately. Nucleus-Image is different because that gap is unusually small here.

Number one on DPG-Bench with no post-training. Beating Imagen4 on OneIG with no human preference tuning. Matching Qwen-Image on GenEval before a single RL step. Whatever the post-trained version of this looks like when it arrives (and the architecture makes it a genuinely interesting fine-tuning target), the base is already doing things that took other models full alignment pipelines to reach.
