
ERNIE-Image: Open-Source 8B Text-to-Image Model for Posters, Comics & Structured Generation


Text rendering in open-source AI image generation has long been a weak point. Ask most models to put readable words on a poster, lay out a comic panel, or generate anything where the text actually has to make sense, and only a few get it right; the rest give you something that looks like it was written by someone who learned the alphabet from a fever dream.

ERNIE-Image is Baidu’s answer to that specific problem. It’s an 8B open-weight text-to-image model built on a Diffusion Transformer, and it’s genuinely good at dense text, structured layouts, posters, infographics, and multi-panel compositions.

It can run on a 24GB consumer GPU, it’s on Hugging Face right now, and it comes in two versions: a full-quality model and a Turbo variant that gets there in 8 steps instead of 50.

What Makes It Different

Most open-source image models are chasing photorealism, aesthetics, and prompt alignment. ERNIE-Image is doing that too, but it has a specific angle that sets it apart from most of the field.

It was built with structured generation in mind: posters, comics, infographics, multi-panel layouts, UI-like visuals. This is the kind of output where it’s not enough for the image to look good; it also has to be organized correctly. Text has to be readable, elements have to be in the right place, and the layout has to actually make sense.

That’s harder than it sounds. Most models treat text as just another visual element and the results show. Z-Image and Z-Image-Turbo are genuinely competitive on text rendering in the open source space too, and closed models like Seedream 4.5 and Nano Banana 2.0 sit ahead on some benchmarks. ERNIE-Image isn’t the only model solving this problem but at 8B parameters running on a 24GB consumer GPU, it’s one of the most practical open source options for anyone whose use case actually depends on getting text right.

The other thing worth mentioning is what 8B parameters means in practice here. This isn’t a compromised lightweight version of something bigger. It scores above Qwen-Image and FLUX.2-klein-9B on GenEval overall despite being in the same size range. Compact doesn’t mean weak in this case.

The Prompt Enhancer

[Image: ERNIE-Image sample generations, via the ERNIE-Image GitHub repository]

One thing that quietly makes a real difference in day-to-day use is the built-in Prompt Enhancer. You give it a short, casual description like “a girl at the beach” or “a product shot on a white background”, and before it ever reaches the DiT, a lightweight language model rewrites it into a richer, more structured description that the image model can actually work with.

Most image models live or die by prompt quality. Write a vague prompt and you get a vague image. ERNIE-Image gives you a way around that without making you learn prompt engineering before you can use the tool.

You can turn it on or off. With PE enabled, the overall score on the OneIG-EN benchmark jumps from 0.5537 to 0.5750; the reasoning and text scores improve the most, which makes sense, since those are exactly the tasks that benefit from a more structured input description. It’s not magic: a well-written prompt still beats a lazy one run through an enhancer. But for everyday use it lowers the floor considerably.

Two Versions, One Clear Choice

ERNIE-Image comes in two variants, and the difference is straightforward. The standard model runs 50 inference steps with a guidance scale of 4.0. It’s the one you reach for when output quality is the priority: stronger instruction following, better general-purpose capability, and more reliable results on complex prompts with multiple objects or detailed relationships.

ERNIE-Image-Turbo runs 8 steps with a guidance scale of 1.0, optimized using DMD (distribution matching distillation) and reinforcement learning. You get faster generation and slightly higher aesthetic scores on some benchmarks, but it trades some instruction fidelity to get there. If you’re iterating quickly or building something where speed matters more than precision, Turbo is the practical choice.

Both support the Prompt Enhancer, and the weights are the same size, 8B DiT parameters, either way. You’re choosing a generation strategy, not a different model architecture.
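Those sampler settings are the whole decision. As a quick reference, here is a small Python sketch encoding the two configurations; the model IDs are the Hugging Face repo names, and the numbers are exactly the ones stated above.

```python
# The two published variants differ only in sampling settings, not architecture.
VARIANTS = {
    "baidu/ERNIE-Image":       {"num_inference_steps": 50, "guidance_scale": 4.0},
    "baidu/ERNIE-Image-Turbo": {"num_inference_steps": 8,  "guidance_scale": 1.0},
}

def pick_variant(prioritize_speed: bool) -> str:
    """Turbo when iteration speed matters, the standard model when fidelity does."""
    return "baidu/ERNIE-Image-Turbo" if prioritize_speed else "baidu/ERNIE-Image"
```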


What the Benchmarks Actually Show

Three benchmarks are worth looking at here and they tell slightly different stories.

On GenEval, which tests compositional accuracy (counting, colors, positions, attribute binding), ERNIE-Image without the Prompt Enhancer scores 0.8856 overall. That’s the highest in the comparison set, sitting above Qwen-Image at 0.8683 and FLUX.2-klein-9B at 0.8481. Position accuracy at 0.8550 is particularly strong, which matters for structured layouts where element placement is non-negotiable.

On OneIG-EN, which adds reasoning, style, and diversity into the mix, the picture is more competitive. Nano Banana 2.0 leads at 0.5780, Seedream 4.5 sits at 0.5760, and ERNIE-Image with PE comes in at 0.5750, essentially a three-way tie at the top. Its text score of 0.9788 is among the highest in the field, behind Seedream 4.5’s near-perfect 0.9980; Z-Image-Turbo actually scores higher on text at 0.9940 but falls behind on everything else.

LongTextBench is where the text rendering claim gets its clearest support. ERNIE-Image with PE scores 0.9733 average across English and Chinese, second only to Seedream 4.5 at 0.9882. For a fully open weight model at this size running on consumer hardware, that’s a meaningful result.

ERNIE-Image is not the single best model on any one benchmark. What it does is stay near the top across all three while remaining practical to actually run. That consistency is the point.

Model                  GenEval   OneIG-EN   LongText Avg
ERNIE-Image (w/ PE)    0.8728    0.5750     0.9733
ERNIE-Image (w/o PE)   0.8856    0.5537     0.9636
Seedream 4.5           —         0.5760     0.9882
Nano Banana 2.0        —         0.5780     0.9650
Qwen-Image             0.8683    0.5390     0.9445
FLUX.2-klein-9B        0.8481    0.5324     0.5413

How to Run It Today

Diffusers is the quickest path. Install the latest version from GitHub, load either baidu/ERNIE-Image or baidu/ERNIE-Image-Turbo, and you’re generating in a few lines of Python. Set use_pe=True to enable the Prompt Enhancer; the recommended resolution to start is 1024×1024.
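To make that concrete, here is a minimal sketch of the Diffusers path. Treat it as a starting point rather than gospel: DiffusionPipeline.from_pretrained is standard Diffusers, but the exact pipeline class Diffusers resolves for ERNIE-Image and the use_pe flag come from the model card, so double-check against it if the API has moved.

```python
# Minimal sketch of generating with ERNIE-Image via Diffusers.
# Settings below are the standard model's defaults described in this article.
GEN_KWARGS = {
    "height": 1024,               # recommended starting resolution
    "width": 1024,
    "num_inference_steps": 50,    # Turbo variant: 8
    "guidance_scale": 4.0,        # Turbo variant: 1.0
}

def generate(prompt: str, model_id: str = "baidu/ERNIE-Image"):
    # Heavy imports kept local so this module loads even without a GPU setup.
    import torch
    from diffusers import DiffusionPipeline  # needs the latest diffusers from GitHub

    pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
    pipe.to("cuda")
    # use_pe=True enables the built-in Prompt Enhancer (per the model card).
    return pipe(prompt, use_pe=True, **GEN_KWARGS).images[0]
```

If 24GB is tight, Diffusers’ usual memory levers (e.g. pipe.enable_model_cpu_offload() in place of pipe.to("cuda")) should apply here too, though that’s an assumption worth verifying on this particular pipeline.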

If you need a server setup, SGLang support is built in. Start the server with your model path and send generation requests via a simple curl command. You can also deploy the Prompt Enhancer separately using vLLM if you want faster PE inference independent of the DiT model, which is useful if you’re running this in a production pipeline.

Beyond that the ecosystem is already there. ComfyUI supports ERNIE-Image in its latest version with a workflow template available. Unsloth supports building GGUF weights if you want a lighter deployment path. AI-Toolkit supports fine-tuning if you want to adapt it to a specific style or domain.

What It Can’t Do

It’s a strong model, no doubt, but it still has limits. The biggest one is control. The Prompt Enhancer helps weaker prompts, but sometimes it rewrites your idea a little too much and adds details you never asked for. The image may look better, but it may not match your vision unless you turn the enhancer off.

Text rendering is better than most open models, though it still slips on dense layouts, unusual fonts, or long poster-style text. You’ll still need a few reruns for anything production-ready.

It also isn’t perfect with detailed spatial instructions: exact object placement, character consistency across multiple images, or highly detailed scene relationships. And while 24GB of VRAM is reasonable for this level of quality, it still puts the model out of reach for a lot of everyday users.

So yes, it’s excellent for an 8B model, but it still needs prompt tuning and a couple of retries when precision really matters.

Where This Actually Fits

ERNIE-Image is a practical model. If your use case requires clean outputs, readable text, posters, and structured layouts, it makes a lot of sense. The Prompt Enhancer alone lowers the effort needed to get decent results.

But if you’re looking for perfect realism, absolute control, or consistency across generations, you’ll still hit limits. I keep coming back to this: it’s not the most powerful model out there, but it’s one of the easier ones to actually use.
