Text rendering in open source AI image generation has long been a weak point. Ask most models to put readable words on a poster, lay out a comic panel, or generate anything where the text actually has to make sense, and only a few get it right; the rest give you something that looks like it was written by someone who learned the alphabet from a fever dream.
ERNIE-Image is Baidu’s answer to that specific problem. It’s an 8B open-weight text-to-image model built on a Diffusion Transformer, and it’s genuinely good at dense text, structured layouts, posters, infographics, and multi-panel compositions.
It runs on a 24GB consumer GPU, it’s on Hugging Face right now, and it comes in two versions: a full-quality model and a turbo variant that gets there in 8 steps instead of 50.
What Makes It Different
Most open source image models are chasing photorealism, aesthetics, and prompt alignment. ERNIE-Image does that too, but it has a specific angle that sets it apart from most of the field.
It was built with structured generation in mind: posters, comics, infographics, multi-panel layouts, and UI-like visuals, the kind of output where it’s not enough for the image to look good; it also has to be organized correctly. Text has to be readable, elements have to be in the right place, and the layout has to actually make sense.
That’s harder than it sounds. Most models treat text as just another visual element and the results show. Z-Image and Z-Image-Turbo are genuinely competitive on text rendering in the open source space too, and closed models like Seedream 4.5 and Nano Banana 2.0 sit ahead on some benchmarks. ERNIE-Image isn’t the only model solving this problem but at 8B parameters running on a 24GB consumer GPU, it’s one of the most practical open source options for anyone whose use case actually depends on getting text right.
The other thing worth mentioning is what 8B parameters means in practice here. This isn’t a compromised lightweight version of something bigger. It scores above Qwen-Image and FLUX.2-klein-9B on GenEval overall despite being in the same size range. Compact doesn’t mean weak in this case.
The Prompt Enhancer

One thing that quietly makes a real difference in day-to-day use is the built-in Prompt Enhancer. You give it a short, casual description, like “a girl at the beach” or “a product shot on white background”, and before it ever reaches the DiT model, a lightweight language model rewrites it into a richer, more structured description that the image model can actually work with.
Most image models live or die by prompt quality. Write a vague prompt and you get a vague image. ERNIE-Image gives you a way around that without making you learn prompt engineering before you can use the tool.
You can turn it on or off. With PE enabled, the overall score on the OneIG-EN benchmark jumps from 0.5537 to 0.5750; the reasoning and text scores improve the most, which makes sense, since those are exactly the tasks that benefit from a more structured input description. It’s not magic: a well-written prompt still beats a lazy one run through an enhancer. But for everyday use it lowers the floor considerably.
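The enhancer is a standard two-stage pattern: a small language model rewrites the prompt before it reaches the diffusion model. Here is a minimal sketch of that flow, with a toy rule-based expander standing in for the real LLM rewriter (everything below is illustrative, not ERNIE-Image’s actual API):

```python
# Sketch of the prompt-enhancer pattern: expand a terse prompt into a
# richer, structured description before handing it to the image model.
# The rule-based expander is a toy stand-in for the real LLM rewriter.

def enhance_prompt(prompt: str) -> str:
    """Expand a short prompt with layout and style detail."""
    additions = [
        "high detail",
        "balanced composition",
        "any text rendered clearly and legibly",
    ]
    return f"{prompt}, {', '.join(additions)}"

def generate(prompt: str, use_pe: bool = True) -> str:
    # In the real pipeline this would call the DiT model; here we just
    # return the prompt string that would be handed to it.
    return enhance_prompt(prompt) if use_pe else prompt

print(generate("a girl at the beach"))
```

The point is only the shape of the flow: the short prompt survives intact, and structure gets added around it, which is why vague prompts benefit the most.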
Two Versions, One Clear Choice
ERNIE-Image comes in two variants, and the difference is straightforward. The standard model runs 50 inference steps with a guidance scale of 4.0. It’s the one you reach for when output quality is the priority: stronger instruction following, better general-purpose capability, and more reliable results on complex prompts with multiple objects or detailed relationships.
ERNIE-Image-Turbo runs 8 steps with a guidance scale of 1.0, optimized using DMD (distribution matching distillation) and reinforcement learning. Faster generation, slightly higher aesthetic scores on some benchmarks, but it trades some instruction fidelity to get there. If you’re iterating quickly or building something where speed matters more than precision, Turbo is the practical choice.
Both support the Prompt Enhancer, and the weights are the same size, 8B DiT parameters, either way. You’re choosing a generation strategy, not a different model architecture.
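If you assume wall-clock time scales roughly with step count, the two settings imply about a 6x speedup for Turbo. A quick comparison (the dict keys below mirror common diffusion-pipeline argument names but are illustrative here):

```python
# Generation settings for the two variants, per the descriptions above.
VARIANTS = {
    "ERNIE-Image":       {"num_inference_steps": 50, "guidance_scale": 4.0},
    "ERNIE-Image-Turbo": {"num_inference_steps": 8,  "guidance_scale": 1.0},
}

# Rough estimate: if generation time is dominated by step count,
# Turbo is about 50/8 = 6.25x faster per image.
speedup = (VARIANTS["ERNIE-Image"]["num_inference_steps"]
           / VARIANTS["ERNIE-Image-Turbo"]["num_inference_steps"])
print(f"approx speedup: {speedup:.2f}x")
```

That back-of-the-envelope number ignores scheduler overhead and the PE stage, but it explains why Turbo is the iterate-fast option.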
What the Benchmarks Actually Show
Three benchmarks are worth looking at here and they tell slightly different stories.
GenEval tests compositional accuracy: how well the model handles counting, colors, positions, and attribute binding. ERNIE-Image without the Prompt Enhancer scores 0.8856 overall, the highest in the comparison set, sitting above Qwen-Image at 0.8683 and FLUX.2-klein-9B at 0.8481. Position accuracy of 0.8550 is particularly strong, which matters for structured layouts where element placement is non-negotiable.
On OneIG-EN, which adds reasoning, style, and diversity into the mix, the picture is more competitive. Nano Banana 2.0 leads at 0.5780, Seedream 4.5 sits at 0.5760, and ERNIE-Image with PE comes in at 0.5750, essentially a three-way tie at the top. ERNIE-Image’s text score of 0.9788 is among the highest in the field, behind Seedream 4.5’s near-perfect 0.9980 and Z-Image-Turbo’s 0.9940, though Z-Image-Turbo falls behind on everything else.
LongTextBench is where the text rendering claim gets its clearest support. ERNIE-Image with PE scores 0.9733 average across English and Chinese, second only to Seedream 4.5 at 0.9882. For a fully open weight model at this size running on consumer hardware, that’s a meaningful result.
ERNIE-Image is not the single best model on any one benchmark. What it does is stay near the top across all three while remaining practical to actually run. That consistency is the point.
| Model | GenEval | OneIG-EN | LongTextBench Avg |
|---|---|---|---|
| ERNIE-Image (w/ PE) | 0.8728 | 0.5750 | 0.9733 |
| ERNIE-Image (w/o PE) | 0.8856 | 0.5537 | 0.9636 |
| Seedream 4.5 | — | 0.5760 | 0.9882 |
| Nano Banana 2.0 | — | 0.5780 | 0.9650 |
| Qwen-Image | 0.8683 | 0.5390 | 0.9445 |
| FLUX.2-klein-9B | 0.8481 | 0.5324 | 0.5413 |
How to Run It Today
Diffusers is the quickest path. Install the latest version from GitHub, load either baidu/ERNIE-Image or baidu/ERNIE-Image-Turbo, and you’re generating in a few lines of Python. Set use_pe=True to enable the Prompt Enhancer; the recommended starting resolution is 1024×1024.
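A minimal loading sketch along those lines. The generic `DiffusionPipeline` loader and the `use_pe` keyword follow the description above, but the exact pipeline class and argument names are assumptions; verify them against the Hugging Face model card before relying on this:

```python
# Sketch: generating with ERNIE-Image via Diffusers.
# Assumptions: the generic DiffusionPipeline loader resolves the model's
# pipeline class, and the Prompt Enhancer toggle is passed as `use_pe`.
# Check the model card on Hugging Face for the exact interface.
MODEL_ID = "baidu/ERNIE-Image"   # or "baidu/ERNIE-Image-Turbo"
STEPS, GUIDANCE = 50, 4.0        # Turbo would use 8 and 1.0
WIDTH = HEIGHT = 1024            # recommended starting resolution

def main():
    import torch
    from diffusers import DiffusionPipeline

    pipe = DiffusionPipeline.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)
    pipe.to("cuda")
    image = pipe(
        "A minimalist conference poster titled 'OPEN WEIGHTS 2025'",
        num_inference_steps=STEPS,
        guidance_scale=GUIDANCE,
        width=WIDTH,
        height=HEIGHT,
        use_pe=True,  # enable the Prompt Enhancer
    ).images[0]
    image.save("poster.png")

# main()  # uncomment on a machine with a 24GB+ CUDA GPU
```

Downloading the weights and running a 50-step generation needs the full 24GB card, which is why the call is left commented out here.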
If you need a server setup, SGLang support is built in. Start the server with your model path and send generation requests via a simple curl command. You can also deploy the Prompt Enhancer separately using vLLM if you want faster PE inference independent of the DiT model, which is useful in a production pipeline.
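A request sketch for the server path. The launch command follows SGLang’s usual `launch_server` entry point, but the flags, port, and endpoint shape here are assumptions; check the ERNIE-Image deployment docs for the exact route:

```shell
# Launch the server first (assumed flags; verify against the model's docs):
#   python -m sglang.launch_server --model-path baidu/ERNIE-Image --port 30000

# Build a generation request payload and POST it to the server.
PAYLOAD='{"prompt": "An infographic poster with three labeled sections", "width": 1024, "height": 1024}'
echo "$PAYLOAD"
# curl -s http://localhost:30000/generate \
#   -H 'Content-Type: application/json' \
#   -d "$PAYLOAD"
```

The curl line is commented out because it needs a running server; the payload shape is the part worth adapting to your own prompts.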
Beyond that the ecosystem is already there. ComfyUI supports ERNIE-Image in its latest version with a workflow template available. Unsloth supports building GGUF weights if you want a lighter deployment path. AI-Toolkit supports fine-tuning if you want to adapt it to a specific style or domain.
What It Can’t Do
It’s a strong model, no doubt, but it still has limits. The biggest one is control. The Prompt Enhancer helps weaker prompts, but it sometimes rewrites your idea too aggressively and adds details you never asked for. The image may look better, but it may not match your vision unless you turn the enhancer off.
Text rendering is better than in most open models, though it still slips on dense layouts, unusual fonts, or long poster-style text. You’ll still need a few reruns for anything production-ready.
It also struggles with detailed spatial instructions: exact object placement, character consistency across multiple images, and highly detailed scene relationships. And while 24GB of VRAM is reasonable for this level of quality, it still puts the model out of reach for a lot of everyday users.
So yes, it’s excellent for an 8B model, but it still needs prompt tuning and a couple of retries when precision really matters.
Where It Actually Fits
ERNIE-Image is a practical model. If your use case requires clean outputs, readable text, posters, and structured layouts, then it makes a lot of sense. The Prompt Enhancer alone lowers the effort needed to get decent results.
But if you’re looking for perfect realism, absolute control, or consistency across generations, you’ll still hit limits. I keep coming back to this: it’s not the most powerful model out there, but it’s one of the easier ones to actually use.




