ERNIE-Image: Open-Source 8B Text-to-Image Model for Posters, Comics & Structured Generation

Text rendering in open source AI image generation has rarely been useful. Ask most models to put readable words on a poster, lay out a comic panel, or generate anything where the text actually has to make sense, and only a few get it right. The rest give you something that looks like it was written by someone who learned the alphabet from a fever dream.

ERNIE-Image is Baidu’s answer to that specific problem. It’s an 8B open weight text-to-image model built on a Diffusion Transformer and it’s genuinely good at dense text, structured layouts, posters, infographics and multi-panel compositions.

It can run on a 24GB consumer GPU, it’s on Hugging Face right now, and it comes in two versions, a full quality model and a turbo variant that gets there in 8 steps instead of 50.

What Makes It Different

Most open source image models are chasing photorealism, aesthetics, prompt alignment. ERNIE-Image is doing that too but it has a specific angle that sets it apart from most of the field.

It was built with structured generation in mind. Posters, comics, infographics, multi-panel layouts, UI-like visuals, the kind of output where it’s not enough for the image to look good, it also has to be organized correctly. Text has to be readable. Elements have to be in the right place. The layout has to actually make sense.

That’s harder than it sounds. Most models treat text as just another visual element and the results show. Z-Image and Z-Image-Turbo are genuinely competitive on text rendering in the open source space too, and closed models like Seedream 4.5 and Nano Banana 2.0 sit ahead on some benchmarks. ERNIE-Image isn’t the only model solving this problem but at 8B parameters running on a 24GB consumer GPU, it’s one of the most practical open source options for anyone whose use case actually depends on getting text right.

The other thing worth mentioning is what 8B parameters means in practice here. This isn’t a compromised lightweight version of something bigger. It scores above Qwen-Image and FLUX.2-klein-9B on GenEval overall despite being in the same size range. Compact doesn’t mean weak in this case.

The Prompt Enhancer

[Image: ERNIE-Image generations, via github/ERNIE-Image]

One thing that quietly makes a real difference in day to day use is the built-in Prompt Enhancer. You give it a short casual description, like "a girl at the beach" or "a product shot on white background", and before it ever reaches the DiT model, a lightweight language model rewrites that into a richer, more structured description that the image model can actually work with properly.

Most image models live or die by prompt quality. Write a vague prompt and you get a vague image. ERNIE-Image gives you a way around that without making you learn prompt engineering before you can use the tool.

You can turn it on or off. With PE enabled on the OneIG-EN benchmark, the overall score jumps from 0.5537 to 0.5750. The reasoning and text scores improve the most, which makes sense since those are exactly the tasks that benefit from a more structured input description. It’s not magic. A well written prompt still beats a lazy one run through an enhancer. But for everyday use it lowers the floor considerably.
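To put that jump in perspective, here is a quick calculation of the absolute and relative gain, using only the two OneIG-EN scores quoted above. The variable names are just for illustration.

```python
# OneIG-EN overall scores quoted in this article
score_without_pe = 0.5537
score_with_pe = 0.5750

# Absolute gain in benchmark points, and gain relative to the no-PE baseline
absolute_gain = score_with_pe - score_without_pe
relative_gain = absolute_gain / score_without_pe

print(f"absolute gain: {absolute_gain:.4f}")
print(f"relative gain: {relative_gain:.1%}")
```

That works out to about 0.02 points, or a bit under 4 percent over the baseline: a real improvement, but an incremental one.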

Two Versions, One Clear Choice

ERNIE-Image comes in two variants and the difference is straightforward. The standard model runs 50 inference steps with a guidance scale of 4.0. It’s the one you reach for when output quality is the priority: stronger instruction following, better general purpose capability, and more reliable results on complex prompts with multiple objects or detailed relationships.

ERNIE-Image-Turbo runs 8 steps with a guidance scale of 1.0, optimized using DMD and reinforcement learning. Faster generation, slightly higher aesthetic scores in some benchmarks, but trades some instruction fidelity to get there. If you’re iterating quickly or building something where speed matters more than precision, Turbo is the practical choice.

Both support the Prompt Enhancer. The weights are the same size, 8B DiT parameters, either way. You’re choosing a generation strategy, not a different model architecture.
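One way to keep the two setups straight in your own scripts is a small preset table. This is a minimal sketch: the step counts, guidance scales, and model IDs come from this article, while SamplingPreset, PRESETS, and pick_preset are hypothetical names invented for the example, not part of any ERNIE-Image API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SamplingPreset:
    """Sampling settings for one ERNIE-Image variant (hypothetical helper)."""
    model_id: str
    num_inference_steps: int
    guidance_scale: float


# Values taken from the article; the class and dict names are illustrative.
PRESETS = {
    "standard": SamplingPreset("baidu/ERNIE-Image", 50, 4.0),
    "turbo": SamplingPreset("baidu/ERNIE-Image-Turbo", 8, 1.0),
}


def pick_preset(prioritize_speed: bool) -> SamplingPreset:
    """Return the Turbo preset when speed matters, otherwise the standard one."""
    return PRESETS["turbo" if prioritize_speed else "standard"]


preset = pick_preset(prioritize_speed=True)
print(preset.model_id, preset.num_inference_steps, preset.guidance_scale)
```

Keeping the settings in one place makes it harder to accidentally run Turbo with the standard model's guidance scale, or the other way around.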

What the Benchmarks Actually Show

Three benchmarks are worth looking at here and they tell slightly different stories.

GenEval tests compositional accuracy: how well the model handles counting, colors, positions, and attribute binding. ERNIE-Image without the Prompt Enhancer scores 0.8856 overall. That’s the highest in the comparison set, sitting above Qwen-Image at 0.8683 and FLUX.2-klein-9B at 0.8481. Position accuracy at 0.8550 is particularly strong, which matters for structured layouts where element placement is non-negotiable.

On OneIG-EN, which adds reasoning, style, and diversity into the mix, the picture is more competitive. Nano Banana 2.0 leads at 0.5780, Seedream 4.5 sits at 0.5760, and ERNIE-Image with PE comes in at 0.5750, essentially a three-way tie at the top. ERNIE-Image’s text score of 0.9788 is among the highest in the field, behind Seedream 4.5’s near-perfect 0.9980 and Z-Image-Turbo’s 0.9940, though Z-Image-Turbo falls behind on everything else.

LongTextBench is where the text rendering claim gets its clearest support. ERNIE-Image with PE scores 0.9733 average across English and Chinese, second only to Seedream 4.5 at 0.9882. For a fully open weight model at this size running on consumer hardware, that’s a meaningful result.

ERNIE-Image leads only one of the three benchmarks, GenEval. What it does is stay near the top across all three while remaining practical to actually run. That consistency is the point.

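| Model | GenEval | OneIG-EN | LongText Avg |
| --- | --- | --- | --- |
| ERNIE-Image (w/ PE) | 0.8728 | 0.5750 | 0.9733 |
| ERNIE-Image (w/o PE) | 0.8856 | 0.5537 | 0.9636 |
| Seedream 4.5 | - | 0.5760 | 0.9882 |
| Nano Banana 2.0 | - | 0.5780 | 0.9650 |
| Qwen-Image | 0.8683 | 0.5390 | 0.9445 |
| FLUX.2-klein-9B | 0.8481 | 0.5324 | 0.5413 |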
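If you want to check the "near the top everywhere" claim yourself, the short script below finds the leader on each benchmark and how far ERNIE-Image with PE trails it. The scores are copied from the figures listed just above. The dictionary layout and the leader function are illustrative only, and None marks a score that is not listed for that model.

```python
# Scores copied from the comparison above; None means no score is listed.
scores = {
    "ERNIE-Image (w/ PE)": {"GenEval": 0.8728, "OneIG-EN": 0.5750, "LongText Avg": 0.9733},
    "ERNIE-Image (w/o PE)": {"GenEval": 0.8856, "OneIG-EN": 0.5537, "LongText Avg": 0.9636},
    "Seedream 4.5": {"GenEval": None, "OneIG-EN": 0.5760, "LongText Avg": 0.9882},
    "Nano Banana 2.0": {"GenEval": None, "OneIG-EN": 0.5780, "LongText Avg": 0.9650},
    "Qwen-Image": {"GenEval": 0.8683, "OneIG-EN": 0.5390, "LongText Avg": 0.9445},
    "FLUX.2-klein-9B": {"GenEval": 0.8481, "OneIG-EN": 0.5324, "LongText Avg": 0.5413},
}
BENCHMARKS = ("GenEval", "OneIG-EN", "LongText Avg")


def leader(benchmark):
    """Return (model, score) for the best listed score on one benchmark."""
    listed = {
        model: row[benchmark]
        for model, row in scores.items()
        if row[benchmark] is not None
    }
    best = max(listed, key=listed.get)
    return best, listed[best]


for benchmark in BENCHMARKS:
    best_model, best_score = leader(benchmark)
    gap = best_score - scores["ERNIE-Image (w/ PE)"][benchmark]
    print(f"{benchmark}: {best_model} leads at {best_score:.4f}; "
          f"ERNIE-Image (w/ PE) trails by {gap:.4f}")
```

Note that the GenEval leader is ERNIE-Image itself with the enhancer turned off, so on that benchmark the gap measures the cost of enabling PE rather than a gap to a competitor.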

How to Run It Today

Diffusers is the quickest path. Install the latest version from GitHub, load either baidu/ERNIE-Image or baidu/ERNIE-Image-Turbo, and you’re generating in a few lines of Python. Set use_pe=True to enable the Prompt Enhancer. The recommended resolution to start with is 1024×1024.
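As a rough sketch of what those few lines might look like: the model IDs, use_pe, and the 1024×1024 starting resolution come from this article, and DiffusionPipeline.from_pretrained is the generic Diffusers loader. Everything else is an assumption, including whether the ERNIE-Image pipeline accepts the usual num_inference_steps, guidance_scale, height, and width keywords, so check the model card before relying on it. build_call_kwargs and generate are hypothetical helper names.

```python
def build_call_kwargs(prompt, steps=50, guidance_scale=4.0, use_pe=True, size=1024):
    """Collect the arguments for one generation call (keyword names are assumed)."""
    return {
        "prompt": prompt,
        "num_inference_steps": steps,
        "guidance_scale": guidance_scale,
        "height": size,
        "width": size,
        "use_pe": use_pe,  # Prompt Enhancer toggle named in the article
    }


def generate(model_id, prompt, **overrides):
    """Load the pipeline and render one image. Needs a GPU and the model weights."""
    # Imported lazily so build_call_kwargs works without torch/diffusers installed.
    import torch
    from diffusers import DiffusionPipeline

    pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
    pipe.to("cuda")
    result = pipe(**build_call_kwargs(prompt, **overrides))
    return result.images[0]


# Settings for a Turbo run: 8 steps, guidance scale 1.0
print(build_call_kwargs("A concert poster with a bold headline", steps=8, guidance_scale=1.0))

# Example call (downloads the weights, so it is left commented out):
# image = generate("baidu/ERNIE-Image-Turbo", "A concert poster with a bold headline",
#                  steps=8, guidance_scale=1.0)
# image.save("poster.png")
```

Passing use_pe=False through the overrides is the quickest way to compare a raw prompt against the enhanced one.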

If you need a server setup, SGLang support is built in. Start the server with your model path and send generation requests via a simple curl command. You can also deploy the Prompt Enhancer separately using vLLM if you want faster PE inference independent of the DiT model, useful if you’re running this in a production pipeline.

Beyond that the ecosystem is already there. ComfyUI supports ERNIE-Image in its latest version with a workflow template available. Unsloth supports building GGUF weights if you want a lighter deployment path. AI-Toolkit supports fine-tuning if you want to adapt it to a specific style or domain.

What It Can’t Do

No doubt it’s a strong AI model, but it still has limits. The biggest one is control. The Prompt Enhancer helps weaker prompts, but sometimes it rewrites your idea a bit too much and adds details you never asked for. So while the image may look better, it may not match your vision exactly unless you turn the enhancer off.

Text rendering is better than most open models, though it still slips on dense layouts, unusual fonts, or long poster-style text. You’ll still need a few reruns for anything production ready.

It also isn’t perfect with detailed spatial instructions: things like exact object placement, character consistency across multiple images, or highly detailed scene relationships can still go wrong. And while 24GB of VRAM is reasonable for this level of quality, it still puts the model out of reach for a lot of everyday users.

So yes, it’s excellent for an 8B model, but it still needs prompt tuning and a couple of retries when precision really matters.

Where This Actually Fits

ERNIE-Image is a practical model. If your use case requires clean outputs, readable text, posters, and structured layouts, then this makes a lot of sense. The Prompt Enhancer alone lowers the effort needed to get decent results.

But if you’re looking for perfect realism, absolute control, or consistency across generations, you’ll still hit limits. I keep coming back to this: it’s not the most powerful model out there, but it’s one of the easier ones to actually use.
