AsymFlow Claims More Realistic AI Images by Moving Beyond Latent Diffusion

At some point the field quietly agreed that pixel space was too hard and moved on.

Stable Diffusion, FLUX, every serious text-to-image model you’ve used in the last three years works in latent space. Instead of generating actual pixels directly, these models compress images into a smaller mathematical representation, do all the expensive work there, then decompress back to pixels at the end. It’s faster, it’s cheaper to train, and it made the current generation of image models possible.

The cost is subtle but noticeable. That compression step loses information. Fine textures, sharp edges, precise details: the things that live at the pixel level get smoothed over in ways latent models can never fully recover, because by the time they're generating, those details are already gone.
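To make the information-loss point concrete, here is a toy sketch. It is not a real VAE. It assumes plain 8x average pooling as the "encoder" and nearest-neighbour upsampling as the "decoder", which is enough to show how pixel-level texture can vanish entirely once an image is compressed.

```python
import numpy as np

def downsample(img: np.ndarray, f: int) -> np.ndarray:
    """Average-pool an (H, W) image by factor f (toy stand-in for an encoder)."""
    h, w = img.shape
    return img.reshape(h // f, f, w // f, f).mean(axis=(1, 3))

def upsample(lat: np.ndarray, f: int) -> np.ndarray:
    """Nearest-neighbour upsampling (toy stand-in for a decoder)."""
    return np.repeat(np.repeat(lat, f, axis=0), f, axis=1)

# A 16x16 pixel-level checkerboard: the finest texture an image can hold.
y, x = np.indices((16, 16))
texture = np.where((x + y) % 2 == 0, 1.0, -1.0)

recon = upsample(downsample(texture, 8), 8)
mse = float(np.mean((texture - recon) ** 2))
print(mse)  # 1.0: every 8x8 block averages to 0, so the texture is gone
```

Learned latent autoencoders are far better than average pooling, but the same principle holds: whatever the compressed representation cannot encode, the decoder cannot bring back.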

Researchers at Stanford just published a way around this. AsymFlow doesn’t ask you to abandon your latent model or train a pixel model from scratch. It takes what you already have and converts it. And the result beats the latent model it started from.

The asymmetric trick that changes the math

Standard flow models predict velocity: essentially the direction and speed the model should move in to get from noise toward a clean image. The problem in pixel space is that predicting velocity means predicting both the data term and the noise term at full pixel resolution simultaneously. That's an enormous amount of work for a transformer, most of which is spent modeling high-dimensional noise that doesn't carry much useful information anyway.
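A minimal sketch of the velocity target, assuming the common rectified-flow convention in which a sample moves in a straight line from data (t = 0) to noise (t = 1). The article doesn't spell out AsymFlow's exact parameterization, so treat the shapes and conventions here as illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.standard_normal((3, 8, 8))   # stand-in for a clean image (C, H, W)
eps = rng.standard_normal((3, 8, 8))  # Gaussian noise with the same shape

def interpolate(x0, eps, t):
    """Point on the straight noise-to-data path at time t (t=0 data, t=1 noise)."""
    return (1.0 - t) * x0 + t * eps

def velocity_target(x0, eps):
    """d x_t / d t along that path: the network's regression target."""
    return eps - x0

# With the exact velocity, one Euler step from t=1 to t=0 lands on the data.
t, dt = 1.0, 1.0
x_t = interpolate(x0, eps, t)
x_prev = x_t - dt * velocity_target(x0, eps)
print(np.allclose(x_prev, x0))  # True
```

Both terms of the target eps - x0 have the full shape of the image, which is exactly the cost described in the paragraph above.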

AsymFlow splits that prediction asymmetrically. The data term stays full-dimensional because that’s where the actual image lives. The noise term gets restricted to a low-rank subspace, a mathematically smaller representation that captures the essential noise structure without the computational overhead of full pixel prediction. From those two asymmetric predictions, the full velocity gets recovered analytically without changing the network architecture or the training procedure.
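One way the asymmetric pieces could fit back together is sketched below. This is an illustration, not the paper's formula. It assumes the same straight-line path as standard rectified flow, an orthonormal basis U for the low-rank noise subspace, and that the noise component outside that subspace is inferred analytically from x_t and the data prediction. The function name recover_velocity is hypothetical.

```python
import numpy as np

def recover_velocity(x_t, x_hat, z_hat, U, t):
    """Hypothetical recombination of asymmetric predictions into a velocity.

    Assumes x_t = (1 - t) * x + t * eps with t in (0, 1], a full-dimensional
    data prediction x_hat, and a noise prediction restricted to span(U),
    given as r coefficients z_hat (U is d x r with orthonormal columns).
    """
    eps_in = U @ z_hat                                # noise inside the low-rank subspace
    eps_implied = (x_t - (1.0 - t) * x_hat) / t       # noise implied by x_t and x_hat
    eps_out = eps_implied - U @ (U.T @ eps_implied)   # keep only the part outside span(U)
    return eps_in + eps_out - x_hat                   # v = eps - x

rng = np.random.default_rng(0)
d, r, t = 64, 8, 0.7
U, _ = np.linalg.qr(rng.standard_normal((d, r)))  # random orthonormal basis (assumption)
x = rng.standard_normal(d)
eps = rng.standard_normal(d)
x_t = (1.0 - t) * x + t * eps

# Perfect predictions should reproduce the true velocity eps - x exactly.
v = recover_velocity(x_t, x, U.T @ eps, U, t)
print(np.allclose(v, eps - x))  # True
```

In this sketch the network only has to output one full-size data prediction plus r noise coefficients, rather than two full-size tensors; everything else is closed-form algebra.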

The practical result is a model that does meaningful work in pixel space without paying the full computational cost that made pixel generation impractical in the first place. Think of it as finding the part of noise prediction that actually matters and ignoring the rest.

On ImageNet 256×256, this approach hits 1.57 FID, the best result among pixel diffusion models in the DiT and JiT family by a clear margin.
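For readers unfamiliar with FID: it fits a Gaussian to feature vectors of real and generated images and measures the Fréchet distance between the two, ||mu_a - mu_b||^2 + Tr(S_a + S_b - 2(S_a S_b)^(1/2)), so lower is better. The sketch below implements that formula in NumPy. Standard FID uses Inception-v3 features over tens of thousands of samples; the random features here are only a stand-in.

```python
import numpy as np

def _sqrtm_psd(m):
    """Matrix square root of a symmetric positive semi-definite matrix."""
    m = 0.5 * (m + m.T)  # symmetrise away floating-point drift
    w, v = np.linalg.eigh(m)
    return (v * np.sqrt(np.clip(w, 0.0, None))) @ v.T

def fid(feats_a, feats_b):
    """Frechet distance between Gaussians fitted to two (N, D) feature sets."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    s_a = np.cov(feats_a, rowvar=False)
    s_b = np.cov(feats_b, rowvar=False)
    root_a = _sqrtm_psd(s_a)
    # Tr((S_a S_b)^(1/2)) equals Tr((S_a^(1/2) S_b S_a^(1/2))^(1/2)), a symmetric form.
    cross = _sqrtm_psd(root_a @ s_b @ root_a)
    return float(np.sum((mu_a - mu_b) ** 2) + np.trace(s_a + s_b - 2.0 * cross))

rng = np.random.default_rng(0)
real = rng.standard_normal((2000, 16))  # stand-in features, not Inception outputs
print(fid(real, real))        # close to 0 for identical feature sets
print(fid(real, real + 0.5))  # close to 4.0 = 16 dims * 0.5**2 (mean shift only)
```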

Surpassing FLUX.2 klein on its own benchmarks

Sample AsymFlow image generations (via the AsymFlow GitHub repo)

Finetuned from FLUX.2 klein 9B, AsymFLUX.2 klein is the pixel-space version of a model that already has serious capabilities. The finetuning works because AsymFlow aligns the latent space mathematically to a low-rank pixel subspace before training starts. The pixel model begins with the latent model’s full understanding of text, composition, and structure already intact. Finetuning then corrects the low-level detail that latent compression lost.
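The article doesn't define the "low-rank pixel subspace" precisely. Below is a minimal sketch of one plausible reading, assuming the subspace is spanned by the top principal directions of flattened pixel patches. The patch size of 8 and rank of 16 are hypothetical choices, not AsymFlow's settings.

```python
import numpy as np

def patchify(img, p):
    """Split an (H, W, C) image into non-overlapping p x p patches, flattened."""
    h, w, c = img.shape
    x = img.reshape(h // p, p, w // p, p, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, p * p * c)

def pixel_subspace(patches, r):
    """Top-r principal directions of the patch vectors (D x r, orthonormal columns)."""
    centered = patches - patches.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:r].T

rng = np.random.default_rng(0)
img = rng.random((64, 64, 3))    # stand-in image; real use would pool many images
patches = patchify(img, 8)       # 64 patches of 8*8*3 = 192 values each
U = pixel_subspace(patches, 16)  # hypothetical rank-16 "pixel subspace"
coords = (patches - patches.mean(axis=0)) @ U  # per-patch coordinates in that subspace
```

A latent model whose per-patch latents could be mapped onto coordinates like these would, in principle, already "speak" a compressed pixel language, which is the intuition behind starting finetuning from the latent model.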

On HPSv3, which measures human preference for image quality and aesthetics, AsymFLUX.2 klein scores 10.66 against FLUX.2 klein base at 9.50. On DPG-Bench, which tests prompt adherence, it scores 86.8 against the base’s 85.2. On GenEval, 0.82 versus 0.80.
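The gaps are easier to judge side by side. This sketch just does the arithmetic on the scores reported above; percentages across benchmarks with different scales aren't directly comparable, so read them as rough context only.

```python
# Reported scores: (AsymFLUX.2 klein, FLUX.2 klein base)
SCORES = {
    "HPSv3": (10.66, 9.50),
    "DPG-Bench": (86.8, 85.2),
    "GenEval": (0.82, 0.80),
}

def gains(scores):
    """Absolute and relative (percent) gain of the finetuned model over its base."""
    return {
        name: (new - base, 100.0 * (new - base) / base)
        for name, (new, base) in scores.items()
    }

for name, (abs_gain, pct_gain) in gains(SCORES).items():
    print(f"{name}: +{abs_gain:.2f} ({pct_gain:.1f}% over base)")
```

Run as-is, this prints gains of +1.16 (12.2%) on HPSv3, +1.60 (1.9%) on DPG-Bench and +0.02 (2.5%) on GenEval.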

Those aren't huge gaps, but the direction matters. A pixel model finetuned from a latent base is beating that latent base on its own evaluation benchmarks. The detail and texture improvements you'd expect from pixel-space generation are showing up in the scores.

For context, FLUX.1 dev, a much larger and more established model, sits at 10.43 on HPSv3. AsymFLUX.2 klein is above that.

What this means if you already have a latent model

The latent-to-pixel finetuning pathway AsymFlow introduces isn't specific to FLUX.2 klein. The approach works by aligning any latent model's compressed representation to a pixel subspace through a mathematical operation called Procrustes alignment. Once that initialization is done, the pixel model starts from a point where it already understands what it's supposed to generate; it just needs to learn to generate it at full resolution.
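Orthogonal Procrustes is a textbook problem: find the rotation R that best maps one set of vectors onto another, minimizing ||A R - B||_F. Its closed-form solution comes from a single SVD. The sketch below shows that generic version on synthetic data. The article doesn't say how AsymFlow sets up A and B (or whether it adds scaling or rank reduction), so the latent and pixel-coordinate arrays here are hypothetical stand-ins. SciPy ships the same solver as scipy.linalg.orthogonal_procrustes.

```python
import numpy as np

def orthogonal_procrustes(a, b):
    """Orthogonal R minimising ||a @ R - b||_F (classic SVD solution)."""
    u, _, vt = np.linalg.svd(a.T @ b)
    return u @ vt

rng = np.random.default_rng(0)
n, k = 500, 16
latents = rng.standard_normal((n, k))             # hypothetical latent vectors, one per patch
q, _ = np.linalg.qr(rng.standard_normal((k, k)))  # hidden "true" rotation
pixel_coords = latents @ q + 0.01 * rng.standard_normal((n, k))  # rotated copy plus slight noise

R = orthogonal_procrustes(latents, pixel_coords)
aligned = latents @ R  # latents expressed in the pixel-subspace coordinates
print(np.abs(R - q).max())  # small: the hidden rotation is recovered
```

Because R is orthogonal, the alignment rotates the latent coordinates without stretching them, so distances between latent vectors are preserved.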

That means every serious latent model that exists today is potentially a starting point for a pixel model. The expensive part, learning text-to-image generation at scale, is already done. What remains is the finetuning, which is significantly cheaper than training from scratch.

Stanford released the code, the model weights for AsymFLUX.2 klein on Hugging Face, and a Gradio demo. One important detail before you build anything on top of this: AsymFLUX.2 klein inherits the FLUX Non-Commercial License, which means it’s free for research, personal projects, and non-production experimentation but not for commercial use. If you need it in a product, you’d need a separate commercial license from Black Forest Labs.

How to try it

The fastest path is the Hugging Face demo space, which runs AsymFLUX.2 klein without any local setup. For local use, the repo provides a Diffusers-style pipeline: load the FLUX.2 klein base, attach the AsymFlow adapter, and generate directly in pixel space. The setup follows standard Diffusers conventions, so if you've run FLUX locally before, this won't feel unfamiliar.

Training your own version requires more infrastructure (8 GPUs for the ImageNet experiments), and the data-preparation instructions for text-to-image finetuning aren't fully published yet. For most people right now, this is a model to evaluate and experiment with, not a training recipe to reproduce immediately.

Where it still has limits

AsymFLUX.2 klein is impressive on quality benchmarks and genuinely produces sharper, more detailed output than its latent base in qualitative comparisons. What it doesn’t do is dominate every category.

On GenEval it scores 0.82, against 0.86 for Qwen-Image. On raw prompt adherence for complex compositional tasks, larger dedicated models still have an edge. The finetuning corrects detail and texture well; it's less clear how much it improves harder reasoning-based generation tasks.

The setup also still requires a capable GPU. This isn’t a consumer laptop situation. And with ComfyUI support not yet available, the workflow options are more limited than what most practitioners are used to with FLUX-based models.

The research contribution here is more durable than any single benchmark result. A viable pathway from latent to pixel generation without retraining from scratch is a meaningful addition to what the field can do. The model itself is a solid first demonstration of that pathway working in practice.