
ByteDance Open-Sourced a 3B Model for Images, Video, Editing, and Reasoning


Most multimodal AI systems today are still collections of separate tools pretending to be one product.

One model generates images. Another edits them. A different one handles video. The entire stack works, but it often feels stitched together behind the scenes.

ByteDance is taking a different approach. The company has released Lance, a new open multimodal model that tries to handle image generation, video generation, editing, and visual reasoning inside one native framework. The surprising part is not just the scope. It is the size. Lance runs with only 3 billion active parameters while still posting competitive numbers across image, video, and editing benchmarks.

The industry has spent the last two years building specialized AI systems for every separate media task imaginable. Lance is part of a growing push in the opposite direction: fewer models, more unified behavior, and systems that can move between understanding and generation.

The problem with multimodal models

A lot of multimodal AI products still work like several different systems hiding behind one interface.

If you want the AI to explain what happened in a video or answer questions about it, that usually becomes a separate pipeline entirely.

The result is that many multimodal products are really collections of specialized models passing information back and forth behind the scenes.

That setup works, but it becomes complex the moment you try building longer AI workflows. Context gets lost between systems. Outputs become inconsistent. One model may generate something another model struggles to understand later.

ByteDance is trying to simplify that with Lance. Instead of separating generation, editing, and reasoning into different stacks, the company trained one framework to handle all of them together. The same model can generate images, create video, edit both, and answer questions about visual content.
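The difference between the two designs can be sketched in a few lines of Python. This is a toy illustration only: `pipeline_qa_context`, `UnifiedSession`, and the clip identifiers are hypothetical names invented here, not part of Lance or any ByteDance API, and no real model is called. It only shows which information reaches the final question-answering step in each setup.

```python
from dataclasses import dataclass, field


def pipeline_qa_context(prompt: str, edit: str, question: str) -> list[str]:
    """Chained tools: return what the final QA stage actually receives."""
    # Hypothetical generator stage: it sees the prompt, emits an opaque artifact.
    generated = {"artifact": "clip-v1", "seen": [prompt]}
    # Hypothetical editor stage: it sees the artifact and the edit instruction.
    edited = {"artifact": "clip-v2", "seen": [generated["artifact"], edit]}
    # Only the artifact is handed forward; the prompt and edit text are not.
    return [edited["artifact"], question]


@dataclass
class UnifiedSession:
    """One model with one running context shared by every task (assumed design)."""
    history: list[str] = field(default_factory=list)

    def step(self, task: str, payload: str) -> list[str]:
        # Record the request, then return everything this step can condition on.
        self.history.append(f"{task}: {payload}")
        return list(self.history)


if __name__ == "__main__":
    print(pipeline_qa_context("a cat on a table", "make it night", "what changed?"))

    session = UnifiedSession()
    session.step("generate", "a cat on a table")
    session.step("edit", "make it night")
    print(session.step("ask", "what changed?"))
```

In the chained version, the question-answering stage receives only the final artifact and the question. In the unified version, the same question arrives with the original prompt and the edit instruction still in context.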

AI companies are slowly moving toward agents and autonomous workflows instead of single prompts. A system that can create and understand visual content inside the same model is much easier to plug into those workflows than a chain of disconnected tools.

What Lance actually is

Lance is a native multimodal model from ByteDance.

The model supports text-to-image generation, text-to-video generation, image editing, video editing, image understanding, and video understanding inside one framework. ByteDance says the model was trained from scratch using a staged multi-task setup.

Lance runs with 3B active parameters, which is relatively small compared to many recent multimodal systems pushing into video generation. Despite that, ByteDance is positioning it directly against larger unified models like BAGEL, TUNA, and InternVL-U across image generation, editing, and video benchmarks.


Some demos show standard text-to-video clips, others lean into multi-turn editing and visual reasoning. In one example, the model edits a video while preserving consistency across multiple changes.

Question: How many times did the person launch objects on the table? Options: (A) 3 (B) 2 (C) 4
Response: (A) 3

In another, it answers questions about object movement and repeated actions inside short clips. The model can also describe images, read charts, recognize license plates, and handle basic visual reasoning tasks.

That combination is really the point of Lance. ByteDance is not treating generation and understanding as separate products anymore. It wants one system moving between both naturally.


What the Benchmarks Show

Benchmark | Lance (3B) | Notable comparison
VBench (video generation) | 85.11 | Higher than Wan 2.1 (83.69) and Kling (81.85)
GenEval (image generation) | 0.90 overall | Comparable to larger unified models like TUNA
GEdit-Bench (image editing) | 7.30 | Competitive with much larger editing systems
Tasks supported | 6 | Image/video generation, editing, and understanding

Benchmark tables usually turn into noise after a while, especially in multimodal AI where every company picks slightly different tests to highlight. But a few numbers here are worth paying attention to.

Lance performs unusually well for a 3B model in video generation. On VBench, ByteDance reports a score of 85.11, higher than several much larger generation-focused systems including Wan 2.1, Hunyuan Video, CogVideoX, and Kling. That does not automatically mean the videos look better in every scenario, but it does suggest the model is competitive far outside its weight class.
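To put the size of that lead in perspective, here is a small sketch that computes the margin using only the VBench figures quoted in this article (85.11 for Lance, 83.69 for Wan 2.1, 81.85 for Kling). The variable names are illustrative, and scores for Hunyuan Video and CogVideoX are left out because the article does not give them.

```python
# VBench scores as reported in this article; no other systems are included
# because their numbers are not quoted here.
vbench_scores = {
    "Lance (3B)": 85.11,
    "Wan 2.1": 83.69,
    "Kling": 81.85,
}

lance_score = vbench_scores["Lance (3B)"]

# Margin of Lance over each other system, in VBench points.
margins = {
    name: round(lance_score - score, 2)
    for name, score in vbench_scores.items()
    if name != "Lance (3B)"
}

if __name__ == "__main__":
    for name, margin in margins.items():
        print(f"Lance leads {name} by {margin:.2f} points")
```

The gaps work out to 1.42 points over Wan 2.1 and 3.26 points over Kling. That is a lead, but a narrow one, which is another reason not to read the ranking as a guarantee of visibly better video.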


The image generation scores are also solid. Lance matches or beats several unified multimodal models on GenEval while staying much smaller than systems like Qwen-Image. The editing benchmarks are probably the most important part though, because editing is where unified models usually start breaking down. Lance stays surprisingly competitive there too.

That said, some of the reasoning examples still feel uneven. A few outputs look genuinely impressive, while others slide into the kind of overconfident explanations multimodal models still struggle with.

Question: How does a total solar eclipse look like from Earth?
Response: A total solar eclipse looks like a dark circle with a bright, white outer edge. This is a result of the Earth’s rotation and the Earth’s shadow, which partially obscures the sun’s light. The bright white edge of the eclipse is caused by the sun’s high energy and its interaction with the Earth’s atmosphere, while the dark part of the eclipse is due to the Earth’s shadow and the surrounding air currents. The solar eclipse’s shape, with its bright white edge and dark center, is similar to the shape of a full moon or a dark disk. It is a natural phenomenon that occurs in the atmosphere of the Earth and is an important part of the solar system.

The solar eclipse example in the demo section reads like the model partially understands the image while also inventing physics along the way.

Oddly, that makes Lance feel more believable. The model is ambitious, but it is still very obviously a first-generation attempt at a unified system.

The ByteDance context

Lance also makes more sense once you remember which company built it.

ByteDance has become one of the strongest video AI labs in the industry. TikTok gave the company an enormous understanding of short-form video, editing patterns, motion, pacing, captions, transitions, and how people actually consume visual content online.

That does not automatically translate into frontier AI models, but it probably explains why ByteDance keeps showing up in video generation conversations alongside companies that devote far more attention to branding their research.

You can see traces of that focus in Lance itself. The model is not just generating isolated clips. A lot of the demos revolve around editing consistency, object tracking, motion understanding, and multi-turn visual changes. Those are the kinds of problems that become important when AI systems move beyond making pretty images and start handling real media workflows.

The other interesting part is that ByteDance released Lance under Apache 2.0. That means developers can actually run, modify, and build on top of it.

The catch is that running it still is not lightweight in practice. ByteDance recommends GPUs with at least 40GB of VRAM for inference, so despite the 3B model headline, this is still aimed more at researchers and serious developers than casual local users.
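A rough back-of-the-envelope calculation shows how far apart the parameter count and the memory recommendation are. The sketch below assumes the 3B figure is the number of weights that must sit in memory, which may understate things, since "active" parameters can be fewer than total parameters. It also counts weights only and ignores activations, video latents, and any auxiliary encoders or decoders. The function name and the precision table are illustrative.

```python
# Figures from the article. Assumption: 3B "active" parameters is treated as
# the full weight count; the real total may be higher.
ACTIVE_PARAMS = 3_000_000_000
RECOMMENDED_VRAM_GB = 40

# Standard storage sizes for common numeric formats, in bytes per parameter.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "int8": 1}


def weight_memory_gb(params: int, bytes_per_param: int) -> float:
    """Memory needed for the weights alone, in decimal gigabytes."""
    return params * bytes_per_param / 1e9


if __name__ == "__main__":
    for dtype, nbytes in BYTES_PER_PARAM.items():
        gb = weight_memory_gb(ACTIVE_PARAMS, nbytes)
        print(f"{dtype}: about {gb:.1f} GB of weights "
              f"against a {RECOMMENDED_VRAM_GB} GB recommendation")
```

At bf16 the weights alone come to roughly 6 GB, well under the recommended 40 GB. Under these assumptions, most of the memory budget would be going to something other than the headline parameters, plausibly the working memory needed to generate and edit video, though the article does not break this down.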

Limitations

Lance is impressive, but it is not some all-in-one system that suddenly solves multimodal AI.

Some of the reasoning outputs still slip into hallucination territory, especially on more abstract visual questions. A few generations also look less polished than what larger specialized image or video models can produce today. That tradeoff is probably unavoidable right now. Unified models gain flexibility, but they usually give up some peak performance compared to systems trained for one narrow task.

There is also the hardware reality. ByteDance may call this lightweight compared to giant frontier systems, but the recommended setup still requires serious GPU memory. Most people are not casually running this on a gaming laptop.

Still, Lance feels important for where AI systems are heading. The industry spent the last few years building separate tools for generation, editing, and understanding. Models like this are trying to collapse those boundaries into one system that can move between tasks more naturally.

Whether Lance itself becomes widely adopted is harder to predict. But the direction behind it is becoming pretty clear. AI labs are no longer just chasing the best image model or the best video model. They want systems that can handle the entire workflow without switching brains every few seconds.
