Most multimodal AI systems today are still collections of separate tools pretending to be one product.
One model generates images. Another edits them. A different one handles video. The entire stack works, but it often feels stitched together behind the scenes.
ByteDance is taking a different approach. The company has released Lance, a new open multimodal model that tries to handle image generation, video generation, editing, and visual reasoning inside one native framework. The surprising part is not just the scope. It is the size. Lance runs with only 3 billion active parameters while still posting competitive numbers across image, video, and editing benchmarks.
The industry has spent the last two years building specialized AI systems for every separate media task imaginable. Lance is part of a growing push in the opposite direction: fewer models, more unified behavior, and systems that can move between understanding and generation.
The problem with multimodal models
A lot of multimodal AI products still work like several different systems hiding behind one interface.
If you want the AI to explain what happened in a video or answer questions about it, that usually becomes a separate pipeline entirely.
The result is that many multimodal products are really collections of specialized models passing information back and forth behind the scenes.
That setup works, but it becomes complex the moment you try building longer AI workflows. Context gets lost between systems. Outputs become inconsistent. One model may generate something another model struggles to understand later.
ByteDance is trying to simplify that with Lance. Instead of separating generation, editing, and reasoning into different stacks, the company trained one framework to handle all of them together. The same model can generate images, create video, edit both, and answer questions about visual content.
AI companies are slowly moving toward agents and autonomous workflows instead of single prompts. A system that can create and understand visual content inside the same model is much easier to plug into those workflows than a chain of disconnected tools.
What Lance actually is
Lance is a native multimodal model from ByteDance.
The model supports text-to-image generation, text-to-video generation, image editing, video editing, image understanding, and video understanding inside one framework. ByteDance says the model was trained from scratch using a staged multi-task setup.
Lance runs with 3B active parameters, which is relatively small compared to many recent multimodal systems pushing into video generation. Despite that, ByteDance is positioning it directly against larger unified models like BAGEL, TUNA, and InternVL-U across image generation, editing, and video benchmarks.
Some demos show standard text-to-video clips, while others lean into multi-turn editing and visual reasoning. In one example, the model edits a video while preserving consistency across multiple changes.

In another, it answers questions about object movement and repeated actions inside short clips. The model can also describe images, read charts, recognize license plates, and handle basic visual reasoning tasks.
That combination is really the point of Lance. ByteDance is not treating generation and understanding as separate products anymore. It wants one system moving between both naturally.
What the Benchmarks Show
| Benchmark | Lance (3B) | Notable Comparison |
|---|---|---|
| VBench (video generation) | 85.11 | Higher than Wan 2.1 (83.69) and Kling (81.85) |
| GenEval (image generation) | 0.90 overall | Comparable to larger unified models like TUNA |
| GEdit-Bench (image editing) | 7.30 | Competitive with much larger editing systems |
| Tasks supported | 6 | Image/video generation, editing, and understanding |
Benchmark tables usually turn into noise after a while, especially in multimodal AI where every company picks slightly different tests to highlight. But a few numbers here are worth paying attention to.
Lance performs unusually well for a 3B model in video generation. On VBench, ByteDance reports a score of 85.11, higher than several much larger generation-focused systems including Wan 2.1, Hunyuan Video, CogVideoX, and Kling. That does not automatically mean the videos look better in every scenario, but it does suggest the model is competitive well above its weight class.
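To put the VBench gap in concrete terms, here is a small Python sketch that computes Lance's margin over the two baselines whose scores appear in the table above. The scores are the ByteDance-reported figures quoted in this article, not independent measurements, and the dictionary and function names are purely illustrative.

```python
# VBench scores as quoted in this article (reported by ByteDance, not re-measured here).
VBENCH_SCORES = {"Lance": 85.11, "Wan 2.1": 83.69, "Kling": 81.85}


def margin_over(baseline: str, model: str = "Lance") -> float:
    """Return how many VBench points `model` leads `baseline` by, rounded to 2 places."""
    return round(VBENCH_SCORES[model] - VBENCH_SCORES[baseline], 2)


for name in ("Wan 2.1", "Kling"):
    print(f"Lance leads {name} by {margin_over(name):.2f} points")
```

The margins work out to 1.42 points over Wan 2.1 and 3.26 points over Kling. Gaps of that size matter on a leaderboard, but they say little about how any individual clip looks.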

The image generation scores are also solid. Lance matches or beats several unified multimodal models on GenEval while staying much smaller than systems like Qwen-Image. The editing benchmarks are probably the most important part, though, because editing is where unified models usually start breaking down. Lance stays surprisingly competitive there too.
That said, some of the reasoning examples still feel uneven. A few outputs look genuinely impressive, while others slip into the kind of overconfident explanations multimodal models still struggle with.

Response: A total solar eclipse looks like a dark circle with a bright, white outer edge. This is a result of the Earth’s rotation and the Earth’s shadow, which partially obscures the sun’s light. The bright white edge of the eclipse is caused by the sun’s high energy and its interaction with the Earth’s atmosphere, while the dark part of the eclipse is due to the Earth’s shadow and the surrounding air currents. The solar eclipse’s shape, with its bright white edge and dark center, is similar to the shape of a full moon or a dark disk. It is a natural phenomenon that occurs in the atmosphere of the Earth and is an important part of the solar system.
The solar eclipse example in the demo section reads like the model partially understands the image while also inventing physics along the way.
If anything, that unevenness makes Lance feel more believable. The model is ambitious, but it is still very obviously a first-generation attempt at a unified system.
The ByteDance context
Lance also makes more sense once you remember which company built it.
ByteDance has been growing into one of the strongest video AI labs in the industry. TikTok gave the company an enormous understanding of short-form video, editing patterns, motion, pacing, captions, transitions, and how people actually consume visual content online.
That does not automatically translate into frontier AI models, but it probably explains why ByteDance keeps showing up in video generation conversations alongside companies that pay far more attention to branding their research.
You can see traces of that focus in Lance itself. The model is not just generating isolated clips. A lot of the demos revolve around editing consistency, object tracking, motion understanding, and multi-turn visual changes. Those are the kinds of problems that become important when AI systems move beyond making pretty images and start handling real media workflows.
The other interesting part is that ByteDance released Lance under Apache 2.0. That means developers can actually run, modify, and build on top of it.
The catch is that running it still is not lightweight in practice. ByteDance recommends GPUs with at least 40GB of VRAM for inference, so despite the 3B model headline, this is still aimed more at researchers and serious developers than casual local users.
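A rough back-of-the-envelope calculation shows why a 3B headline and a 40GB recommendation are not necessarily in conflict. The Python sketch below estimates memory for the weights alone at common precisions. It assumes exactly 3 billion parameters and standard byte widths per value. It is not ByteDance's own accounting, and it ignores everything else that occupies GPU memory during inference, such as any parameters beyond the active 3B, activations, and intermediate image or video data.

```python
# Bytes needed to store one parameter at each precision (standard widths).
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "int8": 1}

# Assumption: the 3B "active parameters" figure quoted in the article.
ACTIVE_PARAMS = 3e9


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Estimate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gb(ACTIVE_PARAMS, dtype):.1f} GB for weights only")
```

Even at full fp32 precision, 3 billion weights take about 12GB, and about 6GB at bf16. Most of the recommended 40GB is therefore presumably going to things other than the active weights. That is an inference from the arithmetic, not a breakdown ByteDance has published.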
Limitations
Lance is impressive, but it is not some all-in-one system that suddenly solves multimodal AI.
Some of the reasoning outputs still slip into hallucination territory, especially on more abstract visual questions. A few generations also look less polished than what larger specialized image or video models can produce today. That tradeoff is probably unavoidable right now. Unified models gain flexibility, but they usually give up some peak performance compared to systems trained for one narrow task.
There is also the hardware reality. ByteDance may call this lightweight compared to giant frontier systems, but the recommended setup still requires serious GPU memory. Most people are not casually running this on a gaming laptop.
Still, Lance feels important for where AI systems are heading. The industry spent the last few years building separate tools for generation, editing, and understanding. Models like this are trying to collapse those boundaries into one system that can move between tasks more naturally.
Whether Lance itself becomes widely adopted is harder to predict. But the direction behind it is becoming pretty clear. AI labs are no longer just chasing the best image model or the best video model. They want systems that can handle the entire workflow without switching brains every few seconds.




