
ByteDance Open-Sourced a 3B Model for Images, Video, Editing, and Reasoning


Most multimodal AI systems today are still collections of separate tools pretending to be one product.

One model generates images. Another edits them. A different one handles video. The entire stack works, but it often feels stitched together behind the scenes.

ByteDance is taking a different approach. The company has released Lance, an open multimodal model that tries to handle image generation, video generation, editing, and visual reasoning inside one native framework. The surprising part is not just the scope but the size: Lance runs with only 3 billion active parameters while still posting competitive numbers across image, video, and editing benchmarks.

The industry has spent the last two years building specialized AI systems for every separate media task imaginable. Lance is part of a growing push in the opposite direction: fewer models, more unified behavior, and systems that can move between understanding and generation.

The problem with multimodal models

A lot of multimodal AI products still work like several different systems hiding behind one interface.

If you want the AI to explain what happened in a video or answer questions about it, that usually becomes a separate pipeline entirely.

The result is that many multimodal products are really collections of specialized models passing information back and forth behind the scenes.

That setup works, but it becomes complex the moment you try building longer AI workflows. Context gets lost between systems. Outputs become inconsistent. One model may generate something another model struggles to understand later.

ByteDance is trying to simplify that with Lance. Instead of separating generation, editing, and reasoning into different stacks, the company trained one framework to handle all of them together. The same model can generate images, create video, edit both, and answer questions about visual content.
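The difference between a stitched pipeline and a unified model can be sketched in a few lines. This is a toy illustration only: `StitchedPipeline` and `UnifiedSession` are hypothetical names invented here, the "models" are string-building stubs, and none of it reflects Lance's actual API or internals. It only shows why context tends to get lost when each stage sees nothing but the previous stage's output.

```python
from dataclasses import dataclass, field


class StitchedPipeline:
    """Hypothetical chain of separate tools: each stage sees only the previous output."""

    def generate(self, prompt: str) -> str:
        return f"image({prompt})"

    def edit(self, image: str, instruction: str) -> str:
        # The editor receives the image, but not the conversation that produced it.
        return f"edited({image}, {instruction})"

    def describe(self, image: str) -> dict:
        # The understanding stage only knows about the final artifact.
        return {"input": image, "known_context": []}


@dataclass
class UnifiedSession:
    """Hypothetical single model: every task appends to one shared history."""

    history: list = field(default_factory=list)

    def run(self, task: str, payload: str) -> dict:
        self.history.append((task, payload))
        # Every later task can see all earlier prompts, edits, and questions.
        return {"task": task, "known_context": list(self.history)}


pipeline = StitchedPipeline()
final_image = pipeline.edit(pipeline.generate("a red kite"), "make it blue")
print(pipeline.describe(final_image)["known_context"])

session = UnifiedSession()
session.run("generate", "a red kite")
session.run("edit", "make it blue")
print(session.run("describe", "what color is the kite?")["known_context"])
```

In the stitched version, the final stage starts with an empty context. In the unified version, the same question arrives with the full history of what was generated and edited, which is the property the article argues makes such a model easier to plug into longer workflows.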

AI companies are slowly moving toward agents and autonomous workflows instead of single prompts. A system that can create and understand visual content inside the same model is much easier to plug into those workflows than a chain of disconnected tools.

What Lance actually is

Lance is a native multimodal model from ByteDance.

The model supports text-to-image generation, text-to-video generation, image editing, video editing, image understanding, and video understanding inside one framework. ByteDance says the model was trained from scratch using a staged multi-task setup.

Lance runs with 3B active parameters, which is relatively small compared to many recent multimodal systems pushing into video generation. Despite that, ByteDance is positioning it directly against larger unified models like BAGEL, TUNA, and InternVL-U across image generation, editing, and video benchmarks.

via Lance GitHub page

Some demos show standard text-to-video clips, while others lean into multi-turn editing and visual reasoning. In one example, the model edits a video while preserving consistency across multiple changes.

Question: How many times did the person launch objects on the table? Options: (A) 3 (B) 2 (C) 4
Response: (A) 3

In another, it answers questions about object movement and repeated actions inside short clips. The model can also describe images, read charts, recognize license plates, and handle basic visual reasoning tasks.

That combination is really the point of Lance. ByteDance is not treating generation and understanding as separate products anymore. It wants one system moving between both naturally.


What the benchmarks show

| Benchmark | Lance (3B) | Notable comparison |
| --- | --- | --- |
| VBench (video generation) | 85.11 | Higher than Wan 2.1 (83.69) and Kling (81.85) |
| GenEval (image generation) | 0.90 overall | Comparable to larger unified models like TUNA |
| GEdit-Bench (image editing) | 7.30 | Competitive with much larger editing systems |
| Tasks supported | 6 | Image/video generation, editing, and understanding |

Benchmark tables usually turn into noise after a while, especially in multimodal AI where every company picks slightly different tests to highlight. But a few numbers here are worth paying attention to.

Lance performs unusually well for a 3B model in video generation. On VBench, ByteDance reports a score of 85.11, higher than several much larger generation-focused systems including Wan 2.1, Hunyuan Video, CogVideoX, and Kling. That does not automatically mean the videos look better in every scenario, but it does suggest the model is competitive far outside its weight class.
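To put the VBench gap in concrete terms, here is a minimal sketch that computes Lance's margin over the two competitors whose scores are quoted in this article. Only those three numbers come from the source; the `margins` helper is a name made up for this example, and scores for Hunyuan Video and CogVideoX are left out because the article does not give them.

```python
# VBench scores as reported in this article (no other scores are assumed).
VBENCH = {"Lance (3B)": 85.11, "Wan 2.1": 83.69, "Kling": 81.85}


def margins(scores: dict, baseline: str) -> dict:
    """Return how many points the baseline leads each other entry by."""
    base = scores[baseline]
    return {
        name: round(base - score, 2)
        for name, score in scores.items()
        if name != baseline
    }


for name, gap in margins(VBENCH, "Lance (3B)").items():
    print(f"Lance leads {name} by {gap:.2f} points")
```

The leads work out to 1.42 points over Wan 2.1 and 3.26 over Kling. Those are real but modest gaps on an aggregate score, which is why they say more about competitiveness than about how any individual clip will look.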


The image generation scores are also solid. Lance matches or beats several unified multimodal models on GenEval while staying much smaller than systems like Qwen-Image. The editing benchmarks are probably the most important part, though, because editing is where unified models usually start breaking down. Lance stays surprisingly competitive there too.

That said, some of the reasoning examples still feel uneven. A few outputs look genuinely impressive, while others slip into the kind of overconfident explanations multimodal models still struggle with.

Question: How does a total solar eclipse look like from Earth?
Response: A total solar eclipse looks like a dark circle with a bright, white outer edge. This is a result of the Earth’s rotation and the Earth’s shadow, which partially obscures the sun’s light. The bright white edge of the eclipse is caused by the sun’s high energy and its interaction with the Earth’s atmosphere, while the dark part of the eclipse is due to the Earth’s shadow and the surrounding air currents. The solar eclipse’s shape, with its bright white edge and dark center, is similar to the shape of a full moon or a dark disk. It is a natural phenomenon that occurs in the atmosphere of the Earth and is an important part of the solar system.

The solar eclipse example in the demo section reads like the model partially understands the image while also inventing physics along the way.

Oddly, that makes Lance feel more believable. The model is ambitious, but it is still very obviously a first-generation attempt at a unified system.

The ByteDance context

Lance also makes more sense once you remember which company built it.

ByteDance has become one of the strongest video AI labs in the industry. TikTok gave the company an enormous understanding of short-form video, editing patterns, motion, pacing, captions, transitions, and how people actually consume visual content online.

That does not automatically translate into frontier AI models, but it probably explains why ByteDance keeps showing up in video generation conversations alongside companies that devote far more attention to branding their research.

You can see traces of that focus in Lance itself. The model is not just generating isolated clips. A lot of the demos revolve around editing consistency, object tracking, motion understanding, and multi-turn visual changes. Those are the kinds of problems that become important when AI systems move beyond making pretty images and start handling real media workflows.

The other interesting part is that ByteDance released Lance under Apache 2.0. That means developers can actually run, modify, and build on top of it.

The catch is that running it still is not lightweight in practice. ByteDance recommends GPUs with at least 40GB of VRAM for inference, so despite the 3B model headline, this is still aimed more at researchers and serious developers than casual local users.
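A quick back-of-the-envelope calculation shows why the 40GB figure stands out. The sketch below estimates the memory needed just to hold 3 billion parameters at common precisions. The bytes-per-parameter values are standard for those number formats, but which precision Lance actually uses is an assumption here, and `weight_memory_gb` is a helper invented for this example.

```python
ACTIVE_PARAMS = 3e9        # from the article: 3B active parameters
RECOMMENDED_VRAM_GB = 40   # from the article: ByteDance's recommended minimum


def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Memory needed to hold the weights alone, in GB (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Assumed precisions; the article does not say which one Lance uses.
for name, nbytes in [("fp32", 4), ("bf16/fp16", 2)]:
    gb = weight_memory_gb(ACTIVE_PARAMS, nbytes)
    share = gb / RECOMMENDED_VRAM_GB
    print(f"{name}: ~{gb:.0f} GB of weights, {share:.0%} of {RECOMMENDED_VRAM_GB} GB")
```

At 16-bit precision, 3 billion parameters come to roughly 6 GB, well under a fifth of the recommended 40GB. Two caveats apply: "active" parameters may be fewer than the total the model has to keep in memory, and video workloads typically need large amounts of working memory beyond the weights. Both are plausible readings of the gap, not something the source spells out.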

Limitations

Lance is impressive, but it is not some all-in-one system that suddenly solves multimodal AI.

Some of the reasoning outputs still slip into hallucination territory, especially on more abstract visual questions. A few generations also look less polished than what larger specialized image or video models can produce today. That tradeoff is probably unavoidable right now. Unified models gain flexibility, but they usually give up some peak performance compared to systems trained for one narrow task.

There is also the hardware reality. ByteDance may call this lightweight compared to giant frontier systems, but the recommended setup still requires serious GPU memory. Most people are not casually running this on a gaming laptop.

Still, Lance feels important for where AI systems are heading. The industry spent the last few years building separate tools for generation, editing, and understanding. Models like this are trying to collapse those boundaries into one system that can move between tasks more naturally.

Whether Lance itself becomes widely adopted is harder to predict. But the direction behind it is becoming pretty clear. AI labs are no longer just chasing the best image model or the best video model. They want systems that can handle the entire workflow without switching brains every few seconds.
