
VOID: Netflix’s open source AI removes objects and fixes the physics they break


Netflix has a visual effects budget most film studios would kill for. They do not release open source AI tools for fun. When they do ship something publicly, it is worth paying attention.

VOID is their latest release: Video Object and Interaction Deletion. Point at an object in a video and VOID removes it, along with everything that object was doing to the world around it.

That last part is where every other tool has failed for years. Remove a person carrying a stack of boxes and the boxes hang in mid-air. Remove a chair someone is sitting on and the person hovers. The physics of the scene breaks and the edit becomes unusable. Film editors have been cleaning this up by hand since video editing existed.

VOID does not just erase. It reasons about what should happen next. A vision language model looks at the scene first, identifies everything the removed object was physically affecting, and only then does the diffusion model generate what the world looks like without it. Remove the person, the boxes fall. Remove the chair, the person sits on the floor. The scene stays physically coherent.

The physics breakthrough

Most video removal tools work like a smart eraser. They look at the pixels around the removed object and fill the gap with something plausible. That works fine when the object is just sitting there. It falls apart the moment the object is doing something.

VOID approaches this differently. Before any inpainting happens, a vision language model reads the scene and asks a question most tools never bother with. What is this object actually affecting? A person carrying boxes affects the boxes. A ball mid-collision affects the trajectory of everything it is about to hit. A chair someone is sitting on affects where that person’s weight is going.

The answer to that question gets encoded into something called a quadmask. Four values, four regions. Zero marks what gets removed. 63 marks overlap regions. 127 marks everything causally affected by the removed object: the falling boxes, the displaced items, the changed trajectories. 255 marks what stays untouched. The diffusion model then generates a physically coherent version of the scene using that map as a guide.
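To make that concrete, here is a rough sketch of how such a four-value map could be assembled for a single frame. This is not VOID's actual code: the two boolean region masks are hypothetical stand-ins for the segmentation and vision-language-model outputs, and the overlap semantics follow the description above.

```python
import numpy as np

# Hypothetical per-frame region masks (H x W booleans), standing in for
# what segmentation plus the VLM's causal reasoning would produce.
H, W = 480, 720
removed = np.zeros((H, W), dtype=bool)     # the object to delete
affected = np.zeros((H, W), dtype=bool)    # regions it causally affects
removed[200:400, 100:220] = True           # e.g. the person being removed
affected[350:460, 180:420] = True          # e.g. boxes that should now fall

# Encode the four regions into one 8-bit map, as described above.
quadmask = np.full((H, W), 255, dtype=np.uint8)  # 255: untouched background
quadmask[affected] = 127                         # 127: causally affected
quadmask[removed] = 0                            # 0: removed entirely
quadmask[removed & affected] = 63                # 63: overlap of the two (assumption)

# One quadmask per frame would then condition the diffusion model's generation.
```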

That is the difference. Other tools guess what was behind the object. VOID reasons about what the scene should look like if the object was never there.

For most videos one pass is enough. For longer clips where object shapes start drifting over time, an optional second pass uses optical flow to warp the first pass output and stabilize shapes along the newly generated trajectories. It is not always necessary but it is there when you need it.
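The repo's exact second-pass implementation is not detailed here, but the core operation, warping one generated frame toward the next with dense optical flow, can be sketched with OpenCV. Treat this as an illustration of the technique, not VOID's code.

```python
import cv2
import numpy as np

def warp_prev_to_current(prev_frame: np.ndarray, cur_frame: np.ndarray) -> np.ndarray:
    """Warp a previously generated frame into the current frame's coordinates
    using dense optical flow, so shapes stay consistent across frames."""
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    cur_gray = cv2.cvtColor(cur_frame, cv2.COLOR_BGR2GRAY)

    # Backward flow: for each pixel of the current frame, where it sits
    # in the previous frame.
    flow = cv2.calcOpticalFlowFarneback(cur_gray, prev_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)

    # Build a sampling grid and offset it by the flow.
    h, w = cur_gray.shape
    grid_x, grid_y = np.meshgrid(np.arange(w, dtype=np.float32),
                                 np.arange(h, dtype=np.float32))
    map_x = grid_x + flow[..., 0]
    map_y = grid_y + flow[..., 1]

    # Sample the previous frame at those locations.
    return cv2.remap(prev_frame, map_x, map_y, cv2.INTER_LINEAR)
```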

What this unlocks for creators

Film editors have been doing this work by hand for decades. A continuity error, an unwanted extra walking through the background, a prop that should not be in the shot. Fixing any of these in post has meant hours of frame by frame rotoscoping. VOID cuts that down to a masking step and an inference run.

For YouTubers and independent filmmakers the value is more immediate. Professional object removal has lived behind expensive software and even more expensive VFX artists. A tool that understands physics and runs on a single A100 changes that calculation significantly.

The less obvious use case is video dataset generation. Researchers building training data for robotics or autonomous systems need clean counterfactual examples, videos showing what a scene looks like with and without specific objects. VOID generates those automatically with physically consistent outcomes.

Hardware requirements and how to run it

VOID requires a GPU with 40GB or more of VRAM. That means an A100 or equivalent. If you are on a consumer GPU, a 3090, a 4090, even a workstation card under 40GB, you cannot run this locally right now.

If you have access to the hardware, setup is straightforward. Clone the repo and run the included notebook; it handles everything: downloading the models, running inference on a sample video, and showing you the result. The CLI is available for production pipelines.

You will need two things from HuggingFace. The base CogVideoX-Fun model at around 10GB and the VOID checkpoints at 22.3GB total. Pass 1 is the core inpainting model. Pass 2 is optional, only needed for longer clips where shape consistency becomes an issue.
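If you would rather script the downloads than let the notebook handle them, a sketch with huggingface_hub would look roughly like this. The repo IDs below are placeholders, not confirmed names; substitute the ones listed in the VOID README.

```python
from huggingface_hub import snapshot_download

# Placeholder repo IDs -- replace with the IDs given in the VOID README.
BASE_MODEL_REPO = "example-org/CogVideoX-Fun"   # base CogVideoX-Fun model, ~10GB
VOID_REPO = "example-org/VOID"                  # VOID pass-1 / pass-2 checkpoints, ~22.3GB

base_path = snapshot_download(repo_id=BASE_MODEL_REPO, local_dir="models/cogvideox-fun")
void_path = snapshot_download(repo_id=VOID_REPO, local_dir="models/void")

print("Base model at:", base_path)
print("VOID checkpoints at:", void_path)
```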

For most people without access to an A100, the demo on the project page is the realistic option right now. Cloud GPU rentals on platforms like RunPod or Lambda Labs bring it within reach if you need to run it on real footage without owning the hardware.


When tech giants contribute to open source

Netflix did not have to release this. They have the infrastructure to keep VOID internal, use it for their own productions, and leave independent creators with the same expensive workarounds they have always had. They shipped it anyway. Apache 2.0, weights on HuggingFace, full training code on GitHub.

That matters beyond this specific tool. When a company with Netflix’s resources open-sources serious research, it sets a bar. Runway charges subscription fees for inferior object removal. VOID does something Runway’s tools cannot: it understands the physics of what it is removing, and it costs nothing to use if you have the hardware.

The 40GB VRAM requirement keeps it out of reach for most people today. That will change. Models get quantized, hardware gets cheaper, and someone will have a consumer-friendly wrapper running within months. The foundation Netflix shipped is serious enough that it will be worth revisiting when that happens.

If you are in video production or research right now and have access to an A100, there is no reason not to try it today. Everyone else, bookmark it and watch the GitHub.
