Netflix has a visual effects budget most film studios would kill for. They do not release open source AI tools for fun. When they do ship something publicly, it is worth paying attention.
VOID is their latest release: Video Object and Interaction Deletion. Point at an object in a video, and VOID removes it, along with everything that object was doing to the world around it.
That last part is where every other tool has failed for years. Remove a person carrying a stack of boxes and the boxes hang in mid air. Remove a chair someone is sitting on and the person hovers. The physics of the scene breaks and the edit becomes unusable. Film editors have been cleaning this up by hand since video editing existed.
VOID does not just erase. It reasons about what should happen next. A vision language model looks at the scene first, identifies everything the removed object was physically affecting, and only then does the diffusion model generate what the world looks like without it. Remove the person, the boxes fall. Remove the chair, the person sits on the floor. The scene stays physically coherent.
The physics breakthrough
Most video removal tools work like a smart eraser. They look at the pixels around the removed object and fill the gap with something plausible. That works fine when the object is just sitting there. It falls apart the moment the object is doing something.
VOID approaches this differently. Before any inpainting happens, a vision language model reads the scene and asks a question most tools never bother with. What is this object actually affecting? A person carrying boxes affects the boxes. A ball mid-collision affects the trajectory of everything it is about to hit. A chair someone is sitting on affects where that person’s weight is going.
The answer to that question gets encoded into something called a quadmask. Four values, four regions. 0 marks what gets removed. 63 marks overlap regions. 127 marks everything causally affected by the removed object: the falling boxes, the displaced items, the changed trajectories. 255 marks what stays untouched. The diffusion model then generates a physically coherent version of the scene using that map as a guide.
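Assembling such a mask from two boolean region masks could look roughly like this. A minimal numpy sketch: the four pixel values come from the description above, but the helper name and the way the regions are derived are assumptions for illustration, not VOID's actual code.

```python
import numpy as np

# The four quadmask values described above (assumed encoding).
REMOVED, OVERLAP, AFFECTED, UNTOUCHED = 0, 63, 127, 255

def build_quadmask(removed, affected, shape):
    """Compose a per-frame quadmask from two boolean region masks.

    removed  -- pixels of the object being deleted
    affected -- pixels the object was causally influencing
                (falling boxes, displaced items, changed trajectories)
    """
    mask = np.full(shape, UNTOUCHED, dtype=np.uint8)
    mask[affected] = AFFECTED
    mask[removed] = REMOVED
    mask[removed & affected] = OVERLAP
    return mask

# Toy 4x4 frame: object in the top-left, its effects just below it.
removed = np.zeros((4, 4), dtype=bool)
removed[0:2, 0:2] = True
affected = np.zeros((4, 4), dtype=bool)
affected[1:3, 0:2] = True

qm = build_quadmask(removed, affected, (4, 4))
# Row 0 is removed-only (0), row 1 is overlap (63),
# row 2 is affected-only (127), row 3 is untouched (255).
```

In VOID the "affected" region would come from the vision language model's reasoning about the scene; here it is just hand-drawn for the toy example.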
That is the difference. Other tools guess what was behind the object. VOID reasons about what the scene should look like if the object was never there.
For most videos one pass is enough. For longer clips where object shapes start drifting over time, an optional second pass uses optical flow to warp the first pass output and stabilize shapes along the newly generated trajectories. It is not always necessary but it is there when you need it.
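The core operation of that second pass, warping one frame along a dense flow field, can be illustrated with a few lines of numpy. This is a nearest-neighbor backward-warp sketch under my own assumptions, not VOID's implementation; in practice the flow would be estimated by an optical-flow model rather than given.

```python
import numpy as np

def warp_with_flow(frame, flow):
    """Backward-warp a frame along a dense flow field.

    frame -- (H, W) array, e.g. one channel of a pass-1 output frame
    flow  -- (H, W, 2) array of per-pixel (dy, dx) displacements
    """
    h, w = frame.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Each output pixel samples from where the flow says it came from,
    # clamped to the frame borders (nearest-neighbor, no interpolation).
    src_y = np.clip(np.round(ys + flow[..., 0]).astype(int), 0, h - 1)
    src_x = np.clip(np.round(xs + flow[..., 1]).astype(int), 0, w - 1)
    return frame[src_y, src_x]

# Shift a 3x3 frame one pixel left: every pixel samples from x+1.
frame = np.arange(9, dtype=float).reshape(3, 3)
flow = np.zeros((3, 3, 2))
flow[..., 1] = 1.0
warped = warp_with_flow(frame, flow)
```

A real pipeline would use bilinear sampling and blend the warped frame with the fresh generation to stabilize shapes, but the warping step itself is this simple at heart.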
What this unlocks for creators
Film editors have been doing this work by hand for decades. A continuity error, an unwanted extra walking through the background, a prop that should not be in the shot. Fixing any of these in post has meant hours of frame by frame rotoscoping. VOID cuts that down to a masking step and an inference run.
For YouTubers and independent filmmakers the value is more immediate. Professional object removal has lived behind expensive software and even more expensive VFX artists. A tool that understands physics and runs on a single A100 changes that calculation significantly.
The less obvious use case is video dataset generation. Researchers building training data for robotics or autonomous systems need clean counterfactual examples, videos showing what a scene looks like with and without specific objects. VOID generates those automatically with physically consistent outcomes.
Hardware requirements and how to run it
VOID requires a GPU with 40GB or more of VRAM. That means an A100 or equivalent. If you are on a consumer GPU, a 3090, a 4090, even a workstation card under 40GB, you cannot run this locally right now.
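A quick sanity check before attempting a local run might look like this. The 40GB figure is from the stated requirement; the helper function is mine, and the PyTorch call in the comment is one way to get the real number.

```python
def meets_vram_requirement(total_vram_bytes, required_gb=40):
    """Return True if the GPU has at least `required_gb` of VRAM."""
    return total_vram_bytes >= required_gb * 1024**3

# With PyTorch installed, the real value for GPU 0 would come from:
#   import torch
#   total = torch.cuda.get_device_properties(0).total_memory
a100 = 40 * 1024**3      # A100 40GB: passes
rtx4090 = 24 * 1024**3   # consumer card: does not
```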
If you have access to the hardware, setup is straightforward. Clone the repo and run the included notebook, and it handles everything: it downloads the models, runs inference on a sample video, and shows you the result. The CLI is available for production pipelines.
You will need two things from HuggingFace: the base CogVideoX-Fun model at around 10GB, and the VOID checkpoints at 22.3GB total. Pass 1 is the core inpainting model. Pass 2 is optional, only needed for longer clips where shape consistency becomes an issue.
For most people without access to an A100, the demo on the project page is the realistic option right now. Cloud GPU rentals on platforms like RunPod or Lambda Labs bring it within reach if you need to run it on real footage without owning the hardware.
When tech giants contribute to open source
Netflix did not have to release this. They have the infrastructure to keep VOID internal, use it for their own productions, and leave independent creators with the same expensive workarounds they have always had. They shipped it anyway. Apache 2.0, weights on HuggingFace, full training code on GitHub.
That matters beyond this specific tool. When a company with Netflix’s resources open sources serious research, it sets a bar. Runway charges subscription fees for inferior object removal. VOID does something Runway cannot: it understands the physics of what it is removing, and it costs nothing to use if you have the hardware.
The 40GB VRAM requirement keeps it out of reach for most people today. That will change. Models get quantized, hardware gets cheaper, and someone will have a consumer-friendly wrapper running within months. The foundation Netflix shipped is serious enough that it will be worth revisiting when that happens.
If you are in video production or research right now and have access to an A100, there is no reason not to try it today. Everyone else, bookmark it and watch the GitHub.