VOID: Video Object and Interaction Deletion

VOID is a video inpainting framework that removes objects from videos together with the physical interactions they cause, so the reconstructed scene stays physically plausible rather than merely having the object erased.
Authors: Saman Motamed (1, 2), William Harvey (1), Benjamin Klein (1), Luc Van Gool (2), Zhuoning Yuan (1), Ta-Ying Cheng (1)
Affiliations: (1) Netflix; (2) INSAIT, Sofia University "St. Kliment Ohridski"

VOID removes objects from videos along with all interactions they induce on the scene: not just secondary effects like shadows and reflections, but physical interactions like objects falling when a person is removed. It is built on top of CogVideoX and fine-tuned for video inpainting with interaction-aware mask conditioning.

Example: if a person holding a guitar is removed, VOID also removes the person's effect on the guitar, causing it to fall naturally.

VOID uses two transformer checkpoints, trained sequentially. You can run inference with Pass 1 alone or chain both passes for higher temporal consistency.

VOID Pass 1: Base inpainting model.
VOID Pass 2: Warped-noise refinement model.

Requires a GPU with 40GB+ VRAM (e.g., A100).

Stage 1 of the mask pipeline uses Gemini via the Google AI API and SAM2 for segmentation. The quadmask encodes four semantic regions per pixel:

0: Primary object
63: Overlap
127: Affected region/interactions
255: Background

Inference runs in two passes, with Pass 2 using optical flow-warped latents from Pass 1 to improve temporal consistency.
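To make the quadmask encoding described above concrete, here is a minimal Python sketch that packs two binary masks into a single-channel quadmask using the four values 0, 63, 127 and 255. This is not the VOID codebase: the function names `build_quadmask` and `decode_quadmask` are hypothetical, and the sketch assumes that "Overlap" means pixels covered by both the primary-object mask and the affected-region mask, which the summary does not state explicitly.

```python
import numpy as np

# Pixel values taken from the description above.
PRIMARY, OVERLAP, AFFECTED, BACKGROUND = 0, 63, 127, 255


def build_quadmask(primary: np.ndarray, affected: np.ndarray) -> np.ndarray:
    """Combine two boolean masks of equal shape into one uint8 quadmask.

    Assumption (not confirmed by the source): the overlap region is the
    intersection of the primary-object mask and the affected-region mask.
    """
    primary = np.asarray(primary, dtype=bool)
    affected = np.asarray(affected, dtype=bool)
    if primary.shape != affected.shape:
        raise ValueError("primary and affected masks must have the same shape")

    # Start with everything as background, then paint the other regions.
    quad = np.full(primary.shape, BACKGROUND, dtype=np.uint8)
    quad[affected] = AFFECTED
    quad[primary] = PRIMARY
    quad[primary & affected] = OVERLAP
    return quad


def decode_quadmask(quad: np.ndarray) -> dict:
    """Split a quadmask back into one boolean mask per semantic region."""
    return {
        "primary": quad == PRIMARY,
        "overlap": quad == OVERLAP,
        "affected": quad == AFFECTED,
        "background": quad == BACKGROUND,
    }


# Tiny 2x3 frame: the object covers the first two pixels of row 0,
# and the affected region covers the middle column.
primary_mask = np.array([[1, 1, 0], [0, 0, 0]], dtype=bool)
affected_mask = np.array([[0, 1, 0], [0, 1, 0]], dtype=bool)
quadmask = build_quadmask(primary_mask, affected_mask)
regions = decode_quadmask(quadmask)
```

For a video, the same function could be applied frame by frame to per-frame masks, for example ones produced by a segmentation model.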
Source: Hacker News












