Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
ROSE: Remove Objects with Side Effects in Videos
Authors: Chenxuan Miao, Yutong Feng, Jianshu Zeng, Zixiang Gao, Hantang Liu, Yunfeng Yan, Donglian Qi, Xi Chen, Bin Wang, Hengshuang Zhao
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results demonstrate that ROSE achieves superior performance compared to existing video object erasing models and generalizes well to real-world video scenarios. 5 Experiments 5.1 Experiment Settings 5.2 Comparisons with Previous Methods 5.3 Ablation Study |
| Researcher Affiliation | Collaboration | ¹Zhejiang University, ²Kun Byte AI, ³Peking University, ⁴The University of Hong Kong |
| Pseudocode | No | The paper describes the framework of ROSE and its components verbally and with a diagram (Fig. 4), but it does not provide any structured pseudocode or algorithm blocks. For instance, it details: "We concatenate the noisy latents with the original input video and masks, consumed by a video inpainting model. An additional difference mask predictor is introduced to predict the correlated area in video, automatically computed from the input video pairs." |
| Open Source Code | Yes | The project page is https://rose2025-inpaint.github.io. The code and pre-trained models will be released upon acceptance, and the supplemental material contains detailed instructions to run experiments, including dataset preparation, model training, and evaluation. An anonymized GitHub link is included for reviewing purposes. |
| Open Datasets | Yes | Our dataset contains 16,678 synthetic video pairs rendered in Unreal Engine, each 6 seconds (90 frames) at 1920×1080 resolution. ... To address these gaps, we construct ROSEBench, a comprehensive evaluation benchmark on video object removal, consisting of following subsets: (i) Synthetic paired benchmark tailored for evaluation under diverse physical interaction effects. ... (ii) Realistic paired benchmark constructed using a copy-and-paste strategy based on the video segmentation dataset DAVIS [26]. ... (iii) Realistic unpaired benchmark containing real videos with masks. ... All data used in this work are synthetically generated within Unreal Engine, containing no real human, biometric, or copyrighted content. To prevent potential misuse, we plan to release the ROSE model and dataset under a research-only license. |
| Dataset Splits | Yes | Our dataset contains 16,678 synthetic video pairs rendered in Unreal Engine, each 6 seconds (90 frames) at 1920×1080 resolution. ... To address these gaps, we construct ROSEBench, a comprehensive evaluation benchmark on video object removal, consisting of following subsets: (i) Synthetic paired benchmark tailored for evaluation under diverse physical interaction effects. ... Every category contains 10 high-quality triplets of video sequences, i.e., original, edited, and mask videos, offering precise and controllable evaluation of model behavior under different side-effect conditions. |
| Hardware Specification | Yes | We fully train the model together with the difference mask predictor in 80000 optimization steps with 0.00002 learning rate on 4 NVIDIA H800 GPUs. ... All evaluations are conducted on 65-frame input videos with a resolution of 720×480, using float16 precision and NVIDIA H800 GPUs. |
| Software Dependencies | Yes | To tackle this problem, we propose to utilize the adequate 3D data together with advanced game engine, i.e., the Unreal Engine [7], to synthesize the paired video data. ... The backbone model is a controllable generation variant of Wan2.1 1.3B version [35]. |
| Experiment Setup | Yes | In the training process, we resize all the video pairs into the resolution of 720×480 and use 81 frames for training. The backbone model is a controllable generation variant of Wan2.1 1.3B version [35]. We fully train the model together with the difference mask predictor in 80000 optimization steps with 0.00002 learning rate on 4 NVIDIA H800 GPUs. ... During training, we use a batch size of 1 and randomly select a continuous sequence of 81 frames from triplets of the original, masked, and edited videos as input. ... The module is trained under MSE loss supervision against the ground-truth difference mask $d_t$ described in Eq. (3). It functions as an auxiliary self-localization signal to encourage the model to be sensitive to subtle visual effects introduced by object edits. Then the training objective of ROSE consists of two terms: the standard diffusion denoising loss and the auxiliary mask prediction loss: $\mathcal{L} = \mathbb{E}_{t, z_0, \epsilon}\left[ \lVert \epsilon - \hat{\epsilon} \rVert_2^2 + \lambda \lVert \hat{d}_t - d_t \rVert_2^2 \right]$ (3), where λ balances the two objectives. |
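
The two-term training objective quoted in the Experiment Setup row can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of flat Python lists in place of latent tensors, the mean (rather than summed) squared error, and the default value of `lam` are all assumptions made for illustration. The quoted text does not state the value of λ.

```python
from typing import Sequence


def mean_squared_error(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean of squared element-wise differences between two equal-length sequences."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    if len(pred) == 0:
        raise ValueError("inputs must be non-empty")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def combined_loss(
    noise_pred: Sequence[float],
    noise: Sequence[float],
    mask_pred: Sequence[float],
    mask_gt: Sequence[float],
    lam: float = 1.0,  # hypothetical default; the weight is not given in the quoted text
) -> float:
    """Denoising term plus lam times the auxiliary difference-mask term."""
    denoise_term = mean_squared_error(noise_pred, noise)  # stands in for ||eps - eps_hat||^2
    mask_term = mean_squared_error(mask_pred, mask_gt)    # stands in for ||d_hat_t - d_t||^2
    return denoise_term + lam * mask_term


# Toy example: denoise term 0.125, mask term 0.5, lam 0.5
loss = combined_loss([0.5, -1.0], [0.0, -1.0], [1.0, 0.0], [1.0, 1.0], lam=0.5)
print(loss)  # 0.375
```

Averaging instead of summing only rescales each term by a constant, so under this sketch the role of λ as the balance between the two terms is unchanged.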