Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
MiniMax-Remover: Taming Bad Noise Helps Video Object Removal
Authors: Bojia Zi, Weixuan Peng, Xianbiao Qi, Jianan Wang, Shihao Zhao, Rong Xiao, Kam-Fai Wong
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate the effectiveness and superiority of MiniMax-Remover compared to existing methods. |
| Researcher Affiliation | Collaboration | 1 The Chinese University of Hong Kong; 2 Shenzhen University; 3 IntelliFusion Inc.; 4 Astribot Inc.; 5 The University of Hong Kong |
| Pseudocode | Yes | Pseudo Codes for Stage-1 Model. |
| Open Source Code | Yes | Codes and Videos are available at: https://minimax-remover.github.io. |
| Open Datasets | Yes | In Stage 1, we use Grounded-SAM2 [29, 42] and captions from CogVLM2 [19] to generate masks on the watermark-free WebVid-10M dataset [1]. Approximately 2.5M video-mask pairs are randomly selected for training. In Stage 2, we collect 17K videos from Pexels [39] and apply the same annotation process as in Stage 1. These are further processed using the model from Stage 1, and 10K videos are manually selected for Stage 2 training. We evaluate these metrics on DAVIS datasets and 200 randomly selected Pexels videos to show generalizations across different datasets. |
| Dataset Splits | No | The paper describes the data used for training and evaluation: "Approximately 2.5M video-mask pairs are randomly selected for training." for Stage 1, and "10K videos are manually selected for Stage 2 training." with a mixture of "one-third of training samples are drawn from our curated 10K set with their associated adversarial noises, while the remaining two-thirds are standard WebVid-10M videos with randomly generated object masks." for Stage 2. For evaluation, it states, "We evaluate these metrics on DAVIS datasets and 200 randomly selected Pexels videos to show generalizations across different datasets." However, it does not provide explicit train/test/validation splits (e.g., percentages or counts) for any single dataset. |
| Hardware Specification | Yes | All experiments are conducted on 8 A800 GPUs (80GB each) and take about two days in total. Inference Details. We perform inference using RTX 4090 GPUs. |
| Software Dependencies | No | The paper mentions using the AdamW optimizer [32] and refers to other models such as Wan2.1-1.3B [44], but it does not specify software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | Training Details. For Stage 1, we initialize our model with Wan2.1-1.3B [44]. Newly added layers, such as the embedding layer, are randomly initialized. The first 16 channels of the patch embedder are copied from Wan2.1, while the remaining 32 are zero-initialized. Training uses a batch size of 128, input frame length of 81, and resolutions randomly sampled from 336×592 to 720×1280. We set the first N mask frames to 0 to support the any-length inpainting by applying sliding windows, using a random ratio of 0.1. We use AdamW optimizer [32] with a constant learning rate of 1e-5, weight decay of 1e-4, and train for 10K steps. In Stage 2, we reuse the model from Stage 1, excluding the embedding layer since no external conditions are needed. One-third of the training iterations apply min-max optimization; the rest follow standard training using unrelated masks from WebVid [1]. Hyperparameters remain the same as in Stage 1. |
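
The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration sketch. This is not the authors' code: the names `Stage1Config`, `expand_patch_embed`, and `uses_minmax` are hypothetical, the weight layout (one row of input-channel values per output unit) is a simplification, and the "every third step" schedule is only one assumed way to realize "one-third of the training iterations". The meaning of the 0.1 ratio is recorded as quoted, without interpretation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage1Config:
    """Stage-1 values as quoted in the Experiment Setup row (field names are hypothetical)."""
    init_checkpoint: str = "Wan2.1-1.3B"
    batch_size: int = 128
    num_frames: int = 81
    min_resolution: tuple = (336, 592)
    max_resolution: tuple = (720, 1280)
    zero_first_mask_frames_ratio: float = 0.1  # quoted "random ratio of 0.1"
    optimizer: str = "AdamW"
    learning_rate: float = 1e-5  # constant, per the quoted text
    weight_decay: float = 1e-4
    train_steps: int = 10_000


def expand_patch_embed(pretrained_rows, extra_channels=32):
    """Keep the pretrained input channels and append zero-initialized ones.

    Simplified layout (an assumption): each row holds one weight per input channel.
    """
    return [list(row) + [0.0] * extra_channels for row in pretrained_rows]


def uses_minmax(step):
    """Assumed Stage-2 schedule: every third iteration applies min-max optimization."""
    return step % 3 == 0


if __name__ == "__main__":
    cfg = Stage1Config()
    weights = expand_patch_embed([[0.5] * 16])  # 16 copied + 32 zero channels
    print(cfg.optimizer, cfg.learning_rate, len(weights[0]))
```

Zero-initializing the added input channels means the expanded layer initially produces the same output as the pretrained one on the original 16 channels, which is the usual motivation for this kind of initialization.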