Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Elevating Flow-Guided Video Inpainting with Reference Generation

Authors: Suhwan Cho, Seoung Wug Oh, Sangyoun Lee, Joon-Young Lee

AAAI 2025 | Venue PDF | LLM Run Details

Reproducibility Variables
Research Type: Experimental
LLM Response: In this study, we propose a robust and practical VI framework that leverages a large generative model for reference generation in combination with an advanced pixel propagation algorithm. Powered by a strong generative model, our method not only significantly enhances frame-level quality for object removal but also synthesizes new content in the missing areas based on user-provided text prompts. For pixel propagation, we introduce a one-shot pixel pulling method that effectively avoids error accumulation from repeated sampling while maintaining sub-pixel precision. To evaluate various VI methods in realistic scenarios, we also propose a high-quality VI benchmark, HQVI, comprising carefully generated videos using alpha matte composition. On public benchmarks and the HQVI dataset, our method demonstrates significantly higher visual quality and metric scores compared to existing solutions.
Researcher Affiliation: Collaboration
LLM Response: 1 Yonsei University, Seoul, Korea; 2 Adobe Research, San Jose, California, USA; EMAIL, EMAIL
Pseudocode: Yes
LLM Response: For clarity, we also provide pseudo code in Algorithm 1. The proposed bi-directional pixel collection based on the one-shot pixel pulling protocol takes masked images X, given masks M, and completed flows f as input.
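The "one-shot pixel pulling" step quoted above amounts to sampling a source frame at flow-displaced locations with sub-pixel (bilinear) precision. A minimal sketch follows; the function name `pull_pixels`, the target-to-source flow convention, and the border clipping are assumptions, and the paper's Algorithm 1 additionally composes flows across frames and handles the masks M, which is omitted here.

```python
import numpy as np

def pull_pixels(src, flow):
    """Pull colors for every target pixel from `src` in one shot.

    src:  (H, W, 3) source frame.
    flow: (H, W, 2) completed flow, mapping target -> source coordinates
          as (dx, dy) offsets. Bilinear interpolation keeps sub-pixel
          precision, avoiding repeated resampling.
    """
    H, W, _ = src.shape
    ys, xs = np.mgrid[0:H, 0:W].astype(np.float64)
    # Flow-displaced sampling locations, clipped to the image border.
    x = np.clip(xs + flow[..., 0], 0, W - 1)
    y = np.clip(ys + flow[..., 1], 0, H - 1)
    # Integer corners and fractional weights for bilinear interpolation.
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    x1, y1 = np.minimum(x0 + 1, W - 1), np.minimum(y0 + 1, H - 1)
    wx, wy = (x - x0)[..., None], (y - y0)[..., None]
    return (src[y0, x0] * (1 - wx) * (1 - wy)
            + src[y0, x1] * wx * (1 - wy)
            + src[y1, x0] * (1 - wx) * wy
            + src[y1, x1] * wx * wy)
```

With zero flow the function returns the source frame unchanged, and an integer flow reduces to a plain pixel shift, which makes the sampling convention easy to verify.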
Open Source Code: Yes
LLM Response: Code: https://github.com/suhwan-cho/RGVI
Open Datasets: Yes
LLM Response: To quantitatively evaluate our method on realistic videos, we propose a high-quality VI benchmark dataset named HQVI. Instead of randomly corrupting videos with random objects or free-form masks (Zeng, Fu, and Chao 2020; Liu et al. 2021; Li et al. 2022), we carefully design each video sequence by blending foreground objects with the background video using alpha matte composition. We conduct extensive experiments on the HQVI, DAVIS 2016 (Perazzi et al. 2016), and YouTube-VOS 2018 (Xu et al. 2018) datasets to validate the effectiveness of our proposed approach.
Dataset Splits: Yes
LLM Response: To train the network, we utilize images from the YouTube-VOS 2018 (Xu et al. 2018) training set, resized to a resolution of 240×432. Images are randomly selected and masked for training purposes, where the original images serve as ground truth, and the masked versions are used as inputs alongside binary masks. We prepare two commonly adopted datasets: a combination of the DAVIS 2016 training set and validation set (50 videos) and the YouTube-VOS 2018 testing set (508 videos).
Hardware Specification: Yes
LLM Response: All experiments are conducted on a single TITAN RTX GPU.
Software Dependencies: No
LLM Response: The paper mentions several models and optimizers, such as RAFT (Teed and Deng 2020), Stable Diffusion based on the latent diffusion model (Rombach et al. 2022), AlexNet (Krizhevsky, Sutskever, and Hinton 2012) for LPIPS, and the Adam optimizer (Kingma and Ba 2014), but it does not provide specific version numbers for the software libraries or programming languages used in the implementation.
Experiment Setup: Yes
LLM Response: To train the network, we utilize images from the YouTube-VOS 2018 (Xu et al. 2018) training set, resized to a resolution of 240×432. Images are randomly selected and masked for training purposes, where the original images serve as ground truth, and the masked versions are used as inputs alongside binary masks. Our image corruption strategy includes two approaches: 1) random region free-form masking, akin to standard inpainting tasks; and 2) random object masking to simulate scenarios involving object removal. For training, we employ a straightforward combination of L1 loss and adversarial loss functions, optimized using the Adam optimizer (Kingma and Ba 2014) with a fixed learning rate of 1e-4. If the difference between the pulled colors from both directions falls below a threshold, we assign the average value to the target pixel location. Conversely, if there is disagreement (i.e., the difference exceeds the threshold), we identify these target pixels as unreliable and invalidate the propagation in subsequent steps. Note that the threshold value is empirically set to 1, with minimal observed variation across different values. At 240p and 480p input resolutions, RGVI without reference achieves the highest performance on traditional metrics (PSNR and SSIM).
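The threshold rule quoted above (average the bi-directionally pulled colors when they agree, invalidate the pixel when they disagree) can be sketched as follows. The function name `merge_bidirectional`, the max-over-channels distance, and the zero fill for invalidated pixels are assumptions, since the quoted text does not specify the distance measure or the fill value.

```python
import numpy as np

def merge_bidirectional(fwd, bwd, threshold=1.0):
    """Merge colors pulled from the forward and backward directions.

    fwd, bwd: (H, W, 3) float arrays of pulled colors.
    Returns (merged, valid): averaged colors and a boolean mask marking
    pixels where the two directions agreed within `threshold`.
    """
    # Per-pixel disagreement: largest per-channel difference (assumed metric).
    diff = np.abs(fwd - bwd).max(axis=-1)
    valid = diff < threshold
    # Average where both directions agree.
    merged = 0.5 * (fwd + bwd)
    # Invalidate disagreeing pixels so later propagation steps skip them.
    merged[~valid] = 0.0
    return merged, valid
```

The reliability mask is what lets subsequent propagation steps ignore pixels whose two estimates conflicted, rather than committing to a possibly wrong color.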