Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

CoCoCo: Improving Text-Guided Video Inpainting for Better Consistency, Controllability and Compatibility

Authors: Bojia Zi, Shihao Zhao, Xianbiao Qi, Jianan Wang, Yukai Shi, Qianyu Chen, Bin Liang, Rong Xiao, Kam-Fai Wong, Lei Zhang

AAAI 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Comprehensive experiments demonstrate that CoCoCo can create high-quality visual content with enhanced temporal consistency, improved text controllability, and better compatibility with personalized image models. Through extensive visualizations, quantitative comparisons, and experimental analyses, we have thoroughly demonstrated the performance of our CoCoCo framework. We have also conducted detailed ablation studies to analyze and verify the contributions of each of the modules we have proposed.
Researcher Affiliation | Collaboration | 1The Chinese University of Hong Kong, 2The University of Hong Kong, 3International Digital Economy Academy, 4IntelliFusion Inc.
Pseudocode | No | The paper describes its methodology using narrative text, mathematical formulations, and diagrams (e.g., Figure 2). It does not contain any explicitly labeled 'Pseudocode' or 'Algorithm' blocks or figures.
Open Source Code | Yes | Code: https://github.com/zibojia/COCOCO
Open Datasets | Yes | We chose WebVid-10M as our training set. For data cleaning, we use the SceneDetect library with a threshold of 20 to guarantee videos with a single scene and discard those with multiple scenes. For instance-aware region selection, we set the detection resolution to 396×512 and a bounding box and phrase selection threshold of 0.2. Training clips are sampled from three data types as described in Section with probabilities of 0.7, 0.2, and 0.1, respectively. Training Details. We use the AdamW (Loshchilov and Hutter 2017) optimizer with a learning rate of 1×10⁻⁴ and a constant scheduler. The model is trained for 1 epoch with a batch size of 256 using gradient accumulation. Following DDPM (Ho, Jain, and Abbeel 2020), we use 1000 steps. The Stable Diffusion Inpainting V1.5 model initializes the spatial block, which remains unchanged while we train the motion block. Temporal attention layers are initialized with AnimateDiff V2, and the damped global attention and cross-attention layers with Kaiming initialization. Training is done at a resolution of 256×384, with a sample stride of 4 and 16 frames. Inference Details. In the inference stage, we follow DDIM (Song, Meng, and Ermon 2021), use 50 sampling steps with a classifier-free guidance scale of 14. The mask for each frame can be obtained automatically by Grounding DINO (Liu et al. 2023) and SAM (Kirillov et al. 2023), or provided by users in any shape. For a video with a resolution of 512×512 and 32 frames, the inpainting process can be finished within 1 minute on an NVIDIA 4090 GPU. [Figure 7: Comparison with AVID; red rectangles mark inconsistency and poor text alignment.] Experimental Results. We conduct extensive experiments to evaluate our method. In our experiments, we randomly select 1000 videos from the validation set of WebVid-10M (Bain et al. 2021), and extract the first 16 frames of each video with a sample rate of 4. We randomly generate the mask and prompt, and ask the model to generate the visual content in the masked region.
Dataset Splits | Yes | We chose WebVid-10M as our training set. In our experiments, we randomly select 1000 videos from the validation set of WebVid-10M (Bain et al. 2021), and extract the first 16 frames of each video with a sample rate of 4. We randomly generate the mask and prompt, and ask the model to generate the visual content in the masked region. Training clips are sampled from three data types as described in Section with probabilities of 0.7, 0.2, and 0.1, respectively.
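The 0.7 / 0.2 / 0.1 clip-sampling scheme quoted above can be sketched in a few lines. The data-type labels below are illustrative placeholders only; the paper defines the actual three categories in its data section.

```python
import random

# Hypothetical labels for the three data types; the real categories
# come from the paper's data-construction section.
DATA_TYPES = ["instance_masks", "random_masks", "full_masks"]
PROBS = [0.7, 0.2, 0.1]

def sample_data_type(rng: random.Random) -> str:
    """Pick which data type the next training clip is drawn from."""
    return rng.choices(DATA_TYPES, weights=PROBS, k=1)[0]

# Sanity check: empirical frequencies approach 0.7 / 0.2 / 0.1.
rng = random.Random(0)
counts = {t: 0 for t in DATA_TYPES}
for _ in range(10_000):
    counts[sample_data_type(rng)] += 1
```

`random.choices` normalizes the weights internally, so the three probabilities need not sum to exactly 1 for the sampling to behave as intended.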
Hardware Specification | Yes | For a video with a resolution of 512×512 and 32 frames, the inpainting process can be finished within 1 minute on an NVIDIA 4090 GPU.
Software Dependencies | No | The paper mentions several software components and libraries, such as AdamW, DDPM, DDIM, the SceneDetect library, Grounding DINO, and SAM, but does not provide specific version numbers for these tools or libraries.
Experiment Setup | Yes | We use the AdamW (Loshchilov and Hutter 2017) optimizer with a learning rate of 1×10⁻⁴ and a constant scheduler. The model is trained for 1 epoch with a batch size of 256 using gradient accumulation. Following DDPM (Ho, Jain, and Abbeel 2020), we use 1000 steps. Training is done at a resolution of 256×384, with a sample stride of 4 and 16 frames. In the inference stage, we follow DDIM (Song, Meng, and Ermon 2021), use 50 sampling steps with a classifier-free guidance scale of 14.
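The quoted inference settings reduce to two small ingredients: subsampling 50 DDIM timesteps from the 1000-step training schedule, and combining noise predictions with classifier-free guidance at scale 14. The sketch below illustrates both under standard formulations; function names are illustrative, not from the released code.

```python
import numpy as np

def ddim_timesteps(train_steps: int = 1000, sample_steps: int = 50) -> np.ndarray:
    """Evenly spaced timesteps subsampled from the training schedule,
    in descending order for the reverse (denoising) process."""
    stride = train_steps // sample_steps
    return np.arange(0, train_steps, stride)[::-1]

def cfg_combine(eps_uncond: np.ndarray, eps_cond: np.ndarray,
                scale: float = 14.0) -> np.ndarray:
    """Classifier-free guidance: extrapolate from the unconditional noise
    prediction toward the text-conditional one by the guidance scale."""
    return eps_uncond + scale * (eps_cond - eps_uncond)

ts = ddim_timesteps()  # 50 timesteps: 980, 960, ..., 20, 0
```

At scale 1 the guidance term vanishes and the conditional prediction is returned unchanged; a scale of 14 is unusually strong and reflects the paper's emphasis on text controllability in the masked region.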