Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
AnyV2V: A Tuning-Free Framework For Any Video-to-Video Editing Tasks
Authors: Max Ku, Cong Wei, Weiming Ren, Harry Yang, Wenhu Chen
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our evaluation shows that AnyV2V achieved CLIP-scores comparable to other baseline methods. Furthermore, AnyV2V significantly outperformed these baselines in human evaluations, demonstrating notable improvements in visual consistency with the source video while producing high-quality edits across all editing tasks. The code is available at https://github.com/TIGER-AI-Lab/AnyV2V. ... We show both quantitatively and qualitatively that our method outperforms existing SOTA baselines in Section 5.3 and Appendix B. |
| Researcher Affiliation | Collaboration | Max Ku, Cong Wei, Weiming Ren, Harry Yang, Wenhu Chen University of Waterloo, Vector Institute, Harmony.AI EMAIL |
| Pseudocode | No | The paper describes the methodology in prose and mathematical formulations (Equations 1-4) but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code is available at https://github.com/TIGER-AI-Lab/AnyV2V. |
| Open Datasets | Yes | Our human evaluation dataset contains a total of 89 samples that have been collected from https://www.pexels.com. For prompt-based editing, we employed InstructPix2Pix (Brooks et al., 2023) to compose the examples. Topics include swapping objects, adding objects, and removing objects. For subject-driven editing, we employed AnyDoor (Chen et al., 2023c) to replace objects with reference subjects. For Neural Style Transfer, we employed NST (Gatys et al., 2015) to compose the examples. For identity manipulation, we employed InstantID (Wang et al., 2024b) to compose the examples. See Table 4 for the statistics. ... For the human evaluation dataset, the dataset has been collected from https://www.pexels.com, with all data governed by the terms outlined at https://www.pexels.com/license/. |
| Dataset Splits | No | The paper mentions a human evaluation dataset with specific numbers of entries for different categories (e.g., 45 for Prompt-based Editing, 20 for Reference-based Style Transfer), but it does not describe any training, validation, or test splits for model training or evaluation beyond these samples for human review. The models used (I2VGen-XL, ConsistI2V, SEINE) are described as off-the-shelf, implying pre-trained models where the authors did not perform their own data splitting for training. |
| Hardware Specification | Yes | We conducted all the experiments on a single Nvidia A6000 GPU. |
| Software Dependencies | No | The paper lists various I2V generation models (I2VGen-XL, ConsistI2V, SEINE) and image editing models (InstructPix2Pix, Neural Style Transfer, AnyDoor, InstantID) used. It also mentions the DDIM sampler. However, it does not provide specific version numbers for underlying software libraries or programming languages (e.g., Python, PyTorch, TensorFlow versions) that are typically required for reproducibility. |
| Experiment Setup | Yes | For all I2V models, we use τ_conv = 0.2T, τ_sa = 0.2T and τ_ta = 0.5T, where T is the total number of sampling steps. We use the DDIM (Song et al., 2020) sampler and set T to the default values of the selected I2V models. Following PnP (Tumanyan et al., 2023b), we set l1 = 4 for convolution feature injection and l2 = l3 = {4, 5, 6, ..., 11} for spatial and temporal attention injections. During sampling, we apply text classifier-free guidance (CFG) (Ho & Salimans, 2022) for all models with the same negative prompt "Distorted, discontinuous, Ugly, blurry, low resolution, motionless, static, disfigured, disconnected limbs, Ugly faces, incomplete arms" across all edits. |
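
The injection schedule quoted in the Experiment Setup row can be made concrete with a small sketch. This is a hypothetical helper (not from the AnyV2V codebase) that derives, under the paper's stated hyperparameters, which sampling steps and U-Net layers each feature-injection type would cover; the function name and dictionary layout are assumptions for illustration.

```python
def injection_schedule(total_steps: int,
                       tau_conv: float = 0.2,
                       tau_sa: float = 0.2,
                       tau_ta: float = 0.5) -> dict:
    """Sketch of the paper's injection schedule: tau_conv = 0.2T,
    tau_sa = 0.2T, tau_ta = 0.5T, with l1 = 4 for convolution
    feature injection and layers 4-11 for spatial/temporal attention.
    Steps are assumed to be indexed 0..T-1 from the start of sampling."""
    attn_layers = list(range(4, 12))  # l2 = l3 = {4, 5, ..., 11}
    return {
        "conv":          {"steps": list(range(int(tau_conv * total_steps))),
                          "layers": [4]},
        "spatial_attn":  {"steps": list(range(int(tau_sa * total_steps))),
                          "layers": attn_layers},
        "temporal_attn": {"steps": list(range(int(tau_ta * total_steps))),
                          "layers": attn_layers},
    }

# With T = 50 DDIM steps (a common default, assumed here), convolution
# and spatial-attention injection cover the first 10 steps, and
# temporal-attention injection covers the first 25.
schedule = injection_schedule(total_steps=50)
```

Since T is set to each I2V model's default, the absolute number of injection steps varies per backbone while the 0.2/0.2/0.5 fractions stay fixed.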