Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
ConsistI2V: Enhancing Visual Consistency for Image-to-Video Generation
Authors: Weiming Ren, Huan Yang, Ge Zhang, Cong Wei, Xinrun Du, Wenhao Huang, Wenhu Chen
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To verify the effectiveness of our method, we propose I2V-Bench, a comprehensive evaluation benchmark for I2V generation. Our automatic and human evaluation results demonstrate the superiority of ConsistI2V over existing methods. We conduct extensive automatic and human evaluations on a self-collected dataset I2V-Bench to verify the effectiveness of our method for I2V generation. We conduct an ablation study on UCF-101 by iteratively disabling FrameInit, temporal first frame conditioning and spatial first frame conditioning. We follow the same experiment setups in Section 5.2 and show the results in Table 3. |
| Researcher Affiliation | Collaboration | ¹University of Waterloo, ²Vector Institute, ³01.AI |
| Pseudocode | No | The paper describes the methodology in prose and uses figures to illustrate the concepts, but it does not contain any explicit pseudocode or algorithm blocks. For example, Section 3 'Methodology' and its subsections detail the approach without formalized algorithm structures. |
| Open Source Code | No | The paper provides a project website: 'Project Website: https://tiger-ai-lab.github.io/ConsistI2V/'. However, it does not explicitly state that the source code for the methodology described in the paper is available at this link or elsewhere. The website could be a demonstration page or project overview rather than a code repository. |
| Open Datasets | Yes | To verify the effectiveness of our method, we propose I2V-Bench, a comprehensive evaluation benchmark for I2V generation. We will release our evaluation dataset to foster future I2V generation research. We evaluate ConsistI2V on two public datasets UCF-101 (Soomro et al., 2012) and MSR-VTT (Xu et al., 2016). |
| Dataset Splits | Yes | MSR-VTT (Xu et al., 2016) is an open-domain video retrieval and captioning dataset containing 10K videos, with 20 captions for each video. The standard splits for MSR-VTT include 6,513 training videos, 497 validation videos and 2,990 test videos. We use the official test split in the experiment and randomly select a text prompt for each video during evaluation. |
| Hardware Specification | Yes | We measure all models on a single Nvidia RTX 4090 with float32 and an inference batch size of 1. |
| Software Dependencies | No | The paper mentions 'Stable Diffusion 2.1-base' as a base model and refers to other models and samplers (e.g., 'DDIM sampler', 'TorchMetrics', 'CLIP ViT-B/32 model', 'I3D model', 'Inception model', 'C3D model') but does not provide specific version numbers for multiple key ancillary software components like Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | We use Stable Diffusion 2.1-base (Rombach et al., 2022) as the base T2I model to initialize ConsistI2V and train the model on the WebVid-10M (Bain et al., 2021) dataset... For each video, we sample 16 frames with a spatial resolution of 256×256 and a frame interval 1 ≤ v ≤ 5... Our model is trained with the ϵ objective over all U-Net parameters using a batch size of 192 and a learning rate of 5e-5 for 170k steps. During training, we randomly drop input text prompts with a probability of 0.1 to enable classifier-free guidance (Ho & Salimans, 2022). During inference, we employ the DDIM sampler (Song et al., 2020) with 50 steps and classifier-free guidance with a guidance scale of w = 7.5 to sample videos. We apply FrameInit with τ = 850 and D0 = 0.25 for inference noise initialization. |
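The FrameInit step quoted in the last row (τ = 850, D0 = 0.25) initializes inference noise by mixing the low-frequency band of a noised first-frame latent with the high-frequency band of fresh Gaussian noise. The sketch below illustrates that frequency-domain mix only; the Gaussian low-pass form, the function names, and the omission of the τ-step forward-noising are our assumptions for illustration, not the authors' implementation:

```python
import numpy as np

def gaussian_low_pass(shape, d0):
    # Gaussian low-pass mask over normalized frequencies (assumed filter
    # form; d0 plays the role of the paper's stop frequency D0 = 0.25).
    grids = np.meshgrid(*[np.fft.fftfreq(n) for n in shape], indexing="ij")
    d2 = sum(g ** 2 for g in grids)  # squared normalized frequency
    return np.exp(-d2 / (2 * d0 ** 2))

def frameinit_noise(noised_first_frame_latent, rng, d0=0.25):
    # noised_first_frame_latent: the first-frame latent, already forward-
    # noised to timestep tau and repeated along time, shape (T, C, H, W).
    # Keep its low frequencies; take high frequencies from fresh noise.
    noise = rng.standard_normal(noised_first_frame_latent.shape)
    mask = gaussian_low_pass(noised_first_frame_latent.shape, d0)
    low = np.fft.fftn(noised_first_frame_latent) * mask
    high = np.fft.fftn(noise) * (1.0 - mask)
    return np.real(np.fft.ifftn(low + high))
```

The blended result then replaces the purely random initial latent handed to the DDIM sampler; a larger `d0` preserves more of the first frame's layout at the cost of less sampling diversity.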