Towards Consistent Video Editing with Text-to-Image Diffusion Models

Authors: Zicheng Zhang, Bonan Li, Xuecheng Nie, Congying Han, Tiande Guo, Luoqi Liu

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate the superiority of the proposed EI2 model.
Researcher Affiliation | Collaboration | University of Chinese Academy of Sciences; MT Lab, Meitu Inc.
Pseudocode | No | The paper does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper states: 'Our implementation of EI2 is based on the stable diffusion v1-4 framework' (Stable Diffusion: https://huggingface.co/CompVis/stable-diffusion-v1-4). This is a reference to a third-party framework used as a base, not the authors' own source code for EI2.
Open Datasets | Yes | Following previous works [56, 28], we collect videos from the DAVIS dataset [34] for comparison. We also gather face videos from the Pexels website to assess the fine-grained editing in the face domain. We utilize a captioning model [27] to automatically generate the text prompts.
Dataset Splits | No | The paper mentions 'perform tuning on 8-frame videos of size 512 × 512' and 'tune WQ of the FFAM and CA Modules, and all parameters of STAMs' but does not specify explicit training, validation, or test dataset splits (e.g., percentages or sample counts for each split).
Hardware Specification | Yes | All experiments are conducted on an NVIDIA Tesla V100 GPU.
Software Dependencies | Yes | Our implementation of EI2 is based on the stable diffusion v1-4 framework. We utilize the AdamW optimizer.
Experiment Setup | Yes | We utilize the AdamW optimizer with a learning rate of 3e-5 for a total of 500 steps. During inference, we initialize the model from the DDIM inversion [15] and set the default classifier-free guidance [17] to 7.5. A hedged code sketch of this setup follows the table.
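To make the reported setup concrete, the sketch below assembles the quoted hyperparameters (Stable Diffusion v1-4 base, AdamW at 3e-5 for 500 steps, classifier-free guidance of 7.5) into runnable form. The authors released no code, so this is only a minimal sketch assuming the HuggingFace diffusers library; the "to_q" parameter selection, the prompt, and num_inference_steps=50 are illustrative assumptions, and the paper's FFAM/STAM modules and DDIM-inversion initialization are not implemented here.

```python
# Minimal sketch of the reported setup; NOT the authors' released code (none exists).
# Assumes the HuggingFace diffusers library as the interface to Stable Diffusion v1-4.
import torch
from diffusers import StableDiffusionPipeline, DDIMScheduler

device = "cuda"  # the paper reports a single NVIDIA Tesla V100

# Base model cited in the paper: https://huggingface.co/CompVis/stable-diffusion-v1-4
pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4").to(device)
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)

# The paper tunes only the query projections (WQ) of its attention modules plus all
# STAM parameters. Those modules are not public, so as an illustration we restrict
# training to the query projections ("to_q") of the base UNet instead.
trainable = [p for name, p in pipe.unet.named_parameters() if "to_q" in name]
optimizer = torch.optim.AdamW(trainable, lr=3e-5)  # reported: AdamW, learning rate 3e-5

num_tuning_steps = 500  # reported: 500 tuning steps on one 8-frame 512x512 video
# ... per-step loop: compute a denoising loss on the source video frames,
#     then optimizer.step() and optimizer.zero_grad() ...

# Inference: the paper initializes latents from a DDIM inversion of the source video
# (omitted here) and uses a default classifier-free guidance scale of 7.5.
with torch.no_grad():
    frame = pipe(
        prompt="an example editing prompt",  # illustrative placeholder
        guidance_scale=7.5,                  # reported default
        num_inference_steps=50,              # assumed; not stated in the quoted text
    ).images[0]
```

The point the sketch illustrates is that only a small parameter subset is optimized per video while the rest of the pretrained model stays frozen, which is what makes the reported 500-step budget on a single V100 plausible.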