FLATTEN: optical FLow-guided ATTENtion for consistent text-to-video editing

Authors: Yuren Cong, Mengmeng Xu, Christian Simon, Shoufa Chen, Jiawei Ren, Yanping Xie, Juan-Manuel Perez-Rua, Bodo Rosenhahn, Tao Xiang, Sen He

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | "4 EXPERIMENTS", "Table 1: Quantitative results on TGVE-D and TGVE-V.", "We compare our approach with 5 publicly available text-to-video editing methods", "We conduct extensive experiments to validate the effectiveness of our method." |
| Researcher Affiliation | Collaboration | Leibniz University Hannover, Meta AI, The University of Hong Kong, Nanyang Technological University |
| Pseudocode | No | The paper describes the methodology in prose and diagrams (e.g., Figure 3, Figure 4) but does not provide structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The project page is available at https://flatten-video-editing.github.io/. |
| Open Datasets | Yes | We evaluate our text-to-video editing framework with 53 videos sourced from LOVEU-TGVE (https://sites.google.com/view/loveucvpr23/track4). 16 of these videos are from DAVIS (Perazzi et al., 2016), and we denote this subset as TGVE-D. The other 37 videos are from Videvo, which are denoted as TGVE-V. |
| Dataset Splits | No | The paper evaluates on 53 videos but does not provide explicit training/validation/test splits; it only describes the total number and characteristics of the videos used for evaluation. |
| Hardware Specification | Yes | The runtime of the different models at different stages on a single A100 GPU is shown in Table 5. |
| Software Dependencies | No | The paper mentions specific software such as RAFT and xFormers but does not provide their version numbers (see the optical-flow sketch after this table). |
| Experiment Setup | Yes | We implement 100 timesteps for DDIM inversion and 50 timesteps for DDIM sampling. (See the scheduler sketch after this table.) |
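The paper's flow guidance relies on RAFT-estimated optical flow between consecutive frames. As a minimal sketch of how such flow can be obtained, assuming torchvision's RAFT implementation rather than the authors' unspecified version and pipeline:

```python
# Minimal optical-flow sketch using torchvision's RAFT (an assumption; the paper
# does not state which RAFT implementation or version it uses).
import torch
from torchvision.models.optical_flow import raft_large, Raft_Large_Weights

weights = Raft_Large_Weights.DEFAULT
model = raft_large(weights=weights).eval()
preprocess = weights.transforms()

# Two consecutive video frames as float tensors in [0, 1], shape (N, 3, H, W);
# random tensors stand in for real frames here. H and W must be divisible by 8.
frame_t = torch.rand(1, 3, 384, 512)
frame_t1 = torch.rand(1, 3, 384, 512)
img1, img2 = preprocess(frame_t, frame_t1)

with torch.no_grad():
    flow_predictions = model(img1, img2)  # list of iteratively refined flow fields
flow = flow_predictions[-1]               # (N, 2, H, W): per-pixel (dx, dy) displacement
```

In FLATTEN, such per-pixel displacements guide attention along motion trajectories; the snippet above covers only the flow-estimation step.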
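The reported timestep configuration (100 DDIM inversion steps, 50 DDIM sampling steps) could be set up as follows. This is a hedged sketch assuming a Stable Diffusion backbone accessed through the diffusers library with a placeholder checkpoint id; it is not the authors' released code.

```python
# Sketch of the reported DDIM timestep configuration (100 inversion / 50 sampling steps),
# assuming a Stable Diffusion backbone via diffusers; the checkpoint id is a placeholder.
import torch
from diffusers import StableDiffusionPipeline, DDIMScheduler, DDIMInverseScheduler

model_id = "runwayml/stable-diffusion-v1-5"  # assumed checkpoint, not specified by the paper excerpt
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

# DDIM inversion: map each frame's latent back toward noise over 100 timesteps.
inverse_scheduler = DDIMInverseScheduler.from_config(pipe.scheduler.config)
inverse_scheduler.set_timesteps(100)

# DDIM sampling: generate the edited frames over 50 timesteps.
sampling_scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
sampling_scheduler.set_timesteps(50)
pipe.scheduler = sampling_scheduler
```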