SEINE: Short-to-Long Video Diffusion Model for Generative Transition and Prediction
Authors: Xinyuan Chen, Yaohui Wang, Lingjun Zhang, Shaobin Zhuang, Xin Ma, Jiashuo Yu, Yali Wang, Dahua Lin, Yu Qiao, Ziwei Liu
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments validate the effectiveness of our approach over existing methods for generative transition and prediction, enabling the creation of story-level long videos. |
| Researcher Affiliation | Academia | ¹Shanghai Artificial Intelligence Laboratory; ²East China Normal University; ³Shanghai Jiao Tong University; ⁴Dept. of Data Science & AI, Monash University; ⁵Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences; ⁶S-Lab, Nanyang Technological University |
| Pseudocode | No | The paper does not contain structured pseudocode or explicitly labeled algorithm blocks (a hedged sketch of the random-mask conditioning is given below the table). |
| Open Source Code | Yes | Project page: https://vchitect.github.io/SEINE-project/. |
| Open Datasets | Yes | We first utilize the WebVid-10M dataset (Bain et al., 2021) as the main training set... We employ the MSR-VTT dataset... on the UCF-101 dataset (Soomro et al., 2012). |
| Dataset Splits | No | The paper mentions using the WebVid-10M, MSR-VTT, and UCF-101 datasets, and refers to a ‘test set’ for MSR-VTT and a ‘training set’ for UCF-101, but it does not provide specific split percentages or sample counts for training, validation, and test sets, which would be needed to reproduce the data partitioning. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions using the LaVie-base and Stable Diffusion models but does not provide version numbers for software dependencies such as Python, PyTorch, or other libraries required for reproduction. |
| Experiment Setup | Yes | Our model is trained on videos of 320×512 resolution with 16 frames. In our model, we set p = 0.15... Our results are generated by the DDIM sampling of 100 steps. (See the sampling sketch below the table.) |
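The paper trains SEINE as a random-mask video diffusion model and reports p = 0.15, but publishes no pseudocode. Below is a minimal sketch of frame-level random masking under stated assumptions: the function name, the (B, C, T, H, W) tensor layout, and the reading of p as a per-frame keep probability are illustrative guesses, not the authors' implementation.

```python
import torch

def make_masked_condition(video: torch.Tensor, p: float = 0.15):
    """Illustrative random-mask conditioning for a video diffusion model.

    video: latent clip of shape (B, C, T, H, W), e.g. T = 16 frames.
    Each frame is independently kept as a visible condition frame with
    probability p; the rest are zeroed. The binary mask is returned so it
    can be concatenated with the noisy latents as extra model input.
    """
    b, _, t, _, _ = video.shape
    keep = (torch.rand(b, 1, t, 1, 1, device=video.device) < p).float()
    return video * keep, keep

# Example: a 16-frame latent clip with roughly 15% of frames left visible.
clip = torch.randn(2, 4, 16, 40, 64)
masked_clip, mask = make_masked_condition(clip, p=0.15)
```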
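For the reported inference configuration (DDIM sampling with 100 steps over 16 frames at 320×512), the following sketch uses the diffusers DDIMScheduler. The denoiser stub, the 4-channel latent space, and the 8× VAE downsampling factor are assumptions; SEINE builds on LaVie-base, and its actual model interface is not published in the paper.

```python
import torch
from diffusers import DDIMScheduler

def denoiser(latents: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    # Stand-in for the SEINE denoising UNet; a real run would call the
    # trained model (conditioned on masked frames and a text prompt).
    return torch.zeros_like(latents)

scheduler = DDIMScheduler(num_train_timesteps=1000)
scheduler.set_timesteps(num_inference_steps=100)  # 100 DDIM steps, as reported

# 16-frame latents at 320x512, assuming 4 latent channels and 8x downsampling.
latents = torch.randn(1, 4, 16, 320 // 8, 512 // 8)
for t in scheduler.timesteps:
    noise_pred = denoiser(latents, t)
    latents = scheduler.step(noise_pred, t, latents).prev_sample
```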