VideoTetris: Towards Compositional Text-to-Video Generation

Authors: Ye Tian, Ling Yang, Haotian Yang, Yuan Gao, Yufan Deng, Xintao Wang, Zhaochen Yu, Xin Tao, Pengfei Wan, Di Zhang, Bin Cui

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that our VideoTetris achieves impressive qualitative and quantitative results in compositional T2V generation. |
| Researcher Affiliation | Collaboration | 1 Peking University, 2 Kuaishou Technology |
| Pseudocode | No | The paper includes equations and structured prompt templates, but no formally labeled "Pseudocode" or "Algorithm" block. |
| Open Source Code | Yes | Code: https://github.com/YangLing0818/VideoTetris |
| Open Datasets | Yes | For the second scenario, we employed the core ControlNet [22]-like branch from StreamingT2V [11] as the backbone and processed the Panda-70M [15] dataset using the Enhanced Video Data Preprocessing methods in Section 3.2 as the training set. |
| Dataset Splits | No | The paper names Panda-70M as the training set and describes a method for generating test prompts, but does not explicitly detail training/validation/test splits of Panda-70M itself, nor an explicit validation set. |
| Hardware Specification | Yes | We trained our model with batch size = 2 and learning rate = 1e-5 on 4 A800 GPUs for 16k steps in total. |
| Software Dependencies | No | The paper mentions software such as ControlNet, StreamingT2V, ChatGPT-3, GPT-4, and LLaMA-34, but does not provide specific version numbers for dependencies needed for reproducibility, such as Python or PyTorch versions. |
| Experiment Setup | Yes | In the training process, we randomly drop out 5% of text prompts for classifier-free guidance training. We trained our model with batch size = 2 and learning rate = 1e-5 on 4 A800 GPUs for 16k steps in total. ... The hyperparameters in Sections 3.2 and 3.3 are shown in Table 8. (A hedged sketch of these training settings follows the table.) |
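The Experiment Setup row quotes the training hyperparameters directly from the paper. The following is a minimal sketch, not the authors' released code, of how a 5% text-prompt dropout for classifier-free guidance is commonly wired into a training step. Only the numeric settings (batch size 2, learning rate 1e-5, 16k steps, 5% dropout) come from the paper; the model, text encoder, data, and loss are hypothetical placeholders.

```python
# Hedged sketch of the quoted training configuration; placeholders stand in for the real T2V pipeline.
import random
import torch

BATCH_SIZE = 2           # per the paper: batch size = 2
LEARNING_RATE = 1e-5     # per the paper: learning rate = 1e-5
TOTAL_STEPS = 16_000     # per the paper: 16k steps in total (on 4 A800 GPUs)
PROMPT_DROP_PROB = 0.05  # per the paper: randomly drop 5% of text prompts

def maybe_drop_prompt(prompt: str) -> str:
    """Replace the caption with an empty string 5% of the time so the model
    also learns the unconditional branch used by classifier-free guidance."""
    return "" if random.random() < PROMPT_DROP_PROB else prompt

# --- hypothetical placeholders (not the VideoTetris architecture) ---
class DummyT2VModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = torch.nn.Linear(64, 64)

    def forward(self, latents, text_emb):
        return self.proj(latents) + text_emb

def encode_text(prompt: str) -> torch.Tensor:
    # Stand-in for a real text encoder; empty prompts map to a "null" embedding.
    return torch.zeros(1, 64) if prompt == "" else torch.randn(1, 64)

model = DummyT2VModel()
optimizer = torch.optim.AdamW(model.parameters(), lr=LEARNING_RATE)

for step in range(TOTAL_STEPS):
    prompts = ["a cute brown dog, then a sleepy cat joins"] * BATCH_SIZE
    latents = torch.randn(BATCH_SIZE, 64)
    text_emb = torch.cat([encode_text(maybe_drop_prompt(p)) for p in prompts])

    pred = model(latents, text_emb)
    loss = torch.nn.functional.mse_loss(pred, latents)  # placeholder objective
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    break  # demo: run a single step; the paper trains for the full 16k steps
```

In the actual setup this loop would be distributed across the 4 A800 GPUs and applied to the video diffusion backbone with its denoising objective, none of which the sketch reproduces.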