VDT: General-purpose Video Diffusion Transformers via Mask Modeling
Authors: Haoyu Lu, Guoxing Yang, Nanyi Fei, Yuqi Huo, Zhiwu Lu, Ping Luo, Mingyu Ding
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on these tasks spanning various scenarios, including autonomous driving, natural weather, human action, and physics-based simulation, demonstrate the effectiveness of VDT. |
| Researcher Affiliation | Collaboration | 1. Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China; 2. University of California, Berkeley, United States; 3. The University of Hong Kong, Pokfulam, Hong Kong; 4. Baichuan Inc. |
| Pseudocode | No | The paper describes the architecture and methods using text and diagrams, but it does not include any formal pseudocode or algorithm blocks. |
| Open Source Code | Yes | Codes and models are available at VDT-2023.github.io. |
| Open Datasets | Yes | The VDT is evaluated on both video generation and video prediction tasks. Unconditional generation results on the widely-used UCF101 (Soomro et al., 2012), TaiChi (Siarohin et al., 2019), and Sky Time-Lapse (Xiong et al., 2018) datasets are provided for video synthesis. For video prediction, experiments are conducted on the real-world driving dataset Cityscapes (Cordts et al., 2016), as well as on the more challenging physical prediction dataset Physion (Bear et al., 2021), to demonstrate VDT's strong prediction ability. |
| Dataset Splits | No | The paper mentions training on the "train split" and the "train + test split" but does not give the specific train/validation/test split details (e.g., percentages or sample counts) needed to reproduce the dataset partitioning, nor does it explicitly mention a validation set. |
| Hardware Specification | Yes | We list training and inference times in Table 12; all experiments are conducted on NVIDIA A100 GPUs. |
| Software Dependencies | No | The paper mentions using a "pre-trained variational autoencoder (VAE) model (Rombach et al., 2022) as the tokenizer". However, it does not provide specific version numbers for software dependencies like programming languages, libraries, or other tools used in the experiments. |
| Experiment Setup | Yes | We empirically set the initial learning rate to 1e-4 and adopt AdamW (Loshchilov & Hutter, 2019) for our training. We utilize a pre-trained variational autoencoder (VAE) model (Rombach et al., 2022) as the tokenizer and freeze it during training. The patch size is uniformly set to 2 across all tasks. More details are given in the Appendix; Table 13 of the paper lists the hyperparameters for each task. A hedged configuration sketch is given after this table. |
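
The setup row above only quotes the headline hyperparameters: AdamW with an initial learning rate of 1e-4, a frozen pre-trained VAE tokenizer, and a patch size of 2. The sketch below shows how those pieces would typically fit together in PyTorch. It is not the authors' implementation (their code and models are released at VDT-2023.github.io): the `VDTBackbone` class, the `diffusers` VAE checkpoint (`stabilityai/sd-vae-ft-mse`), and the linear noise schedule are all illustrative assumptions.

```python
import torch
from torch import nn
from diffusers import AutoencoderKL  # pre-trained KL-VAE (Rombach et al., 2022), assumed via diffusers


class VDTBackbone(nn.Module):
    """Toy placeholder for the VDT transformer; not the released architecture."""

    def __init__(self, latent_channels: int = 4, patch_size: int = 2, hidden_dim: int = 768):
        super().__init__()
        self.patch_size = patch_size
        # Patchify each latent frame with the uniform patch size of 2 quoted above.
        self.patch_embed = nn.Conv2d(latent_channels, hidden_dim,
                                     kernel_size=patch_size, stride=patch_size)
        self.blocks = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(hidden_dim, nhead=12, batch_first=True),
            num_layers=2,  # toy depth, for illustration only
        )
        self.head = nn.Linear(hidden_dim, latent_channels * patch_size ** 2)

    def forward(self, latents: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # latents: (B, T, C, H, W) video latents; t is unused in this toy sketch.
        b, f, c, h, w = latents.shape
        p = self.patch_size
        x = self.patch_embed(latents.flatten(0, 1))   # (B*T, D, H/p, W/p)
        x = x.flatten(2).transpose(1, 2)              # (B*T, N, D) per-frame patch tokens
        x = x.reshape(b, f * x.shape[1], -1)          # joint spatio-temporal token sequence
        x = self.head(self.blocks(x))                 # (B, T*N, C*p*p)
        x = x.reshape(b, f, h // p, w // p, p, p, c)
        return x.permute(0, 1, 6, 2, 4, 3, 5).reshape(b, f, c, h, w)


# Frozen pre-trained VAE tokenizer; the specific checkpoint here is an assumption.
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")
vae.requires_grad_(False)
vae.eval()

model = VDTBackbone(patch_size=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # AdamW, initial lr 1e-4


def training_step(video: torch.Tensor) -> torch.Tensor:
    """One simplified noise-prediction step on VAE latents (video: (B, T, 3, H, W) in [-1, 1])."""
    b, f = video.shape[:2]
    with torch.no_grad():  # the tokenizer stays frozen throughout training
        latents = vae.encode(video.flatten(0, 1)).latent_dist.sample() * 0.18215
    latents = latents.reshape(b, f, *latents.shape[1:])

    noise = torch.randn_like(latents)
    t = torch.randint(0, 1000, (b,), device=latents.device)
    alpha = (1.0 - t.float() / 1000).view(b, 1, 1, 1, 1)  # toy schedule, not the paper's
    noisy = alpha.sqrt() * latents + (1 - alpha).sqrt() * noise

    loss = nn.functional.mse_loss(model(noisy, t), noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss
```

A single call such as `training_step(torch.randn(2, 4, 3, 64, 64).clamp(-1, 1))` runs one optimization step on random frames; only the transformer receives gradients, which matches the frozen-tokenizer setup quoted in the table.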