VDT: General-purpose Video Diffusion Transformers via Mask Modeling
Authors: Haoyu Lu, Guoxing Yang, Nanyi Fei, Yuqi Huo, Zhiwu Lu, Ping Luo, Mingyu Ding
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on these tasks spanning various scenarios, including autonomous driving, natural weather, human action, and physics-based simulation, demonstrate the effectiveness of VDT. |
| Researcher Affiliation | Collaboration | 1. Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China; 2. University of California, Berkeley, United States; 3. The University of Hong Kong, Pokfulam, Hong Kong; 4. Baichuan Inc. |
| Pseudocode | No | The paper describes the architecture and methods using text and diagrams, but it does not include any formal pseudocode or algorithm blocks. |
| Open Source Code | Yes | Codes and models are available at VDT-2023.github.io. |
| Open Datasets | Yes | The VDT is evaluated on both video generation and video prediction tasks. Unconditional generation results on the widely-used UCF101 (Soomro et al., 2012), TaiChi (Siarohin et al., 2019), and Sky Time-Lapse (Xiong et al., 2018) datasets are provided for video synthesis. For video prediction, experiments are conducted on the real-world driving dataset Cityscapes (Cordts et al., 2016), as well as on the more challenging physical prediction dataset Physion (Bear et al., 2021), to demonstrate VDT's strong prediction ability. |
| Dataset Splits | No | The paper mentions training on the "train split" and the "train + test split" but does not give the specific train/validation/test split details (e.g., percentages or sample counts) needed to reproduce the dataset partitioning, nor does it explicitly mention a validation set. |
| Hardware Specification | Yes | We list training and inference times in Table 12; all experiments are conducted on NVIDIA A100 GPUs. |
| Software Dependencies | No | The paper mentions using a "pre-trained variational autoencoder (VAE) model (Rombach et al., 2022) as the tokenizer". However, it does not provide specific version numbers for software dependencies like programming languages, libraries, or other tools used in the experiments. |
| Experiment Setup | Yes | We empirically set the initial learning rate to 1e-4 and adopt AdamW (Loshchilov & Hutter, 2019) for our training. We utilize a pre-trained variational autoencoder (VAE) model (Rombach et al., 2022) as the tokenizer and freeze it during training. The patch size is uniformly set to 2 across all tasks. More details are given in the Appendix; Table 13 of the paper lists the hyperparameters for each task. A hedged configuration sketch is given after this table. |
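
The setup row above only quotes the headline hyperparameters: AdamW with an initial learning rate of 1e-4, a frozen pre-trained VAE tokenizer, and a patch size of 2. The sketch below shows how those pieces would typically fit together in PyTorch. It is not the authors' implementation (their code and models are released at VDT-2023.github.io): the `VDTBackbone` class, the `diffusers` VAE checkpoint (`stabilityai/sd-vae-ft-mse`), and the linear noise schedule are all illustrative assumptions.

```python
import torch
from torch import nn
from diffusers import AutoencoderKL  # pre-trained KL-VAE (Rombach et al., 2022), assumed via diffusers


class VDTBackbone(nn.Module):
    """Toy placeholder for the VDT transformer; not the released architecture."""

    def __init__(self, latent_channels: int = 4, patch_size: int = 2, hidden_dim: int = 768):
        super().__init__()
        self.patch_size = patch_size
        # Patchify each latent frame with the uniform patch size of 2 quoted above.
        self.patch_embed = nn.Conv2d(latent_channels, hidden_dim,
                                     kernel_size=patch_size, stride=patch_size)
        self.blocks = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(hidden_dim, nhead=12, batch_first=True),
            num_layers=2,  # toy depth, for illustration only
        )
        self.head = nn.Linear(hidden_dim, latent_channels * patch_size ** 2)

    def forward(self, latents: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # latents: (B, T, C, H, W) video latents; t is unused in this toy sketch.
        b, f, c, h, w = latents.shape
        p = self.patch_size
        x = self.patch_embed(latents.flatten(0, 1))   # (B*T, D, H/p, W/p)
        x = x.flatten(2).transpose(1, 2)              # (B*T, N, D) per-frame patch tokens
        x = x.reshape(b, f * x.shape[1], -1)          # joint spatio-temporal token sequence
        x = self.head(self.blocks(x))                 # (B, T*N, C*p*p)
        x = x.reshape(b, f, h // p, w // p, p, p, c)
        return x.permute(0, 1, 6, 2, 4, 3, 5).reshape(b, f, c, h, w)


# Frozen pre-trained VAE tokenizer; the specific checkpoint here is an assumption.
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")
vae.requires_grad_(False)
vae.eval()

model = VDTBackbone(patch_size=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # AdamW, initial lr 1e-4


def training_step(video: torch.Tensor) -> torch.Tensor:
    """One simplified noise-prediction step on VAE latents (video: (B, T, 3, H, W) in [-1, 1])."""
    b, f = video.shape[:2]
    with torch.no_grad():  # the tokenizer stays frozen throughout training
        latents = vae.encode(video.flatten(0, 1)).latent_dist.sample() * 0.18215
    latents = latents.reshape(b, f, *latents.shape[1:])

    noise = torch.randn_like(latents)
    t = torch.randint(0, 1000, (b,), device=latents.device)
    alpha = (1.0 - t.float() / 1000).view(b, 1, 1, 1, 1)  # toy schedule, not the paper's
    noisy = alpha.sqrt() * latents + (1 - alpha).sqrt() * noise

    loss = nn.functional.mse_loss(model(noisy, t), noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss
```

A single call such as `training_step(torch.randn(2, 4, 3, 64, 64).clamp(-1, 1))` runs one optimization step on random frames; only the transformer receives gradients, which matches the frozen-tokenizer setup quoted in the table.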