Make-A-Video: Text-to-Video Generation without Text-Video Data

Authors: Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, Devi Parikh, Sonal Gupta, Yaniv Taigman

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Evidence from the paper's section headings: 4 EXPERIMENTS; 4.1 DATASETS AND SETTINGS; 4.2 QUANTITATIVE RESULTS; 4.3 QUALITATIVE RESULTS; 6.1 DISENTANGLING EFFICACY OF THE T2I AND I2V COMPONENTS; 6.2 ABLATION STUDY; 6.3 EFFECTS OF DIFFERENT COMPONENTS.
Researcher Affiliation | Collaboration | Authors: Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, Devi Parikh, Sonal Gupta, Yaniv Taigman (Meta AI). Corresponding author: urielsinger@meta.com. Jie An and Songyang Zhang are from the University of Rochester (work done during an internship at Meta).
Pseudocode | No | No pseudocode or algorithm blocks were found in the paper.
Open Source Code | No | The paper states that 'Video samples are available at make-a-video.github.io' and 'More video generation examples and applications can be found here: make-a-video.github.io'. This links to a project website with qualitative results and demos, but the paper provides no explicit statement about, or link to, source code for the described method.
Open Datasets | Yes | To train the image models, the authors use a 2.3B subset of the dataset from (Schuhmann et al.) where the text is English. They use WebVid-10M (Bain et al., 2021) and a 10M subset from HD-VILA-100M (Xue et al., 2022) to train the video generation models; note that only the videos (no aligned text) are used. The decoder D^t and the interpolation model are trained on WebVid-10M. SR^t_l is trained on both WebVid-10M and HD-VILA-10M. (A hedged data-assembly sketch appears after this table.)
Dataset Splits | No | The paper mentions using UCF-101 and MSR-VTT for automatic evaluation, plus a custom human evaluation set. It states that 'all 59,794 captions from the test set are used' for MSR-VTT, but it does not provide the validation-split details (e.g., percentages, sample counts, or citations to predefined splits) needed for reproducibility.
Hardware Specification | No | The acknowledgements mention 'providing extra compute for our experimentation', but the paper does not specify hardware details such as GPU models, CPU models, or memory used for training or inference.
Software Dependencies | No | The paper does not provide version numbers for software dependencies such as programming languages, libraries (e.g., PyTorch, TensorFlow), or other tools required to reproduce the experiments.
Experiment Setup | Yes | Table 7 ('Hyperparameters for the models') provides detailed settings for P, D, D^t, F, SR_l, SR^t_l, and SR_h, including diffusion steps, objective, sampling steps, model size, channels, depth, channels multiple, heads channels, attention resolution, text-encoder context/width/depth/heads, dropout, weight decay, batch size, iterations, learning rate, Adam β2, Adam ε, EMA decay, and model parameters.
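
The Experiment Setup row above only names the fields reported in Table 7. As a minimal sketch, the skeleton below shows one way a per-model configuration record could capture those fields for a reproduction attempt; all values are placeholders (None), and the model keys simply mirror the paper's notation (prior P, decoder D, temporal decoder D^t, frame interpolation F, super-resolution SR_l, SR^t_l, SR_h). This is an assumption-laden illustration, not the authors' released code.

```python
"""Hedged sketch: a per-model hyperparameter record shaped like Table 7.

Values are placeholders (None), not the paper's numbers; only the field
names follow the table's row labels.
"""
from dataclasses import dataclass, asdict
from typing import Optional, Tuple

@dataclass
class DiffusionModelConfig:
    diffusion_steps: Optional[int] = None
    objective: Optional[str] = None              # training objective, per Table 7
    sampling_steps: Optional[int] = None
    model_size: Optional[str] = None
    channels: Optional[int] = None
    depth: Optional[int] = None
    channels_multiple: Optional[Tuple[int, ...]] = None
    heads_channels: Optional[int] = None
    attention_resolution: Optional[Tuple[int, ...]] = None
    text_encoder_context: Optional[int] = None
    text_encoder_width: Optional[int] = None
    text_encoder_depth: Optional[int] = None
    text_encoder_heads: Optional[int] = None
    dropout: Optional[float] = None
    weight_decay: Optional[float] = None
    batch_size: Optional[int] = None
    iterations: Optional[int] = None
    learning_rate: Optional[float] = None
    adam_beta2: Optional[float] = None
    adam_eps: Optional[float] = None
    ema_decay: Optional[float] = None
    model_parameters: Optional[str] = None       # parameter count

# One record per model named in Table 7; fill in from the table to reproduce training.
configs = {name: DiffusionModelConfig()
           for name in ["P", "D", "D^t", "F", "SR_l", "SR^t_l", "SR_h"]}
print(asdict(configs["P"]))
```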
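
The Open Datasets row notes that the video components are trained on WebVid-10M plus a 10M-clip subset of HD-VILA-100M, with the aligned text discarded. The sketch below is a hypothetical illustration of that data assembly only: the CSV paths and column names are assumptions (the paper does not describe its data pipeline), not the authors' implementation.

```python
"""Hedged sketch: assembling a video-only training list as described in the paper.

Paths and column names ("video_path") are placeholders; WebVid-10M and
HD-VILA-100M ship their own metadata formats.
"""
import pandas as pd

# Hypothetical metadata files for the two video corpora.
webvid = pd.read_csv("webvid10m_train_metadata.csv")    # placeholder path
hdvila = pd.read_csv("hdvila100m_metadata.csv")         # placeholder path

# Take a 10M-clip subset of HD-VILA-100M, as the paper describes.
hdvila_10m = hdvila.sample(n=10_000_000, random_state=0)

# Keep only video references: the paper trains the video models on the
# videos alone, discarding the aligned text.
video_only = pd.concat(
    [webvid[["video_path"]], hdvila_10m[["video_path"]]],
    ignore_index=True,
)
video_only.to_csv("video_only_train_list.csv", index=False)
```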