Make-A-Video: Text-to-Video Generation without Text-Video Data
Authors: Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, Devi Parikh, Sonal Gupta, Yaniv Taigman
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The paper contains experimental sections: 4 Experiments; 4.1 Datasets and Settings; 4.2 Quantitative Results; 4.3 Qualitative Results; 6.1 Disentangling Efficacy of the T2I and I2V Components; 6.2 Ablation Study; 6.3 Effects of Different Components. |
| Researcher Affiliation | Collaboration | Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, Devi Parikh, Sonal Gupta, Yaniv Taigman. Corresponding author: urielsinger@meta.com. Jie and Songyang are from the University of Rochester (work done during an internship at Meta). |
| Pseudocode | No | No pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | No | The paper states 'Video samples are available at make-a-video.github.io' and 'More video generation examples and applications can be found here: make-a-video.github.io'. These links point to a project website with qualitative results and demos, but the paper provides no statement or link for the source code of the described methodology. |
| Open Datasets | Yes | To train the image models, we use a 2.3B subset of the dataset from (Schuhmann et al.) where the text is English. We use WebVid-10M (Bain et al., 2021) and a 10M subset from HD-VILA-100M (Xue et al., 2022) to train our video generation models. Note that only the videos (no aligned text) are used. The decoder D^t and the interpolation model are trained on WebVid-10M. SR^t_l is trained on both WebVid-10M and HD-VILA-10M. |
| Dataset Splits | No | The paper mentions using UCF-101 and MSR-VTT for automatic evaluation, and a custom human evaluation set. It states 'all 59,794 captions from the test set are used' for MSR-VTT, but does not provide specific details on validation dataset splits (e.g., percentages, sample counts, or predefined validation splits with citations) for reproducibility. |
| Hardware Specification | No | The acknowledgements section includes thanks for 'providing extra compute for our experimentation', but the paper does not specify hardware details such as GPU models, CPU models, or memory used for training or inference. |
| Software Dependencies | No | The paper does not provide specific version numbers for software dependencies such as programming languages, libraries (e.g., PyTorch, TensorFlow), or other relevant tools required to reproduce the experiments. |
| Experiment Setup | Yes | Table 7 ('Hyperparameters for the models') provides detailed settings for P, D, D^t, F, SR_l, SR^t_l, and SR_h, including Diffusion steps, Objective, Sampling steps, Model size, Channels, Depth, Channels multiple, Heads channels, Attention resolution, Text encoder context/width/depth/heads, Dropout, Weight decay, Batch size, Iterations, Learning rate, Adam β2, Adam ϵ, EMA decay, and Model Parameters (see the illustrative configuration sketch after this table). |
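
As a minimal sketch of how the hyperparameter fields reported in the paper's Table 7 might be organized for a reproduction attempt, the Python dataclass below enumerates exactly the fields named in the row above. The class name and structure are assumptions of this report, not part of the paper, and the actual values are intentionally left unset because they must be copied from Table 7.

```python
# Illustrative sketch only: one record per model (P, D, D^t, F, SR_l, SR^t_l, SR_h),
# mirroring the hyperparameter fields listed in the Experiment Setup row above.
# Field names follow Table 7 of the paper; all values are deliberately left unset here.
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class DiffusionModelHparams:
    """Per-model hyperparameters as enumerated in the reproducibility row above."""
    name: str                                        # e.g. "P", "D", "D^t", "F", "SR_l", "SR^t_l", "SR_h"
    diffusion_steps: Optional[int] = None            # number of diffusion steps used in training
    objective: Optional[str] = None                  # training objective as named in Table 7
    sampling_steps: Optional[int] = None             # steps used at sampling/inference time
    model_size: Optional[str] = None
    channels: Optional[int] = None
    depth: Optional[int] = None
    channels_multiple: Optional[Sequence[int]] = None
    heads_channels: Optional[int] = None
    attention_resolution: Optional[Sequence[int]] = None
    text_encoder_context: Optional[int] = None
    text_encoder_width: Optional[int] = None
    text_encoder_depth: Optional[int] = None
    text_encoder_heads: Optional[int] = None
    dropout: Optional[float] = None
    weight_decay: Optional[float] = None
    batch_size: Optional[int] = None
    iterations: Optional[int] = None
    learning_rate: Optional[float] = None
    adam_beta2: Optional[float] = None
    adam_eps: Optional[float] = None
    ema_decay: Optional[float] = None
    model_parameters: Optional[int] = None           # total parameter count

# A reproduction would instantiate one record per model and fill values from Table 7:
prior_hparams = DiffusionModelHparams(name="P")      # values intentionally left unset here
```

Keeping every field Optional makes it explicit which settings have been transcribed from the paper and which are still missing from a reproduction config.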