Phenaki: Variable Length Video Generation from Open Domain Textual Descriptions

Authors: Ruben Villegas, Mohammad Babaeizadeh, Pieter-Jan Kindermans, Hernan Moraldo, Han Zhang, Mohammad Taghi Saffar, Santiago Castro, Julius Kunze, Dumitru Erhan

ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "To evaluate Phenaki, we test it on the following tasks: 1) text conditional video generation, 2) text-image conditional video generation, 3) open domain time variable text conditional video generation (i.e., story mode), 4) video quantization, and 5) image conditional video generation, a.k.a. video prediction."
Researcher Affiliation | Collaboration | Ruben Villegas (Google Brain, rubville@google.com); Mohammad Babaeizadeh (Google Brain, mbz@google.com); Pieter-Jan Kindermans (Google Brain, pikinder@google.com); Hernan Moraldo (Google Brain, hmoraldo@google.com); Han Zhang (Google Brain, zhanghan@google.com); Mohammad Taghi Saffar (Google Brain, msaffar@google.com); Santiago Castro (University of Michigan, sacastro@umich.edu); Julius Kunze (University College London, juliuskunze@gmail.com); Dumitru Erhan (Google Brain, dumitru@google.com)
Pseudocode | No | The paper describes the architecture and processes, but does not provide formal pseudocode or algorithm blocks.
Open Source Code | No | "Taken together, these issues contribute to our decision not to release the underlying models, code, data or interactive demo at this time."
Open Datasets | Yes | "For image generation, there are datasets with billions of image-text pairs (such as LAION-5B [45] and JFT-4B [67]) while the text-video datasets are substantially smaller e.g. WebVid [4] with 10M videos... Unless specified otherwise, we train a 1.8B parameter Phenaki model on a corpus of 15M text-video pairs at 8 FPS mixed with 50M text-images plus 400M pairs of LAION-400M [45]... To evaluate the video encoding and reconstruction performance of C-ViViT, we use the Moments-in-Time (MiT) [33] dataset... For open domain videos, we test Phenaki on Kinetics-600 [9]..." (A sketch of this data mixing follows the table.)
Dataset Splits | Yes | "MiT contains 802K training, 33K validation and 67K test videos at 25 FPS."
Hardware Specification | No | The paper does not provide specific details about the hardware used for training or experiments (e.g., GPU/CPU models, memory specifications); it mentions 'state-of-the-art computational capabilities' only in general terms.
Software Dependencies | No | The paper mentions the models and frameworks used (e.g., T5-XXL, VQ-GAN, MaskGIT) but does not provide specific version numbers for software dependencies (e.g., Python, PyTorch, or CUDA).
Experiment Setup | Yes | "Unless specified otherwise, we train a 1.8B parameter Phenaki model on a corpus of 15M text-video pairs at 8 FPS mixed with 50M text-images plus 400M pairs of LAION-400M [45] (more details in Appendix B.3). The model used in the visualisations in this paper was trained for 1 million steps at a batch size of 512, which took less than 5 days. In this setup 80% of the training data came from the video dataset and each image dataset contributed 10%... we train using classifier-free guidance by dropping the text condition 10% of the time during training [20, 65]... $\mathcal{L} = \mathcal{L}_{VQ} + 0.1\,\mathcal{L}_{Adv} + 0.1\,\mathcal{L}_{IP} + 1.0\,\mathcal{L}_{VP} + 1.0\,\mathcal{L}_{2}$."
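
The quoted setup trains on a mixture of corpora, with 80% of batches drawn from the 15M text-video pairs and 10% from each of the two image datasets. Below is a minimal sketch of how such proportional sampling could be wired up; the corpus names, the `next_batch()` interface, and the use of Python's `random` module are illustrative assumptions, not the authors' released code.

```python
import random

# Mixing proportions quoted in the experiment setup: 80% of batches from the
# 15M text-video corpus, 10% from each of the two image corpora.
MIXTURE = {
    "text_video_15M": 0.8,   # 15M text-video pairs at 8 FPS
    "text_image_50M": 0.1,   # 50M text-image pairs
    "laion_400M": 0.1,       # LAION-400M text-image pairs
}


def pick_source(rng: random.Random) -> str:
    """Sample a corpus name according to the 80/10/10 mixture weights."""
    names = list(MIXTURE)
    weights = list(MIXTURE.values())
    return rng.choices(names, weights=weights, k=1)[0]


def training_stream(datasets, num_steps, seed=0):
    """Yield (source_name, batch) pairs, one per training step.

    `datasets` maps each corpus name to an object exposing a hypothetical
    `next_batch()` method; the interface exists only to illustrate the mix.
    """
    rng = random.Random(seed)
    for _ in range(num_steps):
        source = pick_source(rng)
        yield source, datasets[source].next_batch()
```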
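
Two further concrete details in the quoted setup are the classifier-free guidance training trick (the text condition is dropped 10% of the time) and the weighted total loss quoted above. The sketch below restates both; the learned null embedding and the `losses` dictionary interface are assumptions made for illustration, not Phenaki's actual implementation.

```python
import random

CFG_DROP_PROB = 0.10  # quoted: text condition dropped 10% of the time

# Weights of the quoted total loss:
# L = L_VQ + 0.1 L_Adv + 0.1 L_IP + 1.0 L_VP + 1.0 L_2
LOSS_WEIGHTS = {"vq": 1.0, "adv": 0.1, "ip": 0.1, "vp": 1.0, "l2": 1.0}


def maybe_drop_text(text_embedding, null_embedding, rng: random.Random):
    """Classifier-free guidance training: with probability 0.1, replace the
    text embedding by a null embedding (assumed here to be learned) so the
    model also covers the unconditional case."""
    return null_embedding if rng.random() < CFG_DROP_PROB else text_embedding


def total_loss(losses: dict) -> float:
    """Combine scalar loss terms keyed by 'vq', 'adv', 'ip', 'vp', 'l2'
    using the weights quoted above."""
    return sum(LOSS_WEIGHTS[name] * value for name, value in losses.items())


# Example: total_loss({"vq": 0.5, "adv": 0.2, "ip": 0.3, "vp": 0.4, "l2": 0.1})
# evaluates to 0.5 + 0.02 + 0.03 + 0.4 + 0.1 = 1.05
```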