Phenaki: Variable Length Video Generation from Open Domain Textual Descriptions

Authors: Ruben Villegas, Mohammad Babaeizadeh, Pieter-Jan Kindermans, Hernan Moraldo, Han Zhang, Mohammad Taghi Saffar, Santiago Castro, Julius Kunze, Dumitru Erhan

ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "To evaluate Phenaki, we test it on the following tasks: 1) text conditional video generation, 2) text-image conditional video generation, 3) open domain time variable text conditional video generation (i.e., story mode), 4) video quantization, and 5) image conditional video generation, a.k.a. video prediction."
Researcher Affiliation | Collaboration | Ruben Villegas (Google Brain, rubville@google.com); Mohammad Babaeizadeh (Google Brain, mbz@google.com); Pieter-Jan Kindermans (Google Brain, pikinder@google.com); Hernan Moraldo (Google Brain, hmoraldo@google.com); Han Zhang (Google Brain, zhanghan@google.com); Mohammad Taghi Saffar (Google Brain, msaffar@google.com); Santiago Castro (University of Michigan, sacastro@umich.edu); Julius Kunze (University College London, juliuskunze@gmail.com); Dumitru Erhan (Google Brain, dumitru@google.com)
Pseudocode | No | The paper describes the architecture and processes, but does not provide formal pseudocode or algorithm blocks.
Open Source Code | No | "Taken together, these issues contribute to our decision not to release the underlying models, code, data or interactive demo at this time."
Open Datasets | Yes | "For image generation, there are datasets with billions of image-text pairs (such as LAION-5B [45] and JFT-4B [67]) while the text-video datasets are substantially smaller e.g. WebVid [4] with 10M videos... Unless specified otherwise, we train a 1.8B parameter Phenaki model on a corpus of 15M text-video pairs at 8 FPS mixed with 50M text-images plus 400M pairs of LAION-400M [45]... To evaluate the video encoding and reconstruction performance of C-ViViT, we use the Moments-in-Time (MiT) [33] dataset... For open domain videos, we test Phenaki on Kinetics-600 [9]..." (A sketch of this data mixing follows the table.)
Dataset Splits | Yes | "MiT contains 802K training, 33K validation and 67K test videos at 25 FPS."
Hardware Specification | No | The paper does not provide specific details about the hardware used for training or experiments (e.g., GPU/CPU models, memory specifications); it mentions 'state-of-the-art computational capabilities' only in general terms.
Software Dependencies | No | The paper mentions the models and frameworks used (e.g., T5-XXL, VQ-GAN, MaskGIT) but does not provide specific version numbers for software dependencies (e.g., Python, PyTorch, or CUDA).
Experiment Setup | Yes | "Unless specified otherwise, we train a 1.8B parameter Phenaki model on a corpus of 15M text-video pairs at 8 FPS mixed with 50M text-images plus 400M pairs of LAION-400M [45] (more details in Appendix B.3). The model used in the visualisations in this paper was trained for 1 million steps at a batch size of 512, which took less than 5 days. In this setup 80% of the training data came from the video dataset and each image dataset contributed 10%... we train using classifier-free guidance by dropping the text condition 10% of the time during training [20, 65]... $\mathcal{L} = \mathcal{L}_{VQ} + 0.1\,\mathcal{L}_{Adv} + 0.1\,\mathcal{L}_{IP} + 1.0\,\mathcal{L}_{VP} + 1.0\,\mathcal{L}_{2}$."
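
The quoted setup trains on a mixture of corpora, with 80% of batches drawn from the 15M text-video pairs and 10% from each of the two image datasets. Below is a minimal sketch of how such proportional sampling could be wired up; the corpus names, the `next_batch()` interface, and the use of Python's `random` module are illustrative assumptions, not the authors' released code.

```python
import random

# Mixing proportions quoted in the experiment setup: 80% of batches from the
# 15M text-video corpus, 10% from each of the two image corpora.
MIXTURE = {
    "text_video_15M": 0.8,   # 15M text-video pairs at 8 FPS
    "text_image_50M": 0.1,   # 50M text-image pairs
    "laion_400M": 0.1,       # LAION-400M text-image pairs
}


def pick_source(rng: random.Random) -> str:
    """Sample a corpus name according to the 80/10/10 mixture weights."""
    names = list(MIXTURE)
    weights = list(MIXTURE.values())
    return rng.choices(names, weights=weights, k=1)[0]


def training_stream(datasets, num_steps, seed=0):
    """Yield (source_name, batch) pairs, one per training step.

    `datasets` maps each corpus name to an object exposing a hypothetical
    `next_batch()` method; the interface exists only to illustrate the mix.
    """
    rng = random.Random(seed)
    for _ in range(num_steps):
        source = pick_source(rng)
        yield source, datasets[source].next_batch()
```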
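
Two further concrete details in the quoted setup are the classifier-free guidance training trick (the text condition is dropped 10% of the time) and the weighted total loss quoted above. The sketch below restates both; the learned null embedding and the `losses` dictionary interface are assumptions made for illustration, not Phenaki's actual implementation.

```python
import random

CFG_DROP_PROB = 0.10  # quoted: text condition dropped 10% of the time

# Weights of the quoted total loss:
# L = L_VQ + 0.1 L_Adv + 0.1 L_IP + 1.0 L_VP + 1.0 L_2
LOSS_WEIGHTS = {"vq": 1.0, "adv": 0.1, "ip": 0.1, "vp": 1.0, "l2": 1.0}


def maybe_drop_text(text_embedding, null_embedding, rng: random.Random):
    """Classifier-free guidance training: with probability 0.1, replace the
    text embedding by a null embedding (assumed here to be learned) so the
    model also covers the unconditional case."""
    return null_embedding if rng.random() < CFG_DROP_PROB else text_embedding


def total_loss(losses: dict) -> float:
    """Combine scalar loss terms keyed by 'vq', 'adv', 'ip', 'vp', 'l2'
    using the weights quoted above."""
    return sum(LOSS_WEIGHTS[name] * value for name, value in losses.items())


# Example: total_loss({"vq": 0.5, "adv": 0.2, "ip": 0.3, "vp": 0.4, "l2": 0.1})
# evaluates to 0.5 + 0.02 + 0.03 + 0.4 + 0.1 = 1.05
```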