Probabilistic Adaptation of Black-Box Text-to-Video Models

Authors: Sherry Yang, Yilun Du, Bo Dai, Dale Schuurmans, Joshua B. Tenenbaum, Pieter Abbeel

ICLR 2024

Reproducibility Variable Result LLM Response
Research Type Experimental Our experiments show that, by incorporating broad knowledge and fidelity of the pretrained model probabilistically, a small model with as few as 1.25% parameters of the pretrained model can generate high-quality yet domain-specific videos for a variety of downstream domains such as animation, egocentric modeling, and modeling of simulated and real-world robotics data.
Researcher Affiliation Collaboration Sherry Yang1,2, Yilun Du3, Bo Dai1, Dale Schuurmans1,4, Joshua B. Tenenbaum3, Pieter Abbeel2; 1Google DeepMind, 2UC Berkeley, 3MIT, 4University of Alberta
Pseudocode Yes Algorithm 1 Sampling algorithm of Video Adapter
Open Source Code Yes See website at https://video-adapter.github.io.
Open Datasets Yes For the Bridge dataset (Ebert et al., 2021), we directly use the released open-source dataset. For Ego4D (Grauman et al., 2022) data, we take a small portion of the released dataset. For the Language Table dataset, we used the data from (Lynch et al., 2022).
Dataset Splits No "For Ego4D, we take a subset of the original dataset consisting of 97k text-video pairs and split them into train (90%) and test (10%) to form DAdapt. For the Bridge Data, we take the entire dataset consisting of 7.2k text-video pairs and use the same train-test split to form DAdapt." (No explicit validation split is mentioned.)
Hardware Specification Yes The large 5.6B pretrained model requires 512 TPU-v4 chips, whereas various small models require anywhere between 8 and 256 TPU-v4 chips depending on the size.
Software Dependencies No The paper mentions using T5-XXL, but does not provide specific version numbers for any software dependencies or libraries.
Experiment Setup Yes We train each of our video diffusion models for 2M steps using batch size 2048 with learning rate 1e-4 and 10k linear warmup steps. The large 5.6B pretrained model requires 512 TPU-v4 chips, whereas various small models require anywhere between 8 and 256 TPU-v4 chips depending on the size. We use a log-SNR noise schedule with range [-20, 20].
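The core idea quoted above, treating the black-box pretrained model as a probabilistic prior whose score is combined with a small domain-specific model's score at sampling time, can be sketched in a few lines. This is a minimal illustration, not the authors' Algorithm 1: the function names `eps_pretrained` and `eps_small`, the interpolation weight `w`, and the linear log-SNR parameterization are all assumptions made for the sketch.

```python
import numpy as np

def composed_eps(eps_pretrained, eps_small, x_t, t, w=0.5):
    """Hedged sketch of probabilistic composition for diffusion models.

    Sampling from a product of densities p_pre(x)^w * p_small(x)^(1-w)
    corresponds (up to normalization) to summing weighted scores, i.e.
    a weighted sum of the two models' predicted noises. `w` trades off
    the pretrained prior against the small adapted model (assumed form).
    """
    return w * eps_pretrained(x_t, t) + (1.0 - w) * eps_small(x_t, t)

def log_snr_schedule(t, logsnr_max=20.0, logsnr_min=-20.0):
    """One common parameterization (assumed here): interpolate log SNR
    linearly in t over the [-20, 20] range stated in the experiment setup,
    so t=0 is the least noisy step and t=1 the most noisy."""
    return logsnr_max + t * (logsnr_min - logsnr_max)

# Toy usage with dummy denoisers standing in for the two networks.
pre = lambda x, t: np.ones_like(x)    # stand-in for the 5.6B prior
small = lambda x, t: np.zeros_like(x) # stand-in for the small adapter
x_t = np.zeros((4, 8))                # dummy noisy sample
eps = composed_eps(pre, small, x_t, t=0.5, w=0.5)
```

At `w=0` sampling follows the small domain model alone; at `w=1` it follows the pretrained prior alone, so `w` plays a role analogous to a guidance weight.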