Probabilistic Adaptation of Black-Box Text-to-Video Models

Authors: Sherry Yang, Yilun Du, Bo Dai, Dale Schuurmans, Joshua B. Tenenbaum, Pieter Abbeel

ICLR 2024

Reproducibility Variable Result LLM Response
Research Type Experimental Our experiments show that, by incorporating broad knowledge and fidelity of the pretrained model probabilistically, a small model with as few as 1.25% parameters of the pretrained model can generate high-quality yet domain-specific videos for a variety of downstream domains such as animation, egocentric modeling, and modeling of simulated and real-world robotics data.
Researcher Affiliation Collaboration Sherry Yang1,2, Yilun Du3, Bo Dai1, Dale Schuurmans1,4, Joshua B. Tenenbaum3, Pieter Abbeel2; 1Google DeepMind, 2UC Berkeley, 3MIT, 4University of Alberta
Pseudocode Yes Algorithm 1 Sampling algorithm of Video Adapter
Open Source Code Yes See website at https://video-adapter.github.io.
Open Datasets Yes For the Bridge dataset (Ebert et al., 2021), we directly use the released open-source dataset. For Ego4D (Grauman et al., 2022) data, we take a small portion of the released dataset. For the Language Table dataset, we used the data from (Lynch et al., 2022).
Dataset Splits No "For Ego4D, we take a subset of the original dataset consisting of 97k text-video pairs and split them into train (90%) and test (10%) to form DAdapt. For the Bridge Data, we take the entire dataset consisting of 7.2k text-video pairs and use the same train-test split to form DAdapt." (No explicit validation split is mentioned.)
Hardware Specification Yes The large 5.6B pretrained model requires 512 TPU-v4 chips, whereas various small models require anywhere between 8 and 256 TPU-v4 chips depending on the size.
Software Dependencies No The paper mentions using T5-XXL, but does not provide specific version numbers for any software dependencies or libraries.
Experiment Setup Yes We train each of our video diffusion models for 2M steps using batch size 2048 with learning rate 1e-4 and 10k linear warmup steps. The large 5.6B pretrained model requires 512 TPU-v4 chips, whereas various small models require anywhere between 8 and 256 TPU-v4 chips depending on the size. We use a log-SNR noise schedule with range [-20, 20].
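The core idea quoted above, treating the black-box pretrained model as a probabilistic prior whose score is combined with a small domain-specific model's score at sampling time, can be sketched in a few lines. This is a minimal illustration, not the authors' Algorithm 1: the function names `eps_pretrained` and `eps_small`, the interpolation weight `w`, and the linear log-SNR parameterization are all assumptions made for the sketch.

```python
import numpy as np

def composed_eps(eps_pretrained, eps_small, x_t, t, w=0.5):
    """Hedged sketch of probabilistic composition for diffusion models.

    Sampling from a product of densities p_pre(x)^w * p_small(x)^(1-w)
    corresponds (up to normalization) to summing weighted scores, i.e.
    a weighted sum of the two models' predicted noises. `w` trades off
    the pretrained prior against the small adapted model (assumed form).
    """
    return w * eps_pretrained(x_t, t) + (1.0 - w) * eps_small(x_t, t)

def log_snr_schedule(t, logsnr_max=20.0, logsnr_min=-20.0):
    """One common parameterization (assumed here): interpolate log SNR
    linearly in t over the [-20, 20] range stated in the experiment setup,
    so t=0 is the least noisy step and t=1 the most noisy."""
    return logsnr_max + t * (logsnr_min - logsnr_max)

# Toy usage with dummy denoisers standing in for the two networks.
pre = lambda x, t: np.ones_like(x)    # stand-in for the 5.6B prior
small = lambda x, t: np.zeros_like(x) # stand-in for the small adapter
x_t = np.zeros((4, 8))                # dummy noisy sample
eps = composed_eps(pre, small, x_t, t=0.5, w=0.5)
```

At `w=0` sampling follows the small domain model alone; at `w=1` it follows the pretrained prior alone, so `w` plays a role analogous to a guidance weight.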