Probabilistic Adaptation of Black-Box Text-to-Video Models
Authors: Sherry Yang, Yilun Du, Bo Dai, Dale Schuurmans, Joshua B. Tenenbaum, Pieter Abbeel
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments show that, by incorporating the broad knowledge and fidelity of the pretrained model probabilistically, a small model with as few as 1.25% of the parameters of the pretrained model can generate high-quality yet domain-specific videos for a variety of downstream domains such as animation, egocentric modeling, and modeling of simulated and real-world robotics data. |
| Researcher Affiliation | Collaboration | Sherry Yang (Google DeepMind, UC Berkeley), Yilun Du (MIT), Bo Dai (Google DeepMind), Dale Schuurmans (Google DeepMind, University of Alberta), Joshua B. Tenenbaum (MIT), Pieter Abbeel (UC Berkeley) |
| Pseudocode | Yes | Algorithm 1: Sampling algorithm of Video Adapter (see the composed-sampling sketch below the table). |
| Open Source Code | Yes | See website at https://video-adapter.github.io. |
| Open Datasets | Yes | For the Bridge dataset (Ebert et al., 2021), we directly use the released open-source dataset. For Ego4D (Grauman et al., 2022), we take a small portion of the released dataset. For the Language Table dataset, we use the data from Lynch et al. (2022). |
| Dataset Splits | No | For Ego4D, we take a subset of the original dataset consisting of 97k text-video pairs and split them into train (90%) and test (10%) to form D_Adapt. For the Bridge Data, we take the entire dataset consisting of 7.2k text-video pairs and use the same train-test split to form D_Adapt. (No explicit validation split is mentioned; see the split sketch below the table.) |
| Hardware Specification | Yes | The large 5.6B pretrained model requires 512 TPU-v4 chips, whereas various small models require anywhere between 8 and 256 TPU-v4 chips depending on the size. |
| Software Dependencies | No | The paper mentions using T5-XXL, but does not provide specific version numbers for any software dependencies or libraries. |
| Experiment Setup | Yes | We train each of our video diffusion models for 2M steps using batch size 2048 with learning rate 1e-4 and 10k linear warmup steps. The large 5.6B pretrained model requires 512 TPU-v4 chips, whereas various small models require anywhere between 8 and 256 TPU-v4 chips depending on the size. We use a log-SNR noise schedule with range [-20, 20]. (See the warmup and noise-schedule sketch below the table.) |
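
The pseudocode row points to Algorithm 1, which samples from the composition of the frozen pretrained prior and the small adapted model. Below is a minimal Python sketch of one reverse-diffusion step under that idea, assuming both models expose noise-prediction callables; the composition weight `eta`, the function names, and the plain DDPM update are illustrative assumptions, not the paper's exact Algorithm 1.

```python
import numpy as np

def make_composed_eps(eps_pretrained, eps_adapter, eta=0.5):
    """Return a noise-prediction function combining a frozen pretrained
    prior with a small adapter model (product-of-experts style sketch).
    `eta` is a hypothetical trade-off weight, not a value from the paper."""
    def eps_fn(x_t, t):
        return eps_pretrained(x_t, t) + eta * eps_adapter(x_t, t)
    return eps_fn

def ddpm_reverse_step(x_t, t, eps_fn, alphas, alphas_bar, rng):
    """One ancestral DDPM step x_t -> x_{t-1} using the composed score."""
    eps = eps_fn(x_t, t)
    mean = (x_t - (1.0 - alphas[t]) / np.sqrt(1.0 - alphas_bar[t]) * eps) \
           / np.sqrt(alphas[t])
    # No noise is added at the final step t = 0.
    noise = rng.standard_normal(x_t.shape) if t > 0 else np.zeros_like(x_t)
    return mean + np.sqrt(1.0 - alphas[t]) * noise
```

Summing noise predictions is a standard way to approximate sampling from a product of diffusion densities, since each prediction is proportional to the score of its model's noised marginal.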
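The dataset-splits row describes a 90/10 train/test split over text-video pairs to form D_Adapt. A minimal sketch of how such a split could be constructed is below; the seed and shuffling strategy are assumptions, since the paper does not specify them.

```python
import numpy as np

def train_test_split(pairs, test_frac=0.1, seed=0):
    """Shuffle text-video pairs and carve out a held-out test set.

    Mirrors the reported 90/10 train/test split for Ego4D and Bridge;
    the seed and permutation-based shuffle are assumptions."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(pairs))
    n_test = int(len(pairs) * test_frac)
    test = [pairs[i] for i in idx[:n_test]]
    train = [pairs[i] for i in idx[n_test:]]
    return train, test
```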
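The setup row fixes a few concrete hyperparameters: a peak learning rate of 1e-4 reached after 10k linear warmup steps, and a log-SNR noise schedule on [-20, 20]. The sketch below illustrates both; keeping the rate constant after warmup, and the variance-preserving mapping alpha^2 = sigmoid(logsnr) and sigma^2 = sigmoid(-logsnr), are assumptions beyond what the paper states.

```python
import numpy as np

def learning_rate(step, base_lr=1e-4, warmup_steps=10_000):
    """Linear warmup to the reported peak LR of 1e-4 over 10k steps.

    Holding the rate constant afterwards is an assumption; the paper
    only states the peak value and the warmup length."""
    return base_lr * min(1.0, step / warmup_steps)

def alpha_sigma_from_logsnr(logsnr):
    """Map a log-SNR value in [-20, 20] to variance-preserving diffusion
    coefficients: alpha^2 = sigmoid(logsnr), sigma^2 = sigmoid(-logsnr)."""
    alpha = np.sqrt(1.0 / (1.0 + np.exp(-logsnr)))
    sigma = np.sqrt(1.0 / (1.0 + np.exp(logsnr)))
    return alpha, sigma
```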