Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Probabilistic Adaptation of Black-Box Text-to-Video Models
Authors: Sherry Yang, Yilun Du, Bo Dai, Dale Schuurmans, Joshua B. Tenenbaum, Pieter Abbeel
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments show that, by incorporating broad knowledge and fidelity of the pretrained model probabilistically, a small model with as few as 1.25% parameters of the pretrained model can generate high-quality yet domain-specific videos for a variety of downstream domains such as animation, egocentric modeling, and modeling of simulated and real-world robotics data. |
| Researcher Affiliation | Collaboration | Sherry Yang (1, 2), Yilun Du (3), Bo Dai (1), Dale Schuurmans (1, 4), Joshua B. Tenenbaum (3), Pieter Abbeel (2). Affiliations: (1) Google DeepMind, (2) UC Berkeley, (3) MIT, (4) University of Alberta |
| Pseudocode | Yes | Algorithm 1 Sampling algorithm of Video Adapter |
| Open Source Code | Yes | See website at https://video-adapter.github.io. |
| Open Datasets | Yes | For the Bridge (Ebert et al., 2021) we directly use the released opensource dataset. For Ego4D (Grauman et al., 2022) data, we take a small portion of the released dataset. For the Language Table dataset, we used the data from (Lynch et al., 2022). |
| Dataset Splits | No | "For Ego4D, we take a subset of the original dataset consisting of 97k text-video pairs and split them into train (90%) and test (10%) to form DAdapt. For the Bridge Data, we take the entire dataset consisting of 7.2k text-video pairs and use the same train-test split to form DAdapt." (No explicit validation split is mentioned.) |
| Hardware Specification | Yes | The large 5.6B pretrained model requires 512 TPU-v4 chips, whereas various small models require anywhere between 8 and 256 TPU-v4 chips depending on the size. |
| Software Dependencies | No | The paper mentions using T5-XXL, but does not provide specific version numbers for any software dependencies or libraries. |
| Experiment Setup | Yes | We train each of our video diffusion models for 2M steps using batch size 2048 with learning rate 1e-4 and 10k linear warmup steps. The large 5.6B pretrained model requires 512 TPU-v4 chips, whereas various small models require anywhere between 8 and 256 TPU-v4 chips depending on the size. We use noise schedule log SNR with range [-20, 20]. |
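
The learning-rate settings quoted in the Experiment Setup row (peak rate 1e-4, 10k linear warmup steps) can be illustrated with a minimal sketch. The function name `learning_rate` is hypothetical, and holding the rate constant after warmup is an assumption: the quoted text does not say what schedule follows the warmup phase, and this is not taken from the authors' code.

```python
def learning_rate(step: int, peak_lr: float = 1e-4, warmup_steps: int = 10_000) -> float:
    """Linear warmup from 0 to peak_lr over warmup_steps.

    Assumption: the rate stays constant at peak_lr after warmup; the
    source only states the peak rate and the warmup length.
    """
    if step < 0:
        raise ValueError("step must be non-negative")
    if step < warmup_steps:
        # Scale the peak rate by the fraction of warmup completed.
        return peak_lr * step / warmup_steps
    return peak_lr


if __name__ == "__main__":
    # Print the rate at a few points across the 2M-step training run.
    for step in (0, 5_000, 10_000, 2_000_000):
        print(step, learning_rate(step))
```

If the actual training used a decay after warmup, only the final `return` would change; the warmup branch follows directly from the quoted settings.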