Pre-trained Text-to-Image Diffusion Models Are Versatile Representation Learners for Control
Authors: Gunshi Gupta, Karmesh Yadav, Yarin Gal, Dhruv Batra, Zsolt Kira, Cong Lu, Tim G. J. Rudner
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In Section 4, we evaluate the representation learning capabilities of diffusion models on a broad range of embodied control tasks, ranging from purely vision-based tasks to problems that require an understanding of tasks through text prompts, thereby showcasing the versatility of diffusion model representations. |
| Researcher Affiliation | Academia | ¹University of Oxford, ²Georgia Institute of Technology, ³New York University |
| Pseudocode | No | The paper describes its processes and methods in prose and with diagrams, but it does not contain explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code: github.com/ykarmesh/stable-control-representations |
| Open Datasets | Yes | We use a small subset of the collection of datasets used by prior works on representation learning for embodied AI [27, 54]: we use subsets of the Epic Kitchens [9], Something-Something-v2 [SS-v2; 13], and Bridge-v2 [49] datasets. |
| Dataset Splits | Yes | The dataset uses 72 training and 14 validation scenes from the Gibson [53] scene dataset with evaluation conducted on a total of 4200 episodes. |
| Hardware Specification | Yes | We train our agents using the distributed version of PPO [52] with 152 environments spread across 4 80GB Nvidia A100 GPUs. Each run also has access to 96 CPUs and 754 GB of RAM. |
| Software Dependencies | No | The paper mentions using the "diffusers" library and the Hugging Face CLIP fine-tuning implementation but does not provide specific version numbers for these software dependencies in the text. |
| Experiment Setup | Yes | The training uses a mini-batch size of 256 and a learning rate of 10⁻³ (see the sketch after the table). |
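As a quick illustration of the reported training configuration, the sketch below wires the quoted hyperparameters (152 environments, 4 GPUs, mini-batch size 256, learning rate 10⁻³) into a PyTorch optimizer. This is a minimal sketch under stated assumptions: the `PolicyStub` class, its dimensions, and the choice of Adam are hypothetical placeholders, not taken from the authors' released code at github.com/ykarmesh/stable-control-representations, which should be consulted for the actual PPO implementation.

```python
import torch
import torch.nn as nn

# Hyperparameters quoted in the paper's reproducibility details.
NUM_ENVS = 152          # parallel environments for distributed PPO
NUM_GPUS = 4            # 80 GB Nvidia A100 GPUs
MINI_BATCH_SIZE = 256   # PPO mini-batch size
LEARNING_RATE = 1e-3    # reported as 10^-3


class PolicyStub(nn.Module):
    """Hypothetical stand-in for the agent's policy network."""

    def __init__(self, obs_dim: int = 512, act_dim: int = 6):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 256),
            nn.ReLU(),
            nn.Linear(256, act_dim),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)


policy = PolicyStub()
# Optimizer choice (Adam) is an assumption; the paper's quoted text only
# gives the mini-batch size and learning rate.
optimizer = torch.optim.Adam(policy.parameters(), lr=LEARNING_RATE)

# Each PPO update would draw mini-batches of MINI_BATCH_SIZE transitions
# collected from NUM_ENVS environments sharded across NUM_GPUS devices.
```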