Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Pre-trained Text-to-Image Diffusion Models Are Versatile Representation Learners for Control
Authors: Gunshi Gupta, Karmesh Yadav, Yarin Gal, Dhruv Batra, Zsolt Kira, Cong Lu, Tim G. J. Rudner
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In Section 4, we evaluate the representation learning capabilities of diffusion models on a broad range of embodied control tasks, ranging from purely vision-based tasks to problems that require an understanding of tasks through text prompts, thereby showcasing the versatility of diffusion model representations. |
| Researcher Affiliation | Academia | ¹University of Oxford, ²Georgia Institute of Technology, ³New York University |
| Pseudocode | No | The paper describes its processes and methods in prose and with diagrams, but it does not contain explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code: github.com/ykarmesh/stable-control-representations |
| Open Datasets | Yes | We use a small subset of the collection of datasets used by prior works on representation learning for embodied AI [27, 54]: we use subsets of the Epic Kitchens [9], Something-Something-v2 [SS-v2; 13], and Bridge-v2 [49] datasets. |
| Dataset Splits | Yes | The dataset uses 72 training and 14 validation scenes from the Gibson [53] scene dataset with evaluation conducted on a total of 4200 episodes. |
| Hardware Specification | Yes | We train our agents using the distributed version of PPO [52] with 152 environments spread across 4 80GB Nvidia A100 GPUs. Each run also has access to 96 CPUs and 754 GBs of RAM. |
| Software Dependencies | No | The paper mentions using the "diffusers library" and "huggingface CLIP finetuning implementation" but does not provide specific version numbers for these software dependencies in the text. |
| Experiment Setup | Yes | The training uses a mini-batch size of 256 and a learning rate of $10^{-3}$. |
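
The Software Dependencies row above notes that the paper names its libraries without version numbers. One way to close that gap when re-running the released code is to record the versions installed in the working environment. The following is a minimal sketch, not part of the paper or its repository: the helper `installed_versions` is hypothetical, and the package list (`diffusers`, `transformers`, `torch`) is an assumption based on the libraries the paper mentions, with `transformers` assumed to be the source of the Hugging Face CLIP implementation.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # Package is not installed in the current environment.
            found[name] = None
    return found


if __name__ == "__main__":
    # Assumed package list; the paper does not state versions for any of these.
    assumed_packages = ["diffusers", "transformers", "torch"]
    for name, ver in installed_versions(assumed_packages).items():
        print(f"{name}=={ver}" if ver else f"# {name}: not installed")
```

Saving this output alongside experiment logs gives a pinned dependency list that can be reused later, which the paper text alone does not provide.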