Learning to Reach Goals via Diffusion
Authors: Vineet Jain, Siamak Ravanbakhsh
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We experimentally validate our approach in various offline goal-reaching tasks, demonstrating substantial performance enhancements compared to state-of-the-art methods while improving computational efficiency over other diffusion-based RL methods by an order of magnitude. Our results suggest that this perspective on diffusion for RL is a simple and scalable approach for sequential decision making. |
| Researcher Affiliation | Academia | ¹Department of Computer Science, McGill University, Montréal, Canada; ²Mila – Quebec Artificial Intelligence Institute, Montréal, Canada. Correspondence to: Vineet Jain <jain.vineet@mila.quebec>. |
| Pseudocode | Yes | Algorithm 1 (Merlin algorithm) and Algorithm 2 (Detailed Merlin algorithm): red and blue statements apply only to Merlin-P and Merlin-NP, respectively; purple statements apply to both. |
| Open Source Code | Yes | Code for Merlin is available at https://github.com/vineetjain96/merlin. |
| Open Datasets | Yes | We evaluate Merlin on several goal-conditioned control tasks using the benchmark introduced in Yang et al. (2021). The benchmark consists of two settings: expert and random. The expert dataset consists of trajectories collected by a policy trained using online DDPG+HER with added Gaussian noise (σ = 0.2) to increase diversity, while the random dataset consists of trajectories collected by sampling random actions. |
| Dataset Splits | No | The paper mentions using an 'offline benchmark' and discusses 'ablations' and 'tuned values' but does not specify the explicit train/validation/test splits, percentages, or methodology for creating these splits within the datasets. |
| Hardware Specification | No | The paper states: 'Mila and the Digital Research Alliance of Canada provided computational resources.' This is a general statement and does not specify particular GPU models, CPU models, or detailed cloud instance specifications. |
| Software Dependencies | No | The paper mentions using 'Adam optimizer' and 'GRU' but does not provide specific version numbers for any software, libraries, or programming languages. |
| Experiment Setup | Yes | The policy is parameterized as a diagonal Gaussian distribution using an MLP with three hidden layers of 256 units each with the ReLU activation function, except for the final layer. The input to the policy comprises the state, the desired goal, and the time horizon. The time horizon is encoded using sinusoidal positional embeddings of 32 dimensions with the maximum period set to T = 50, since that is the maximum trajectory length for all our tasks. The policy was trained for 500k mini-batch updates using the Adam optimizer with a learning rate of 5 × 10⁻⁴ and a batch size of 512. The optimal values for the hindsight ratio and the time horizon are provided in Appendix D.2. A hedged sketch of this setup is given after the table. |
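
The following is a minimal sketch, not the authors' implementation, of the policy architecture and training configuration quoted in the Experiment Setup row, assuming a PyTorch implementation. The names `GaussianPolicy` and `sinusoidal_embedding` and the state/goal/action dimensions are illustrative placeholders, not taken from the Merlin repository.

```python
# Minimal sketch of the described policy, assuming PyTorch.
# Names and dimensions below are illustrative placeholders, not Merlin's code.
import math
import torch
import torch.nn as nn


def sinusoidal_embedding(horizon: torch.Tensor, dim: int = 32, max_period: int = 50) -> torch.Tensor:
    """Encode the remaining time horizon with 32-dim sinusoidal positional embeddings."""
    half = dim // 2
    freqs = torch.exp(-math.log(max_period) * torch.arange(half, dtype=torch.float32) / half)
    args = horizon.float().unsqueeze(-1) * freqs                   # (batch, dim/2)
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)   # (batch, dim)


class GaussianPolicy(nn.Module):
    """Diagonal Gaussian policy: three 256-unit ReLU hidden layers, linear output head."""

    def __init__(self, state_dim: int, goal_dim: int, action_dim: int, embed_dim: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + goal_dim + embed_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 2 * action_dim),                        # mean and log-std per action dim
        )

    def forward(self, state, goal, horizon):
        t_emb = sinusoidal_embedding(horizon)
        mean, log_std = self.net(torch.cat([state, goal, t_emb], dim=-1)).chunk(2, dim=-1)
        return torch.distributions.Normal(mean, log_std.exp())


# Optimizer settings reported in the paper: Adam, learning rate 5e-4, batch size 512,
# 500k mini-batch updates (data loading and the training loss are omitted here).
policy = GaussianPolicy(state_dim=10, goal_dim=3, action_dim=4)    # placeholder dimensions
optimizer = torch.optim.Adam(policy.parameters(), lr=5e-4)

dist = policy(torch.zeros(1, 10), torch.zeros(1, 3), torch.tensor([25]))
action = dist.sample()                                             # (1, action_dim)
```

The diagonal Gaussian head (mean and log standard deviation per action dimension) and the horizon embedding follow the textual description above; the offline dataset, hindsight relabeling, and the diffusion-specific components of Merlin are omitted from this sketch.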