Learning to Reach Goals via Diffusion
Authors: Vineet Jain, Siamak Ravanbakhsh
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We experimentally validate our approach in various offline goal-reaching tasks, demonstrating substantial performance enhancements compared to state-of-the-art methods while improving computational efficiency over other diffusion-based RL methods by an order of magnitude. Our results suggest that this perspective on diffusion for RL is a simple and scalable approach for sequential decision making. |
| Researcher Affiliation | Academia | ¹Department of Computer Science, McGill University, Montréal, Canada; ²Mila – Quebec Artificial Intelligence Institute, Montréal, Canada. Correspondence to: Vineet Jain <jain.vineet@mila.quebec>. |
| Pseudocode | Yes | Algorithm 1 (Merlin algorithm) and Algorithm 2 (Detailed Merlin algorithm): red and blue statements apply only to Merlin-P and Merlin-NP, respectively; purple statements apply to both. |
| Open Source Code | Yes | Code for Merlin is available at https://github.com/vineetjain96/merlin. |
| Open Datasets | Yes | We evaluate Merlin on several goal-conditioned control tasks using the benchmark introduced in Yang et al. (2021). The benchmark consists of two settings: expert and random. The expert dataset consists of trajectories collected by a policy trained using online DDPG+HER with added Gaussian noise (σ = 0.2) to increase diversity, while the random dataset consists of trajectories collected by sampling random actions. |
| Dataset Splits | No | The paper mentions using an 'offline benchmark' and discusses 'ablations' and 'tuned values' but does not specify the explicit train/validation/test splits, percentages, or methodology for creating these splits within the datasets. |
| Hardware Specification | No | The paper states: 'Mila and the Digital Research Alliance of Canada provided computational resources.' This is a general statement and does not specify particular GPU models, CPU models, or detailed cloud instance specifications. |
| Software Dependencies | No | The paper mentions using 'Adam optimizer' and 'GRU' but does not provide specific version numbers for any software, libraries, or programming languages. |
| Experiment Setup | Yes | The policy is parameterized as a diagonal Gaussian distribution using an MLP with three hidden layers of 256 units each with the ReLU activation function, except for the final layer. The input to the policy comprises the state, the desired goal, and the time horizon. The time horizon is encoded using sinusoidal positional embeddings of 32 dimensions with the maximum period set to T = 50, since that is the maximum trajectory length for all our tasks. The policy was trained for 500k mini-batch updates using the Adam optimizer with a learning rate of 5 × 10⁻⁴ and a batch size of 512. The optimal values for the hindsight ratio and the time horizon are provided in Appendix D.2. A hedged sketch of this setup is given after the table. |
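
The following is a minimal sketch, not the authors' implementation, of the policy architecture and training configuration quoted in the Experiment Setup row, assuming a PyTorch implementation. The names `GaussianPolicy` and `sinusoidal_embedding` and the state/goal/action dimensions are illustrative placeholders, not taken from the Merlin repository.

```python
# Minimal sketch of the described policy, assuming PyTorch.
# Names and dimensions below are illustrative placeholders, not Merlin's code.
import math
import torch
import torch.nn as nn


def sinusoidal_embedding(horizon: torch.Tensor, dim: int = 32, max_period: int = 50) -> torch.Tensor:
    """Encode the remaining time horizon with 32-dim sinusoidal positional embeddings."""
    half = dim // 2
    freqs = torch.exp(-math.log(max_period) * torch.arange(half, dtype=torch.float32) / half)
    args = horizon.float().unsqueeze(-1) * freqs                   # (batch, dim/2)
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)   # (batch, dim)


class GaussianPolicy(nn.Module):
    """Diagonal Gaussian policy: three 256-unit ReLU hidden layers, linear output head."""

    def __init__(self, state_dim: int, goal_dim: int, action_dim: int, embed_dim: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + goal_dim + embed_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 2 * action_dim),                        # mean and log-std per action dim
        )

    def forward(self, state, goal, horizon):
        t_emb = sinusoidal_embedding(horizon)
        mean, log_std = self.net(torch.cat([state, goal, t_emb], dim=-1)).chunk(2, dim=-1)
        return torch.distributions.Normal(mean, log_std.exp())


# Optimizer settings reported in the paper: Adam, learning rate 5e-4, batch size 512,
# 500k mini-batch updates (data loading and the training loss are omitted here).
policy = GaussianPolicy(state_dim=10, goal_dim=3, action_dim=4)    # placeholder dimensions
optimizer = torch.optim.Adam(policy.parameters(), lr=5e-4)

dist = policy(torch.zeros(1, 10), torch.zeros(1, 3), torch.tensor([25]))
action = dist.sample()                                             # (1, action_dim)
```

The diagonal Gaussian head (mean and log standard deviation per action dimension) and the horizon embedding follow the textual description above; the offline dataset, hindsight relabeling, and the diffusion-specific components of Merlin are omitted from this sketch.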