Replay across Experiments: A Natural Extension of Off-Policy RL

Authors: Dhruva Tirumala, Thomas Lampe, Jose Enrique Chen, Tuomas Haarnoja, Sandy Huang, Guy Lever, Ben Moran, Tim Hertweck, Leonard Hasenclever, Martin Riedmiller, Nicolas Heess, Markus Wulfmeier

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | We empirically show benefits across a number of RL algorithms and challenging control domains spanning both locomotion and manipulation, including hard exploration tasks from egocentric vision. Through comprehensive ablations, we demonstrate robustness to the quality and amount of data available and various hyperparameter choices. |
| Researcher Affiliation | Collaboration | Google DeepMind; University College London (UCL) |
| Pseudocode | No | The paper describes the algorithms (DMPO, SAC-Q, CRR, AWAC) in detail within the text and appendices but does not include structured pseudocode blocks or clearly labeled algorithm figures. |
| Open Source Code | No | The paper does not contain an explicit statement or link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | Yes | RL Unplugged: Finally, we evaluate with the offline RL benchmark and dataset RL Unplugged (Gulcehre et al., 2020), which includes offline data on various simulated control domains. |
| Dataset Splits | No | The paper mentions collecting 'training data of 4e5 and 2e5 episodes' for some domains and discusses data mixing ratios, but it does not provide explicit train/validation/test dataset splits (e.g., percentages or counts) for reproduction. |
| Hardware Specification | No | The paper does not provide specific hardware details (such as CPU/GPU models, memory, or accelerator types) used for running the experiments. |
| Software Dependencies | No | The paper mentions using specific algorithms and tools like 'MuJoCo' but does not provide version numbers for the software dependencies or libraries used in its implementation. |
| Experiment Setup | Yes | We use a batch size of 128 with 5 seeds for the main results in Section 3 and a batch size of 256 with 2 seeds for the experiments in Section 3.4. At the beginning of each training run, policy and value-function are re-initialized in line with stand-alone experiments. The only required algorithmic change is the availability of a second replay mechanism that allows replaying prior and online data with a particular fixed ratio throughout the course of training (we use a naive 50/50 mix of offline and online data for our main results, without optimizing this ratio). (A sketch of the replay-mixing step follows the table.) |
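The Experiment Setup row captures the only algorithmic change the paper requires: training batches for the (re-initialized) off-policy learner are drawn from two replay buffers, one preloaded with data from prior experiments and one filled online, at a fixed mixing ratio. The sketch below illustrates just that mixing step; the `ReplayBuffer` class, function names, and defaults are illustrative assumptions, not the authors' implementation.

```python
import random
from collections import deque


class ReplayBuffer:
    """Toy FIFO replay buffer; a stand-in for whatever replay the base agent uses."""

    def __init__(self, capacity=1_000_000):
        self.storage = deque(maxlen=capacity)

    def add(self, transition):
        self.storage.append(transition)

    def sample(self, n):
        # Uniform sampling with replacement, for simplicity.
        return [random.choice(self.storage) for _ in range(n)]


def sample_mixed_batch(online_buffer, offline_buffer, batch_size=128, mix_ratio=0.5):
    """Draw `mix_ratio` of the batch from prior-experiment (offline) data and the
    rest from online data; the paper reports a naive 50/50 mix for its main results."""
    n_offline = int(batch_size * mix_ratio)
    n_online = batch_size - n_offline
    return offline_buffer.sample(n_offline) + online_buffer.sample(n_online)


# Example with dummy transitions: preload the offline buffer with data from
# earlier runs, fill the online buffer during training, and sample mixed batches.
offline_buffer = ReplayBuffer()
online_buffer = ReplayBuffer()
for i in range(1000):
    offline_buffer.add({"obs": i, "action": 0, "reward": 0.0})  # prior-run data
    online_buffer.add({"obs": i, "action": 1, "reward": 1.0})   # fresh online data
batch = sample_mixed_batch(online_buffer, offline_buffer)       # 64 offline + 64 online
```

Note that, as the quoted setup states, the policy and value function are re-initialized at the start of each run, so reuse happens entirely through the replayed prior data rather than through warm-started weights.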