Replay across Experiments: A Natural Extension of Off-Policy RL

Authors: Dhruva Tirumala, Thomas Lampe, Jose Enrique Chen, Tuomas Haarnoja, Sandy Huang, Guy Lever, Ben Moran, Tim Hertweck, Leonard Hasenclever, Martin Riedmiller, Nicolas Heess, Markus Wulfmeier

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | We empirically show benefits across a number of RL algorithms and challenging control domains spanning both locomotion and manipulation, including hard exploration tasks from egocentric vision. Through comprehensive ablations, we demonstrate robustness to the quality and amount of data available and various hyperparameter choices. |
| Researcher Affiliation | Collaboration | Google DeepMind; University College London (UCL) |
| Pseudocode | No | The paper describes the algorithms (DMPO, SAC-Q, CRR, AWAC) in detail within the text and appendices but does not include structured pseudocode blocks or clearly labeled algorithm figures. |
| Open Source Code | No | The paper does not contain an explicit statement or link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | Yes | RL Unplugged: Finally, we evaluate with the offline RL benchmark and dataset RL Unplugged (Gulcehre et al., 2020), which includes offline data on various simulated control domains. |
| Dataset Splits | No | The paper mentions collecting 'training data of 4e5 and 2e5 episodes' for some domains and discusses data mixing ratios, but it does not provide explicit train/validation/test dataset splits (e.g., percentages or counts) for reproduction. |
| Hardware Specification | No | The paper does not provide specific hardware details (such as CPU/GPU models, memory, or accelerator types) used for running the experiments. |
| Software Dependencies | No | The paper mentions using specific algorithms and tools like 'MuJoCo' but does not provide version numbers for the software dependencies or libraries used in its implementation. |
| Experiment Setup | Yes | We use a batch size of 128 with 5 seeds for the main results in Section 3 and a batch size of 256 with 2 seeds for the experiments in Section 3.4. At the beginning of each training run, policy and value-function are re-initialized in line with stand-alone experiments. The only required algorithmic change is the availability of a second replay mechanism that allows replaying prior and online data with a particular fixed ratio throughout the course of training (we use a naive 50/50 mix of offline and online data for our main results, without optimizing this ratio). (A sketch of the replay-mixing step follows the table.) |
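The Experiment Setup row captures the only algorithmic change the paper requires: training batches for the (re-initialized) off-policy learner are drawn from two replay buffers, one preloaded with data from prior experiments and one filled online, at a fixed mixing ratio. The sketch below illustrates just that mixing step; the `ReplayBuffer` class, function names, and defaults are illustrative assumptions, not the authors' implementation.

```python
import random
from collections import deque


class ReplayBuffer:
    """Toy FIFO replay buffer; a stand-in for whatever replay the base agent uses."""

    def __init__(self, capacity=1_000_000):
        self.storage = deque(maxlen=capacity)

    def add(self, transition):
        self.storage.append(transition)

    def sample(self, n):
        # Uniform sampling with replacement, for simplicity.
        return [random.choice(self.storage) for _ in range(n)]


def sample_mixed_batch(online_buffer, offline_buffer, batch_size=128, mix_ratio=0.5):
    """Draw `mix_ratio` of the batch from prior-experiment (offline) data and the
    rest from online data; the paper reports a naive 50/50 mix for its main results."""
    n_offline = int(batch_size * mix_ratio)
    n_online = batch_size - n_offline
    return offline_buffer.sample(n_offline) + online_buffer.sample(n_online)


# Example with dummy transitions: preload the offline buffer with data from
# earlier runs, fill the online buffer during training, and sample mixed batches.
offline_buffer = ReplayBuffer()
online_buffer = ReplayBuffer()
for i in range(1000):
    offline_buffer.add({"obs": i, "action": 0, "reward": 0.0})  # prior-run data
    online_buffer.add({"obs": i, "action": 1, "reward": 1.0})   # fresh online data
batch = sample_mixed_batch(online_buffer, offline_buffer)       # 64 offline + 64 online
```

Note that, as the quoted setup states, the policy and value function are re-initialized at the start of each run, so reuse happens entirely through the replayed prior data rather than through warm-started weights.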