Understanding when Dynamics-Invariant Data Augmentations Benefit Model-free Reinforcement Learning Updates
Authors: Nicholas Corrado, Josiah P. Hanna
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimentally, we isolate three relevant aspects of DA: state-action coverage, reward density, and the number of augmented transitions generated per update (the augmented replay ratio). From our experiments, we draw two conclusions: (1) increasing state-action coverage often has a much greater impact on data efficiency than increasing reward density, and (2) decreasing the augmented replay ratio substantially improves data efficiency. |
| Researcher Affiliation | Academia | Nicholas E. Corrado, Department of Computer Sciences, University of Wisconsin-Madison, ncorrado@wisc.edu; Josiah P. Hanna, Department of Computer Sciences, University of Wisconsin-Madison, jphanna@cs.wisc.edu |
| Pseudocode | Yes | Algorithm 1: Off-Policy RL with Data Augmentation (a hedged sketch follows the table) |
| Open Source Code | Yes | Code available at https://github.com/Badger-RL/Understanding Data Augmentation For RL |
| Open Datasets | Yes | We focus our experiments on four sparse-reward, continuous action panda-gym tasks (Gallouédec et al., 2021): PandaPush-v3, PandaSlide-v3, PandaPickAndPlace-v3, and PandaFlip-v3 (Fig. 2) |
| Dataset Splits | No | The paper mentions evaluating over a certain number of seeds and discussing how batch sizes are scaled, but it does not provide specific percentages or counts for training, validation, and test dataset splits, nor does it describe cross-validation setups. |
| Hardware Specification | Yes | We ran all experiments on a compute cluster using a mix of CPU-only and GPU jobs. This cluster contains a mix of Tesla P100-PCIE, GeForce RTX 2080 Ti, and A100-SXM4 GPUs. |
| Software Dependencies | No | The paper mentions using 'Stable Baselines3', 'DDPG', 'TD3', and 'Adam' with citations, but it does not specify version numbers for any of these software libraries or frameworks, which are necessary for full reproducibility. |
| Experiment Setup | Yes | Table 2: Default hyperparameters used in all Panda tasks. Episode length: at most 50 timesteps; Evaluation frequency: 10,000 timesteps; Number of evaluation episodes: 80; Number of environment interactions: 600K (Push), 1M (Slide, Flip), 1.5M (Pick And Place); Random action probability: 0.3; Gaussian action noise scale: 0.2; Number of random actions before learning: 1000; Observed replay buffer size (default): 10^6; Augmented replay buffer size (default): 10^6; Batch size (default): 256 (Push, Slide), 512 (Flip, Pick And Place); Update frequency: every 2 timesteps (observed replay ratio of 0.5); Network: multi-layer perceptron with hidden layers (256, 256, 256); Optimizer: Adam (Kingma and Ba, 2014); Learning rate: 0.001; Polyak averaging coefficient (τ): 0.95 (restated as a config sketch after the table) |
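
The Pseudocode row above cites Algorithm 1 (Off-Policy RL with Data Augmentation). Below is a minimal Python sketch of that style of training loop, assuming simplified `env`, `agent`, and `augment` interfaces; the names `aug_per_step` and `update_every`, and the buffer handling, are illustrative placeholders for this sketch, not the authors' implementation.

```python
# Hedged sketch of an off-policy RL loop with dynamics-invariant data augmentation.
# env, agent, and augment are assumed to expose simplified placeholder interfaces.
import random
from collections import deque

def train(env, agent, augment, total_steps=600_000,
          buffer_size=1_000_000, batch_size=256,
          update_every=2, aug_per_step=1):
    """Store observed and augmented transitions in separate replay buffers
    and update the agent from batches drawn from both."""
    observed_buffer = deque(maxlen=buffer_size)   # real environment transitions
    augmented_buffer = deque(maxlen=buffer_size)  # dynamics-invariant augmented copies

    s = env.reset()
    for t in range(total_steps):
        a = agent.act(s)
        s_next, r, done = env.step(a)
        observed_buffer.append((s, a, r, s_next, done))

        # aug_per_step controls how many augmented transitions are generated per
        # environment step; together with update_every it determines the augmented
        # replay ratio discussed in the Research Type row above.
        for _ in range(aug_per_step):
            augmented_buffer.append(augment(s, a, r, s_next, done))

        # One gradient update every update_every steps; update_every=2 matches the
        # observed replay ratio of 0.5 reported in the Experiment Setup row.
        if t % update_every == 0 and len(observed_buffer) >= batch_size:
            obs_batch = random.sample(list(observed_buffer), batch_size)
            aug_batch = random.sample(list(augmented_buffer),
                                      min(batch_size, len(augmented_buffer)))
            agent.update(obs_batch + aug_batch)

        s = env.reset() if done else s_next
```
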
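The Experiment Setup row above flattens the paper's Table 2. The same values are restated here as a plain Python dictionary for readability; the keys are descriptive labels chosen for this sketch, not identifiers from the authors' code.

```python
# Default hyperparameters for the Panda tasks, transcribed from the paper's Table 2.
# Key names are illustrative; values follow the Experiment Setup row above.
PANDA_DEFAULTS = {
    "episode_length": 50,                       # at most 50 timesteps per episode
    "evaluation_frequency": 10_000,             # timesteps between evaluations
    "num_evaluation_episodes": 80,
    "env_interactions": {"Push": 600_000, "Slide": 1_000_000,
                         "Flip": 1_000_000, "PickAndPlace": 1_500_000},
    "random_action_probability": 0.3,
    "gaussian_action_noise_scale": 0.2,
    "random_actions_before_learning": 1000,
    "observed_replay_buffer_size": 1_000_000,   # 10^6 (default)
    "augmented_replay_buffer_size": 1_000_000,  # 10^6 (default)
    "batch_size": {"Push": 256, "Slide": 256, "Flip": 512, "PickAndPlace": 512},
    "update_frequency": 2,                      # every 2 timesteps (observed replay ratio 0.5)
    "network_hidden_layers": (256, 256, 256),   # multi-layer perceptron
    "optimizer": "Adam",
    "learning_rate": 0.001,
    "polyak_averaging_tau": 0.95,
}
```

Per the table, the update frequency of one gradient update every two environment steps is what yields the stated observed replay ratio of 0.5.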