Understanding when Dynamics-Invariant Data Augmentations Benefit Model-free Reinforcement Learning Updates
Authors: Nicholas Corrado, Josiah P. Hanna
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimentally, we isolate three relevant aspects of DA: state-action coverage, reward density, and the number of augmented transitions generated per update (the augmented replay ratio). From our experiments, we draw two conclusions: (1) increasing state-action coverage often has a much greater impact on data efficiency than increasing reward density, and (2) decreasing the augmented replay ratio substantially improves data efficiency. |
| Researcher Affiliation | Academia | Nicholas E. Corrado, Department of Computer Sciences, University of Wisconsin-Madison, ncorrado@wisc.edu; Josiah P. Hanna, Department of Computer Sciences, University of Wisconsin-Madison, jphanna@cs.wisc.edu |
| Pseudocode | Yes | Algorithm 1: Off-Policy RL with Data Augmentation (a hedged sketch follows the table) |
| Open Source Code | Yes | Code available at https://github.com/Badger-RL/Understanding Data Augmentation For RL |
| Open Datasets | Yes | We focus our experiments on four sparse-reward, continuous action panda-gym tasks (Gallouédec et al., 2021): PandaPush-v3, PandaSlide-v3, PandaPickAndPlace-v3, and PandaFlip-v3 (Fig. 2) |
| Dataset Splits | No | The paper mentions evaluating over a certain number of seeds and discussing how batch sizes are scaled, but it does not provide specific percentages or counts for training, validation, and test dataset splits, nor does it describe cross-validation setups. |
| Hardware Specification | Yes | We ran all experiments on a compute cluster using a mix of CPU-only and GPU jobs. This cluster contains a mix of Tesla P100-PCIE, GeForce RTX 2080 Ti, and A100-SXM4 GPUs. |
| Software Dependencies | No | The paper mentions using 'Stable Baselines3', 'DDPG', 'TD3', and 'Adam' with citations, but it does not specify version numbers for any of these software libraries or frameworks, which are necessary for full reproducibility. |
| Experiment Setup | Yes | Table 2: Default hyperparameters used in all Panda tasks. Episode length: at most 50 timesteps; Evaluation frequency: 10,000 timesteps; Number of evaluation episodes: 80; Number of environment interactions: 600K (Push), 1M (Slide, Flip), 1.5M (Pick And Place); Random action probability: 0.3; Gaussian action noise scale: 0.2; Number of random actions before learning: 1000; Observed replay buffer size (default): 10^6; Augmented replay buffer size (default): 10^6; Batch size (default): 256 (Push, Slide), 512 (Flip, Pick And Place); Update frequency: every 2 timesteps (observed replay ratio of 0.5); Network: multi-layer perceptron with hidden layers (256, 256, 256); Optimizer: Adam (Kingma and Ba, 2014); Learning rate: 0.001; Polyak averaging coefficient (τ): 0.95 (restated as a config sketch after the table) |
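
The Pseudocode row above cites Algorithm 1 (Off-Policy RL with Data Augmentation). Below is a minimal Python sketch of that style of training loop, assuming simplified `env`, `agent`, and `augment` interfaces; the names `aug_per_step` and `update_every`, and the buffer handling, are illustrative placeholders for this sketch, not the authors' implementation.

```python
# Hedged sketch of an off-policy RL loop with dynamics-invariant data augmentation.
# env, agent, and augment are assumed to expose simplified placeholder interfaces.
import random
from collections import deque

def train(env, agent, augment, total_steps=600_000,
          buffer_size=1_000_000, batch_size=256,
          update_every=2, aug_per_step=1):
    """Store observed and augmented transitions in separate replay buffers
    and update the agent from batches drawn from both."""
    observed_buffer = deque(maxlen=buffer_size)   # real environment transitions
    augmented_buffer = deque(maxlen=buffer_size)  # dynamics-invariant augmented copies

    s = env.reset()
    for t in range(total_steps):
        a = agent.act(s)
        s_next, r, done = env.step(a)
        observed_buffer.append((s, a, r, s_next, done))

        # aug_per_step controls how many augmented transitions are generated per
        # environment step; together with update_every it determines the augmented
        # replay ratio discussed in the Research Type row above.
        for _ in range(aug_per_step):
            augmented_buffer.append(augment(s, a, r, s_next, done))

        # One gradient update every update_every steps; update_every=2 matches the
        # observed replay ratio of 0.5 reported in the Experiment Setup row.
        if t % update_every == 0 and len(observed_buffer) >= batch_size:
            obs_batch = random.sample(list(observed_buffer), batch_size)
            aug_batch = random.sample(list(augmented_buffer),
                                      min(batch_size, len(augmented_buffer)))
            agent.update(obs_batch + aug_batch)

        s = env.reset() if done else s_next
```
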
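The Experiment Setup row above flattens the paper's Table 2. The same values are restated here as a plain Python dictionary for readability; the keys are descriptive labels chosen for this sketch, not identifiers from the authors' code.

```python
# Default hyperparameters for the Panda tasks, transcribed from the paper's Table 2.
# Key names are illustrative; values follow the Experiment Setup row above.
PANDA_DEFAULTS = {
    "episode_length": 50,                       # at most 50 timesteps per episode
    "evaluation_frequency": 10_000,             # timesteps between evaluations
    "num_evaluation_episodes": 80,
    "env_interactions": {"Push": 600_000, "Slide": 1_000_000,
                         "Flip": 1_000_000, "PickAndPlace": 1_500_000},
    "random_action_probability": 0.3,
    "gaussian_action_noise_scale": 0.2,
    "random_actions_before_learning": 1000,
    "observed_replay_buffer_size": 1_000_000,   # 10^6 (default)
    "augmented_replay_buffer_size": 1_000_000,  # 10^6 (default)
    "batch_size": {"Push": 256, "Slide": 256, "Flip": 512, "PickAndPlace": 512},
    "update_frequency": 2,                      # every 2 timesteps (observed replay ratio 0.5)
    "network_hidden_layers": (256, 256, 256),   # multi-layer perceptron
    "optimizer": "Adam",
    "learning_rate": 0.001,
    "polyak_averaging_tau": 0.95,
}
```

Per the table, the update frequency of one gradient update every two environment steps is what yields the stated observed replay ratio of 0.5.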