On Many-Actions Policy Gradient
Authors: Michal Nauman, Marek Cygan
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate our approach and show empirically that using MBMA alongside PPO (Schulman et al., 2017) yields better sample efficiency and higher reward sums on a variety of continuous action environments as compared to many-actions, model-based and model-free PPO baselines. We compare the performance of agents on 14 DMC tasks (Tassa et al., 2018) of varying difficulty for 1M environment steps and 15 seeds. During this training, we measure agent performance, as well as bias and variance of policy gradients. |
| Researcher Affiliation | Collaboration | (1) Informatics Institute, University of Warsaw; (2) Ideas National Centre for Research and Development; (3) Nomagic. Correspondence to: Michal Nauman <nauman.mic@gmail.com>. |
| Pseudocode | Yes | Algorithm 1: MBPO / MBMA with PPO policy (an illustrative many-actions sketch is given after the table). |
| Open Source Code | Yes | We release the code used for experiments under the following address: https://github.com/naumix/On-Many-Actions-Policy-Gradient. |
| Open Datasets | Yes | We compare the performance of agents on 14 DMC tasks (Tassa et al., 2018) of varying difficulty for 1M environment steps and 15 seeds. During this training, we measure agent performance, as well as bias and variance of policy gradients. (See the DMC loading sketch after the table.) |
| Dataset Splits | No | The paper does not explicitly provide details about training/validation/test dataset splits in terms of percentages or counts. It mentions training on environments and evaluation, but no distinct validation split. |
| Hardware Specification | No | The experiments were performed using the Entropy cluster funded by NVIDIA, Intel, the Polish National Science Center grant UMO-2017/26/E/ST6/00622, and ERC Starting Grant TOTAL. This mentions vendors but not specific hardware models (e.g., GPU/CPU models, memory details). |
| Software Dependencies | No | The paper mentions basing implementations on the PPO codebase from CleanRL (Huang et al., 2022b) and using Adam optimizers. However, it does not provide specific version numbers for these software components or any other libraries. |
| Experiment Setup | Yes | Below, we provide a detailed list of hyperparameter settings used to generate results presented in Table 3. (The Appendix B.2 hyperparameter table lists: action repeat; actor, critic, dynamics, and Q-net optimizers; learning rates; epsilons; hidden layer sizes; λ; discount rate; batch size; minibatch size; PPO epochs; dynamics buffer size; dynamics batch size; number of simulated actions per state; number of simulated states per state; simulation horizon; clip coefficient; maximum gradient norm; value coefficient. A configuration skeleton with these fields is sketched after the table.) |
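
The paper's Algorithm 1 is not reproduced in this report. As a rough orientation only, the snippet below sketches the generic many-actions idea it builds on: averaging the score-function term over several actions sampled per visited state to reduce gradient variance. The `policy.sample`, `policy.log_prob`, and `advantage_fn` interfaces are assumptions for illustration, not the released MBMA/PPO code.

```python
import torch


def many_actions_pg_loss(policy, advantage_fn, states, n_actions=4):
    """Illustrative many-actions policy-gradient loss (not the paper's MBMA code).

    For each visited state, several actions are sampled and their
    score-function terms are averaged, which lowers the variance of the
    gradient estimate compared to using only the single rollout action.
    """
    per_sample_losses = []
    for _ in range(n_actions):
        actions = policy.sample(states)                # hypothetical: draw fresh actions per state
        log_probs = policy.log_prob(states, actions)   # hypothetical: log pi(a | s)
        with torch.no_grad():
            adv = advantage_fn(states, actions)        # hypothetical: advantage estimate (e.g. from a learned model)
        per_sample_losses.append(-(log_probs * adv).mean())
    # Average over the sampled action sets.
    return torch.stack(per_sample_losses).mean()
```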
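The DMC tasks cited as the benchmark are distributed through the `dm_control` package. A minimal interaction loop looks like the following; the domain/task names and the random policy are illustrative, not the paper's exact task list or agent.

```python
import numpy as np
from dm_control import suite

# Load one DeepMind Control Suite task (domain/task names are examples).
env = suite.load(domain_name="cheetah", task_name="run")
action_spec = env.action_spec()

time_step = env.reset()
episode_return = 0.0
while not time_step.last():
    # Random policy, only to exercise the environment interface.
    action = np.random.uniform(action_spec.minimum, action_spec.maximum,
                               size=action_spec.shape)
    time_step = env.step(action)
    episode_return += time_step.reward or 0.0

print("episode return:", episode_return)
```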
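For orientation, a configuration skeleton covering the hyperparameter names listed in Appendix B.2 might look as follows. The field names mirror the appendix table; the values are generic PPO-style placeholders, not the paper's settings.

```python
from dataclasses import dataclass


@dataclass
class MBMAPPOConfig:
    """Hypothetical config skeleton mirroring the Appendix B.2 hyperparameter names.

    All values below are illustrative placeholders, not the paper's settings.
    """
    action_repeat: int = 1
    actor_lr: float = 3e-4           # actor optimizer learning rate (Adam assumed)
    critic_lr: float = 3e-4          # critic optimizer learning rate
    dynamics_lr: float = 1e-3        # dynamics-model optimizer learning rate
    q_net_lr: float = 3e-4           # Q-network optimizer learning rate
    adam_epsilon: float = 1e-8
    hidden_layer_size: int = 256
    gae_lambda: float = 0.95         # λ for advantage estimation
    discount: float = 0.99
    batch_size: int = 2048
    minibatch_size: int = 64
    ppo_epochs: int = 10
    dynamics_buffer_size: int = 100_000
    dynamics_batch_size: int = 256
    n_simulated_actions_per_state: int = 4
    n_simulated_states_per_state: int = 1
    simulation_horizon: int = 3
    clip_coefficient: float = 0.2
    max_grad_norm: float = 0.5
    value_coefficient: float = 0.5
```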