On Many-Actions Policy Gradient

Authors: Michal Nauman, Marek Cygan

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We validate our approach and show empirically that using MBMA alongside PPO (Schulman et al., 2017) yields better sample efficiency and higher reward sums on a variety of continuous-action environments compared to many-actions, model-based, and model-free PPO baselines. We compare the performance of agents on 14 DMC tasks (Tassa et al., 2018) of varying difficulty for 1M environment steps and 15 seeds. During this training, we measure agent performance, as well as the bias and variance of policy gradients.
Researcher Affiliation | Collaboration | (1) Informatics Institute, University of Warsaw; (2) Ideas National Centre for Research and Development; (3) Nomagic. Correspondence to: Michal Nauman <nauman.mic@gmail.com>.
Pseudocode | Yes | Algorithm 1: MBPO / MBMA with PPO policy.
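The pseudocode itself is not reproduced in this report. As a rough illustration of the many-actions idea it describes, the sketch below samples several actions at each visited state, scores each with a short rollout through a learned dynamics/reward model, and averages the resulting score-function terms. All callables and parameter names (dynamics_model, reward_model, value_fn, n_action_samples, horizon) are hypothetical stand-ins, not the authors' implementation, and the paper couples the estimator with PPO's clipped objective rather than this plain surrogate.

```python
# Hedged sketch of a many-actions policy-gradient term: sample several actions
# per visited state and score each with a short model-based rollout.
import torch

def many_actions_pg_loss(policy, dynamics_model, reward_model, value_fn,
                         states, n_action_samples=4, horizon=3, gamma=0.99):
    """Average REINFORCE-style terms over several model-scored actions per state."""
    losses = []
    for _ in range(n_action_samples):
        dist = policy(states)                      # action distribution per state
        action = dist.sample()
        log_prob = dist.log_prob(action)           # assumed joint log-density per state
        # Model-based rollout to estimate Q(s, a) for the sampled action.
        ret = reward_model(states, action)
        s, discount = dynamics_model(states, action), gamma
        for _ in range(horizon - 1):
            a = policy(s).sample()
            ret = ret + discount * reward_model(s, a)
            s = dynamics_model(s, a)
            discount *= gamma
        ret = ret + discount * value_fn(s)          # bootstrap with the critic
        advantage = (ret - value_fn(states)).detach()
        losses.append(-(log_prob * advantage))      # score-function surrogate
    return torch.stack(losses).mean()               # average over the sampled actions
```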
Open Source Code | Yes | We release the code used for experiments under the following address: https://github.com/naumix/On-Many-Actions-Policy-Gradient.
Open Datasets | Yes | We compare the performance of agents on 14 DMC tasks (Tassa et al., 2018) of varying difficulty for 1M environment steps and 15 seeds. During this training, we measure agent performance, as well as the bias and variance of policy gradients.
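The DMC tasks are available through the open-source dm_control package; below is a minimal sketch of loading one task with a fixed seed. The paper's specific 14-task list and evaluation schedule are not reproduced here, and "cheetah/run" is only an illustrative choice.

```python
# Minimal sketch: load a DeepMind Control Suite task with a fixed seed.
from dm_control import suite

def make_env(domain="cheetah", task="run", seed=0):
    return suite.load(domain_name=domain, task_name=task,
                      task_kwargs={"random": seed})

env = make_env(seed=3)
time_step = env.reset()
print(sorted(time_step.observation.keys()))  # task-specific observation dict
```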
Dataset Splits | No | The paper does not explicitly provide training/validation/test dataset splits as percentages or counts. It mentions training on environments and evaluation, but no distinct validation split.
Hardware Specification | No | The experiments were performed using the Entropy cluster funded by NVIDIA, Intel, the Polish National Science Center grant UMO-2017/26/E/ST6/00622, and the ERC Starting Grant TOTAL. This names funders and vendors but not specific hardware models (e.g., GPU/CPU models or memory).
Software Dependencies | No | The paper mentions basing its implementation on the PPO codebase from CleanRL (Huang et al., 2022b) and using Adam optimizers, but it does not provide version numbers for these software components or any other libraries.
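Given the missing version numbers, one hedged way to pin down the environment when re-running the released code is to log the versions actually installed; the package names below are assumptions, not taken from the paper.

```python
# Record installed package versions for reproducibility notes.
from importlib.metadata import PackageNotFoundError, version

for pkg in ("torch", "gym", "dm-control", "numpy"):
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")
```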
Experiment Setup | Yes | Below, we provide a detailed list of hyperparameter settings used to generate the results presented in Table 3. (The Appendix B.2 hyperparameter table lists: action repeat; actor, critic, dynamics, and Q-net optimizers; learning rates; epsilons; hidden layer sizes; λ; discount rate; batch size; minibatch size; PPO epochs; dynamics buffer size; dynamics batch size; number of simulated actions per state; number of simulated states per state; simulation horizon; clip coefficient; maximum gradient norm; value coefficient.)
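For orientation, here is a sketch of a config container mirroring the hyperparameter names listed above. The concrete values appear in Appendix B.2 of the paper and are deliberately not filled in; the field names and types are this report's assumptions, not the authors' code.

```python
# Config fields mirroring the Appendix B.2 hyperparameter table (values omitted).
from dataclasses import dataclass

@dataclass
class MBMAConfig:
    action_repeat: int
    actor_optimizer: str              # the paper uses Adam for all networks
    critic_optimizer: str
    dynamics_optimizer: str
    q_net_optimizer: str
    learning_rate: float              # the table lists per-network learning rates
    adam_epsilon: float               # and epsilons; collapsed here for brevity
    hidden_layer_size: int
    gae_lambda: float
    discount_rate: float
    batch_size: int
    minibatch_size: int
    ppo_epochs: int
    dynamics_buffer_size: int
    dynamics_batch_size: int
    n_simulated_actions_per_state: int
    n_simulated_states_per_state: int
    simulation_horizon: int
    clip_coefficient: float
    max_grad_norm: float
    value_coefficient: float
```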