On Many-Actions Policy Gradient
Authors: Michal Nauman, Marek Cygan
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate our approach and show empirically that using MBMA alongside PPO (Schulman et al., 2017) yields better sample efficiency and higher reward sums on a variety of continuous action environments as compared to many-actions, model-based and model-free PPO baselines. We compare the performance of agents on 14 DMC tasks (Tassa et al., 2018) of varying difficulty for 1M environment steps and 15 seeds. During this training, we measure agent performance, as well as bias and variance of policy gradients. |
| Researcher Affiliation | Collaboration | (1) Informatics Institute, University of Warsaw; (2) Ideas National Centre for Research and Development; (3) Nomagic. Correspondence to: Michal Nauman <nauman.mic@gmail.com>. |
| Pseudocode | Yes | Algorithm 1: MBPO / MBMA with PPO policy (an illustrative many-actions sketch is given after the table). |
| Open Source Code | Yes | We release the code used for experiments under the following address: https://github.com/naumix/On-Many-Actions-Policy-Gradient. |
| Open Datasets | Yes | We compare the performance of agents on 14 DMC tasks (Tassa et al., 2018) of varying difficulty for 1M environment steps and 15 seeds. During this training, we measure agent performance, as well as bias and variance of policy gradients. (See the DMC loading sketch after the table.) |
| Dataset Splits | No | The paper does not explicitly provide details about training/validation/test dataset splits in terms of percentages or counts. It mentions training on environments and evaluation, but no distinct validation split. |
| Hardware Specification | No | The experiments were performed using the Entropy cluster funded by NVIDIA, Intel, the Polish National Science Center grant UMO-2017/26/E/ST6/00622, and ERC Starting Grant TOTAL. This mentions vendors but not specific hardware models (e.g., GPU/CPU models, memory details). |
| Software Dependencies | No | The paper mentions basing implementations on the PPO codebase from CleanRL (Huang et al., 2022b) and using Adam optimizers. However, it does not provide specific version numbers for these software components or any other libraries. |
| Experiment Setup | Yes | Below, we provide a detailed list of hyperparameter settings used to generate results presented in Table 3. (The Appendix B.2 hyperparameter table lists: action repeat; actor, critic, dynamics, and Q-net optimizers; learning rates; epsilons; hidden layer sizes; λ; discount rate; batch size; minibatch size; PPO epochs; dynamics buffer size; dynamics batch size; number of simulated actions per state; number of simulated states per state; simulation horizon; clip coefficient; maximum gradient norm; value coefficient. A configuration skeleton with these fields is sketched after the table.) |
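
The paper's Algorithm 1 is not reproduced in this report. As a rough orientation only, the snippet below sketches the generic many-actions idea it builds on: averaging the score-function term over several actions sampled per visited state to reduce gradient variance. The `policy.sample`, `policy.log_prob`, and `advantage_fn` interfaces are assumptions for illustration, not the released MBMA/PPO code.

```python
import torch


def many_actions_pg_loss(policy, advantage_fn, states, n_actions=4):
    """Illustrative many-actions policy-gradient loss (not the paper's MBMA code).

    For each visited state, several actions are sampled and their
    score-function terms are averaged, which lowers the variance of the
    gradient estimate compared to using only the single rollout action.
    """
    per_sample_losses = []
    for _ in range(n_actions):
        actions = policy.sample(states)                # hypothetical: draw fresh actions per state
        log_probs = policy.log_prob(states, actions)   # hypothetical: log pi(a | s)
        with torch.no_grad():
            adv = advantage_fn(states, actions)        # hypothetical: advantage estimate (e.g. from a learned model)
        per_sample_losses.append(-(log_probs * adv).mean())
    # Average over the sampled action sets.
    return torch.stack(per_sample_losses).mean()
```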
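The DMC tasks cited as the benchmark are distributed through the `dm_control` package. A minimal interaction loop looks like the following; the domain/task names and the random policy are illustrative, not the paper's exact task list or agent.

```python
import numpy as np
from dm_control import suite

# Load one DeepMind Control Suite task (domain/task names are examples).
env = suite.load(domain_name="cheetah", task_name="run")
action_spec = env.action_spec()

time_step = env.reset()
episode_return = 0.0
while not time_step.last():
    # Random policy, only to exercise the environment interface.
    action = np.random.uniform(action_spec.minimum, action_spec.maximum,
                               size=action_spec.shape)
    time_step = env.step(action)
    episode_return += time_step.reward or 0.0

print("episode return:", episode_return)
```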
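For orientation, a configuration skeleton covering the hyperparameter names listed in Appendix B.2 might look as follows. The field names mirror the appendix table; the values are generic PPO-style placeholders, not the paper's settings.

```python
from dataclasses import dataclass


@dataclass
class MBMAPPOConfig:
    """Hypothetical config skeleton mirroring the Appendix B.2 hyperparameter names.

    All values below are illustrative placeholders, not the paper's settings.
    """
    action_repeat: int = 1
    actor_lr: float = 3e-4           # actor optimizer learning rate (Adam assumed)
    critic_lr: float = 3e-4          # critic optimizer learning rate
    dynamics_lr: float = 1e-3        # dynamics-model optimizer learning rate
    q_net_lr: float = 3e-4           # Q-network optimizer learning rate
    adam_epsilon: float = 1e-8
    hidden_layer_size: int = 256
    gae_lambda: float = 0.95         # λ for advantage estimation
    discount: float = 0.99
    batch_size: int = 2048
    minibatch_size: int = 64
    ppo_epochs: int = 10
    dynamics_buffer_size: int = 100_000
    dynamics_batch_size: int = 256
    n_simulated_actions_per_state: int = 4
    n_simulated_states_per_state: int = 1
    simulation_horizon: int = 3
    clip_coefficient: float = 0.2
    max_grad_norm: float = 0.5
    value_coefficient: float = 0.5
```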