How to Learn a Useful Critic? Model-based Action-Gradient-Estimator Policy Optimization

Authors: Pierluca D'Oro, Wojciech Jaśkowski

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | On a set of MuJoCo continuous-control tasks, we demonstrate the efficiency of the algorithm in comparison to model-free and model-based state-of-the-art baselines.
Researcher Affiliation | Collaboration | Pierluca D'Oro, MILA, Université de Montréal, pierluca.doro@mila.quebec; Wojciech Jaśkowski, NNAISENSE, wojciech@nnaisense.com
Pseudocode | Yes | Algorithm 1 Model-based Action-Gradient-Estimator Policy Optimization (MAGE) (a critic-update sketch follows the table)
Open Source Code | Yes | The PyTorch [34] implementation, based on [46], is available at https://github.com/nnaisense/MAGE.
Open Datasets | Yes | We employ environments from OpenAI Gym [6] and the MuJoCo physics simulator [55] as continuous control benchmarks.
Dataset Splits | No | The paper uses continuous control benchmarks such as OpenAI Gym and MuJoCo, where data is generated through environment interaction and stored in a replay buffer (a buffer sketch follows the table). It does not provide explicit train/validation/test dataset splits with percentages or sample counts, as would be expected for static datasets.
Hardware Specification | No | The paper does not provide specific hardware details such as exact GPU/CPU models, processor types, or memory amounts used for running the experiments.
Software Dependencies | No | The paper mentions using 'PyTorch [34]' for the implementation but does not specify version numbers for PyTorch or any other software dependencies.
Experiment Setup | Yes | We employ a single value of λ = 0.2 for all the environments, since we found MAGE to be reasonably robust to the choice of this hyperparameter (see Appendix B). In order to reduce the impact of model bias, MAGE leverages an ensemble of 8 probabilistic Gaussian-output models, trained by maximum likelihood estimation. After each step of environment interaction, we add the collected transition in the replay buffer B, train the approximate model pω, and update critic and actor 10 times. (A dynamics-model sketch follows the table.)
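
The Pseudocode row points to Algorithm 1 (MAGE), whose core step trains the critic so that the action-gradient of the temporal-difference error stays small, using the differentiability of a learned dynamics model to propagate gradients through the TD target. The following is a minimal, hypothetical PyTorch sketch of that critic update; the network sizes, the single deterministic model standing in for the 8-model ensemble, and all names are illustrative assumptions rather than the authors' implementation (which is linked in the Open Source Code row).

```python
import torch
import torch.nn as nn

state_dim, action_dim, gamma, lam = 3, 1, 0.99, 0.2  # lambda = 0.2 as in the Experiment Setup row

# Illustrative stand-ins for the actor, critic, target critic, and learned model p_omega.
actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, action_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))
target_critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))
# Single deterministic model predicting (next_state, reward); MAGE itself uses an ensemble
# of 8 probabilistic Gaussian-output models.
model = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, state_dim + 1))
critic_opt = torch.optim.Adam(critic.parameters(), lr=3e-4)

def mage_critic_update(states):
    """One MAGE-style critic step on a batch of states sampled from the replay buffer."""
    actions = actor(states)
    # Imagined transition through the learned model keeps the TD target differentiable
    # with respect to the action.
    prediction = model(torch.cat([states, actions], dim=-1))
    next_states, rewards = prediction[:, :-1], prediction[:, -1:]
    next_actions = actor(next_states)
    targets = rewards + gamma * target_critic(torch.cat([next_states, next_actions], dim=-1))
    q_values = critic(torch.cat([states, actions], dim=-1))
    td_error = targets - q_values
    # Action-gradient of the TD error, kept in the graph so the critic can be trained on it.
    grad_a = torch.autograd.grad(td_error.sum(), actions, create_graph=True)[0]
    # Penalize the norm of the action-gradient plus a lambda-weighted TD-error term.
    loss = grad_a.norm(dim=-1).mean() + lam * td_error.abs().mean()
    critic_opt.zero_grad()
    loss.backward()
    critic_opt.step()
    return loss.item()

print(mage_critic_update(torch.randn(32, state_dim)))
```

The motivation is that a deterministic policy-gradient actor uses the critic only through its action-gradient, so shaping that gradient directly is what the critic loss targets; the actor update itself follows the usual deterministic policy-gradient step.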
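
For the Dataset Splits row: data is collected online and stored in a replay buffer rather than being divided into fixed splits. The sketch below is a generic, assumed buffer implementation (capacity, transition layout, and names are not from the paper), meant only to make the interaction-driven data regime concrete.

```python
import random

class ReplayBuffer:
    """Stores transitions collected online; there is no train/validation/test split."""

    def __init__(self, capacity=1_000_000):
        self.capacity = capacity
        self.storage = []
        self.position = 0  # index of the slot to overwrite once the buffer is full

    def add(self, state, action, reward, next_state, done):
        transition = (state, action, reward, next_state, done)
        if len(self.storage) < self.capacity:
            self.storage.append(transition)
        else:
            self.storage[self.position] = transition
        self.position = (self.position + 1) % self.capacity

    def sample(self, batch_size):
        batch = random.sample(self.storage, batch_size)
        return tuple(zip(*batch))  # columns: states, actions, rewards, next_states, dones

# Usage: one transition is added after every environment step, and training minibatches
# are drawn uniformly from the buffer.
buffer = ReplayBuffer()
buffer.add([0.0, 0.0, 0.0], [0.1], 1.0, [0.0, 0.0, 0.1], False)
states, actions, rewards, next_states, dones = buffer.sample(1)
```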
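
The Experiment Setup row describes an ensemble of 8 probabilistic Gaussian-output dynamics models trained by maximum likelihood to reduce model bias. Below is a minimal sketch of one such ensemble member and its negative-log-likelihood loss; the architecture, diagonal covariance, and log-std clamping are assumptions for illustration, not the authors' exact model.

```python
import torch
import torch.nn as nn

class GaussianDynamics(nn.Module):
    """Predicts a diagonal Gaussian over the next state given (state, action)."""

    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim + action_dim, hidden), nn.ReLU())
        self.mean = nn.Linear(hidden, state_dim)
        self.log_std = nn.Linear(hidden, state_dim)

    def forward(self, state, action):
        h = self.body(torch.cat([state, action], dim=-1))
        return self.mean(h), self.log_std(h).clamp(-5.0, 2.0)  # clamp for numerical stability

def gaussian_nll(model, state, action, next_state):
    """Negative log-likelihood of observed transitions (up to a constant)."""
    mean, log_std = model(state, action)
    var = (2.0 * log_std).exp()
    return (((next_state - mean) ** 2) / (2.0 * var) + log_std).sum(dim=-1).mean()

# An ensemble of 8 independently initialized members, each with its own optimizer.
ensemble = [GaussianDynamics(3, 1) for _ in range(8)]
member = ensemble[0]
opt = torch.optim.Adam(member.parameters(), lr=1e-3)
s, a, s_next = torch.randn(32, 3), torch.randn(32, 1), torch.randn(32, 3)
loss = gaussian_nll(member, s, a, s_next)
opt.zero_grad(); loss.backward(); opt.step()
```

Minimizing this negative log-likelihood is maximum likelihood estimation; per the quoted setup, the model is retrained after every environment step, after which the critic and actor are updated 10 times with λ = 0.2.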