How to Learn a Useful Critic? Model-based Action-Gradient-Estimator Policy Optimization

Authors: Pierluca D'Oro, Wojciech Jaśkowski

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | On a set of MuJoCo continuous-control tasks, we demonstrate the efficiency of the algorithm in comparison to model-free and model-based state-of-the-art baselines.
Researcher Affiliation | Collaboration | Pierluca D'Oro, MILA, Université de Montréal, pierluca.doro@mila.quebec; Wojciech Jaśkowski, NNAISENSE, wojciech@nnaisense.com
Pseudocode | Yes | Algorithm 1 Model-based Action-Gradient-Estimator Policy Optimization (MAGE) (a critic-update sketch follows the table)
Open Source Code | Yes | The PyTorch [34] implementation, based on [46], is available at https://github.com/nnaisense/MAGE.
Open Datasets | Yes | We employ environments from OpenAI Gym [6] and the MuJoCo physics simulator [55] as continuous control benchmarks.
Dataset Splits | No | The paper uses continuous control benchmarks such as OpenAI Gym and MuJoCo, where data is generated through environment interaction and stored in a replay buffer (a buffer sketch follows the table). It does not provide explicit train/validation/test dataset splits with percentages or sample counts, as would be expected for static datasets.
Hardware Specification | No | The paper does not provide specific hardware details such as exact GPU/CPU models, processor types, or memory amounts used for running the experiments.
Software Dependencies | No | The paper mentions using 'PyTorch [34]' for the implementation but does not specify version numbers for PyTorch or any other software dependencies.
Experiment Setup | Yes | We employ a single value of λ = 0.2 for all the environments, since we found MAGE to be reasonably robust to the choice of this hyperparameter (see Appendix B). In order to reduce the impact of model bias, MAGE leverages an ensemble of 8 probabilistic Gaussian-output models, trained by maximum likelihood estimation. After each step of environment interaction, we add the collected transition in the replay buffer B, train the approximate model pω, and update critic and actor 10 times. (A dynamics-model sketch follows the table.)
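
The Pseudocode row points to Algorithm 1 (MAGE), whose core step trains the critic so that the action-gradient of the temporal-difference error stays small, using the differentiability of a learned dynamics model to propagate gradients through the TD target. The following is a minimal, hypothetical PyTorch sketch of that critic update; the network sizes, the single deterministic model standing in for the 8-model ensemble, and all names are illustrative assumptions rather than the authors' implementation (which is linked in the Open Source Code row).

```python
import torch
import torch.nn as nn

state_dim, action_dim, gamma, lam = 3, 1, 0.99, 0.2  # lambda = 0.2 as in the Experiment Setup row

# Illustrative stand-ins for the actor, critic, target critic, and learned model p_omega.
actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, action_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))
target_critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))
# Single deterministic model predicting (next_state, reward); MAGE itself uses an ensemble
# of 8 probabilistic Gaussian-output models.
model = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, state_dim + 1))
critic_opt = torch.optim.Adam(critic.parameters(), lr=3e-4)

def mage_critic_update(states):
    """One MAGE-style critic step on a batch of states sampled from the replay buffer."""
    actions = actor(states)
    # Imagined transition through the learned model keeps the TD target differentiable
    # with respect to the action.
    prediction = model(torch.cat([states, actions], dim=-1))
    next_states, rewards = prediction[:, :-1], prediction[:, -1:]
    next_actions = actor(next_states)
    targets = rewards + gamma * target_critic(torch.cat([next_states, next_actions], dim=-1))
    q_values = critic(torch.cat([states, actions], dim=-1))
    td_error = targets - q_values
    # Action-gradient of the TD error, kept in the graph so the critic can be trained on it.
    grad_a = torch.autograd.grad(td_error.sum(), actions, create_graph=True)[0]
    # Penalize the norm of the action-gradient plus a lambda-weighted TD-error term.
    loss = grad_a.norm(dim=-1).mean() + lam * td_error.abs().mean()
    critic_opt.zero_grad()
    loss.backward()
    critic_opt.step()
    return loss.item()

print(mage_critic_update(torch.randn(32, state_dim)))
```

The motivation is that a deterministic policy-gradient actor uses the critic only through its action-gradient, so shaping that gradient directly is what the critic loss targets; the actor update itself follows the usual deterministic policy-gradient step.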
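
For the Dataset Splits row: data is collected online and stored in a replay buffer rather than being divided into fixed splits. The sketch below is a generic, assumed buffer implementation (capacity, transition layout, and names are not from the paper), meant only to make the interaction-driven data regime concrete.

```python
import random

class ReplayBuffer:
    """Stores transitions collected online; there is no train/validation/test split."""

    def __init__(self, capacity=1_000_000):
        self.capacity = capacity
        self.storage = []
        self.position = 0  # index of the slot to overwrite once the buffer is full

    def add(self, state, action, reward, next_state, done):
        transition = (state, action, reward, next_state, done)
        if len(self.storage) < self.capacity:
            self.storage.append(transition)
        else:
            self.storage[self.position] = transition
        self.position = (self.position + 1) % self.capacity

    def sample(self, batch_size):
        batch = random.sample(self.storage, batch_size)
        return tuple(zip(*batch))  # columns: states, actions, rewards, next_states, dones

# Usage: one transition is added after every environment step, and training minibatches
# are drawn uniformly from the buffer.
buffer = ReplayBuffer()
buffer.add([0.0, 0.0, 0.0], [0.1], 1.0, [0.0, 0.0, 0.1], False)
states, actions, rewards, next_states, dones = buffer.sample(1)
```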
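
The Experiment Setup row describes an ensemble of 8 probabilistic Gaussian-output dynamics models trained by maximum likelihood to reduce model bias. Below is a minimal sketch of one such ensemble member and its negative-log-likelihood loss; the architecture, diagonal covariance, and log-std clamping are assumptions for illustration, not the authors' exact model.

```python
import torch
import torch.nn as nn

class GaussianDynamics(nn.Module):
    """Predicts a diagonal Gaussian over the next state given (state, action)."""

    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim + action_dim, hidden), nn.ReLU())
        self.mean = nn.Linear(hidden, state_dim)
        self.log_std = nn.Linear(hidden, state_dim)

    def forward(self, state, action):
        h = self.body(torch.cat([state, action], dim=-1))
        return self.mean(h), self.log_std(h).clamp(-5.0, 2.0)  # clamp for numerical stability

def gaussian_nll(model, state, action, next_state):
    """Negative log-likelihood of observed transitions (up to a constant)."""
    mean, log_std = model(state, action)
    var = (2.0 * log_std).exp()
    return (((next_state - mean) ** 2) / (2.0 * var) + log_std).sum(dim=-1).mean()

# An ensemble of 8 independently initialized members, each with its own optimizer.
ensemble = [GaussianDynamics(3, 1) for _ in range(8)]
member = ensemble[0]
opt = torch.optim.Adam(member.parameters(), lr=1e-3)
s, a, s_next = torch.randn(32, 3), torch.randn(32, 1), torch.randn(32, 3)
loss = gaussian_nll(member, s, a, s_next)
opt.zero_grad(); loss.backward(); opt.step()
```

Minimizing this negative log-likelihood is maximum likelihood estimation; per the quoted setup, the model is retrained after every environment step, after which the critic and actor are updated 10 times with λ = 0.2.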