How to Learn a Useful Critic? Model-based Action-Gradient-Estimator Policy Optimization
Authors: Pierluca D'Oro, Wojciech Jaśkowski
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On a set of MuJoCo continuous-control tasks, we demonstrate the efficiency of the algorithm in comparison to model-free and model-based state-of-the-art baselines. |
| Researcher Affiliation | Collaboration | Pierluca D'Oro, MILA, Université de Montréal, pierluca.doro@mila.quebec; Wojciech Jaśkowski, NNAISENSE, wojciech@nnaisense.com |
| Pseudocode | Yes | Algorithm 1 Model-based Action-Gradient-Estimator Policy Optimization (MAGE) |
| Open Source Code | Yes | The PyTorch [34] implementation, based on [46], is available at https://github.com/nnaisense/MAGE. |
| Open Datasets | Yes | We employ environments from OpenAI Gym [6] and the MuJoCo physics simulator [55] as continuous control benchmarks |
| Dataset Splits | No | The paper utilizes continuous control benchmarks like OpenAI Gym and MuJoCo where data is generated through interaction and stored in a replay buffer. It does not provide explicit train/validation/test dataset splits with percentages or sample counts for reproduction, as is common for static datasets. |
| Hardware Specification | No | The paper does not provide specific hardware details such as exact GPU/CPU models, processor types, or memory amounts used for running the experiments. |
| Software Dependencies | No | The paper mentions using 'PyTorch [34]' for implementation but does not specify version numbers for PyTorch or any other software dependencies. |
| Experiment Setup | Yes | We employ a single value of λ = 0.2 for all the environments, since we found MAGE to be reasonably robust to the choice of this hyperparameter (see Appendix B). In order to reduce the impact of model bias, MAGE leverages an ensemble of 8 probabilistic Gaussian-output models, trained by maximum likelihood estimation. After each step of environment interaction, we add the collected transition in the replay buffer B, train the approximate model pω, and update critic and actor 10 times. |
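The "Experiment Setup" row above can be read as a concrete training schedule. Below is a minimal, illustrative sketch of that loop in PyTorch. The hyperparameters (λ = 0.2, an ensemble of 8 Gaussian-output dynamics models, 10 critic/actor updates per environment step, a replay buffer) come from the paper; the network sizes, optimizer settings, the `replay_buffer.sample` API, and the exact form of the critic loss (the action-gradient norm of the model-based TD error plus a λ-weighted TD-error term) are assumptions for illustration, not the authors' reference implementation.

```python
# Sketch of a MAGE-style update schedule; see https://github.com/nnaisense/MAGE
# for the authors' actual implementation. Target networks, variance heads of the
# Gaussian models, and model training by maximum likelihood are omitted for brevity.
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 17, 6          # placeholder dimensions (environment-dependent)
LAMBDA, GAMMA = 0.2, 0.99              # lambda = 0.2 as reported in the paper
ENSEMBLE_SIZE, UPDATES_PER_STEP = 8, 10


def mlp(in_dim, out_dim, hidden=256):
    # Network sizes are assumptions, not taken from the paper.
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))


actor = mlp(STATE_DIM, ACTION_DIM)                     # deterministic policy
critic = mlp(STATE_DIM + ACTION_DIM, 1)                # Q-function
# Ensemble of 8 dynamics models, each predicting mean (next_state, reward).
models = [mlp(STATE_DIM + ACTION_DIM, STATE_DIM + 1) for _ in range(ENSEMBLE_SIZE)]

actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-4)


def critic_loss(state, action):
    """Assumed MAGE-style critic loss: norm of the action-gradient of the
    model-based TD error, regularized by a lambda-weighted TD-error term."""
    action = action.clone().requires_grad_(True)
    model = models[torch.randint(ENSEMBLE_SIZE, (1,)).item()]  # pick one ensemble member
    pred = model(torch.cat([state, action], dim=-1))
    next_state, reward = pred[..., :-1], pred[..., -1:]
    next_q = critic(torch.cat([next_state, actor(next_state)], dim=-1))
    td_error = reward + GAMMA * next_q - critic(torch.cat([state, action], dim=-1))
    # Differentiate the TD error w.r.t. the action through the learned model.
    grad_a = torch.autograd.grad(td_error.sum(), action, create_graph=True)[0]
    return grad_a.norm(dim=-1).mean() + LAMBDA * td_error.abs().mean()


def train_after_step(replay_buffer, batch_size=256):
    """Run 10 critic/actor updates after each environment step, as reported above."""
    for _ in range(UPDATES_PER_STEP):
        state, action = replay_buffer.sample(batch_size)  # hypothetical buffer API

        critic_opt.zero_grad()
        critic_loss(state, action).backward()
        critic_opt.step()

        actor_opt.zero_grad()
        actor_loss = -critic(torch.cat([state, actor(state)], dim=-1)).mean()
        actor_loss.backward()
        actor_opt.step()
```

This sketch only illustrates the reported schedule (one environment step, then model training and 10 critic/actor updates); reproducing the paper's results would additionally require the model's maximum-likelihood training, target networks, and the exact hyperparameters from the released code.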