Robust Reinforcement Learning for Continuous Control with Model Misspecification

Authors: Daniel J. Mankowitz, Nir Levine, Rae Jeong, Abbas Abdolmaleki, Jost Tobias Springenberg, Yuanyuan Shi, Jackie Kay, Todd Hester, Timothy Mann, Martin Riedmiller

ICLR 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We want to emphasize that, while the theoretical contributions are novel, our most significant contribution is that of the extensive experimental analysis we have performed to analyze the robustness performance of our agent. Specifically: (3) We present experimental results in nine Mujoco domains showing that RE-MPO, SRE-MPO and R-MPO, SR-MPO outperform both E-MPO and MPO respectively.
Researcher Affiliation | Industry | Daniel J. Mankowitz, Nir Levine, Rae Jeong, Abbas Abdolmaleki, Jost Tobias Springenberg, Yuanyuan Shi, Jackie Kay, Timothy Mann, Todd Hester, Martin Riedmiller, DeepMind. {dmankowitz, nirlevine, raejeong, aabdolmaleki, springenberg, yyshi, kayj, timothymann, toddhester, riedmiller}@google.com
Pseudocode | Yes | The pseudo code for the R-MPO, RE-MPO and Soft-Robust Entropy-regularized MPO (SRE-MPO) algorithms can be found in Appendix I (Algorithms 1, 2 and 3 respectively).
Open Source Code | No | The paper does not provide a direct link or explicit statement about releasing the source code for the described methodology. The provided link is for performance videos.
Open Datasets | Yes | We now present experiments on nine different continuous control domains (...) from the DeepMind control suite (Tassa et al., 2018).
Dataset Splits | Yes | Both the robust and non-robust agents are evaluated on a test set of three unseen task perturbations. (...) The chosen values of the uncertainty set and evaluation set for each domain can be found in Appendix H.3. Note that it is common practice to manually select the pre-defined uncertainty set and the unseen test environments. (An illustrative held-out evaluation loop is sketched after this table.)
Hardware Specification | No | The paper discusses the simulated environments (e.g., Mujoco domains, Shadow hand) and general settings like 'training run consists of 30k episodes', but does not specify any actual hardware (GPU/CPU models, memory) used to run these simulations or training.
Software Dependencies | No | The paper provides hyperparameter tables (Table 1 and Table 2) that list various settings for the models and training, but it does not specify software dependencies with version numbers (e.g., Python, TensorFlow/PyTorch versions, specific library versions).
Experiment Setup | Yes | Each training run consists of 30k episodes and the experiments are repeated 5 times. (...) Tables 2 and 1 show the hyperparameters used for the MPO and SVG algorithms. All experiments use a feed-forward two layer neural network with 50 neurons to map the current state of the network to the mean and diagonal covariance of the Gaussian policy. (A minimal sketch of this policy head follows the table.)
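
The "Experiment Setup" row above describes the policy as a two-layer feed-forward network with 50 units that maps the state to the mean and diagonal covariance of a Gaussian. The following is a minimal PyTorch sketch of such a policy head; the activation function, the log-std parameterization, and the clamping bounds are assumptions for illustration, not details taken from the paper.

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Sketch of a two-layer (50-unit) Gaussian policy head with diagonal covariance."""

    def __init__(self, state_dim: int, action_dim: int, hidden: int = 50):
        super().__init__()
        # Two hidden layers of 50 units, as described in the experiment setup.
        self.trunk = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.mean = nn.Linear(hidden, action_dim)
        # One log-std per action dimension gives a diagonal covariance.
        self.log_std = nn.Linear(hidden, action_dim)

    def forward(self, state: torch.Tensor) -> torch.distributions.Normal:
        h = self.trunk(state)
        # Clamp the standard deviation to keep it positive and bounded (assumed bounds).
        std = torch.exp(self.log_std(h)).clamp(1e-4, 1.0)
        return torch.distributions.Normal(self.mean(h), std)
```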
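The "Dataset Splits" row notes that agents are trained against a pre-defined uncertainty set of task perturbations and evaluated on three unseen perturbations per domain. The loop below is a hedged sketch of such a held-out evaluation; it assumes a gym-style environment interface, and `make_env`, the perturbation values, and the episode count are illustrative names and choices rather than the paper's code.

```python
import numpy as np

def evaluate_on_perturbations(make_env, policy, test_perturbations, episodes=10):
    """Average return of a fixed policy on each held-out task perturbation."""
    mean_returns = {}
    for p in test_perturbations:
        env = make_env(p)  # build the domain with perturbed dynamics (e.g. unseen torso length)
        returns = []
        for _ in range(episodes):
            obs, done, total = env.reset(), False, 0.0
            while not done:
                obs, reward, done, _ = env.step(policy(obs))
                total += reward
            returns.append(total)
        mean_returns[p] = float(np.mean(returns))
    return mean_returns
```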