Deep Model-Based Reinforcement Learning via Estimated Uncertainty and Conservative Policy Optimization
Authors: Qi Zhou, Houqiang Li, Jie Wang
AAAI 2020, pp. 6941-6948
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show POMBU can outperform existing state-of-the-art policy optimization algorithms in terms of sample efficiency and asymptotic performance. Moreover, the experiments demonstrate the excellent robustness of POMBU compared to previous model-based approaches. In this section, we first evaluate our uncertainty estimation method. Second, we compare POMBU to state-of-the-art methods. Then, we show how the estimated uncertainty works via an ablation study. Finally, we analyze the robustness of our method empirically. |
| Researcher Affiliation | Academia | Qi Zhou, Houqiang Li, Jie Wang, University of Science and Technology of China; zhouqida@mail.ustc.edu.cn, {lihq, jiewangx}@ustc.edu.cn |
| Pseudocode | Yes | Algorithm 1: Uncertainty Estimation for Q-values; Algorithm 2: POMBU |
| Open Source Code | Yes | The source code and appendix of this work are available at https://github.com/MIRALab-USTC/RL-POMBU. |
| Open Datasets | No | The paper uses continuous control tasks in Mujoco environments (Swimmer, Half Cheetah, Ant, Walker2d), from which data is sampled during training. While Mujoco environments are publicly available, the paper does not provide a specific link or citation to a pre-collected, static public dataset used for training, which is typically what 'open dataset' refers to. |
| Dataset Splits | No | The paper mentions 'Different models are trained with different train-validation split' in Section 5 (Model Ensemble), but does not provide specific details about these splits (e.g., percentages, sample counts, or a reference to a standard split) for reproducibility. The data itself is generated on-the-fly from the environment, not from a fixed dataset with pre-defined splits. |
| Hardware Specification | Yes | We conduct all experiments with one GPU Nvidia GTX 2080Ti. |
| Software Dependencies | No | The paper mentions 'Adam' as an optimizer ('optimize the parameter using Adam (Kingma and Ba 2014)') but does not provide specific version numbers for Adam or any other software components, libraries, or frameworks (e.g., TensorFlow, PyTorch, Python). |
| Experiment Setup | Yes | We evaluate POMBU with α = 0.5 and β = 10 for all tasks. Here, we define r̂_θ(s_h, a_h) as max(1 − ϵ, r_θ(s_h, a_h)) if A^h_old(s_h, a_h) < 0, and as min(1 + ϵ, r_θ(s_h, a_h)) if A^h_old(s_h, a_h) > 0, in which ϵ > 0 is a hyperparameter. We use a Gaussian policy whose mean is computed by a feed-forward neural network and whose standard deviation is represented by a vector of parameters. We optimize all parameters by maximizing L_π(θ) via Adam. (A minimal code sketch of this clipped ratio follows the table.) |
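
The piecewise clipping quoted in the "Experiment Setup" row can be illustrated with a minimal sketch. This is not the authors' implementation: the function and variable names (`clipped_ratio`, `surrogate_mean`, `log_prob_new`, `log_prob_old`, `advantage`) and the default ϵ = 0.2 are illustrative assumptions, and the paper's full objective L_π(θ), including the α- and β-weighted uncertainty terms, is not reproduced here.

```python
# Minimal sketch of the clipped probability ratio quoted above.
# Not the authors' code; names and the default epsilon are assumptions.
import numpy as np

def clipped_ratio(log_prob_new, log_prob_old, advantage, epsilon=0.2):
    """Piecewise-clipped ratio r_hat_theta(s_h, a_h).

    ratio = pi_theta(a_h | s_h) / pi_old(a_h | s_h); it is clipped from
    below at 1 - epsilon when the advantage estimate is negative, and
    from above at 1 + epsilon when it is positive, as stated in the paper.
    (A zero advantage falls into the second branch here.)
    """
    ratio = np.exp(np.asarray(log_prob_new) - np.asarray(log_prob_old))
    return np.where(np.asarray(advantage) < 0.0,
                    np.maximum(1.0 - epsilon, ratio),
                    np.minimum(1.0 + epsilon, ratio))

def surrogate_mean(log_prob_new, log_prob_old, advantage, epsilon=0.2):
    """Batch mean of r_hat * advantage (only the clipped-ratio part of a
    surrogate objective; the uncertainty terms are omitted)."""
    r_hat = clipped_ratio(log_prob_new, log_prob_old, advantage, epsilon)
    return float(np.mean(r_hat * np.asarray(advantage)))

# Toy usage with made-up numbers:
# surrogate_mean(np.log([0.5, 0.2]), np.log([0.4, 0.3]), [1.0, -1.0])
```

Clipping the ratio only in the direction that would otherwise inflate the objective keeps each policy update conservative, which mirrors the piecewise definition quoted in the table.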