Deep Model-Based Reinforcement Learning via Estimated Uncertainty and Conservative Policy Optimization

Authors: Qi Zhou, Houqiang Li, Jie Wang

AAAI 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments show POMBU can outperform existing state-of-the-art policy optimization algorithms in terms of sample efficiency and asymptotic performance. Moreover, the experiments demonstrate the excellent robustness of POMBU compared to previous model-based approaches. In this section, we first evaluate our uncertainty estimation method. Second, we compare POMBU to the state of the art. Then, we show how the estimated uncertainty works through an ablation study. Finally, we analyze the robustness of our method empirically.
Researcher Affiliation | Academia | Qi Zhou, Houqiang Li, Jie Wang; University of Science and Technology of China; zhouqida@mail.ustc.edu.cn, {lihq, jiewangx}@ustc.edu.cn
Pseudocode | Yes | Algorithm 1: Uncertainty Estimation for Q-values; Algorithm 2: POMBU. (A simplified ensemble-based sketch of the uncertainty-estimation idea appears after this table.)
Open Source Code | Yes | The source code and appendix of this work are available at https://github.com/MIRALab-USTC/RL-POMBU.
Open Datasets | No | The paper uses continuous control tasks in MuJoCo environments (Swimmer, HalfCheetah, Ant, Walker2d), from which data is sampled during training. While the MuJoCo environments are publicly available, the paper does not provide a specific link or citation to a pre-collected, static public dataset used for training, which is what 'open dataset' typically refers to.
Dataset Splits | No | The paper mentions that 'Different models are trained with different train-validation split' in Section 5 (Model Ensemble), but does not provide specific details about these splits (e.g., percentages, sample counts, or a reference to a standard split) needed for reproducibility. The data itself is generated on the fly from the environment, not drawn from a fixed dataset with pre-defined splits. (A sketch of per-model splits appears after this table.)
Hardware Specification | Yes | We conduct all experiments with one NVIDIA GTX 2080Ti GPU.
Software Dependencies | No | The paper mentions Adam as the optimizer ('optimize the parameter using Adam (Kingma and Ba 2014)') but does not provide version numbers for any software components, libraries, or frameworks (e.g., TensorFlow, PyTorch, Python).
Experiment Setup | Yes | We evaluate POMBU with α = 0.5 and β = 10 for all tasks. Here, we define $\hat{r}_\theta(s_h, a_h)$ as $\max(1 - \epsilon,\, r_\theta(s_h, a_h))$ if $A^h_{\text{old}}(s_h, a_h) < 0$, and $\min(1 + \epsilon,\, r_\theta(s_h, a_h))$ if $A^h_{\text{old}}(s_h, a_h) > 0$, in which $\epsilon > 0$ is a hyperparameter. We use a Gaussian policy whose mean is computed by a feed-forward neural network and whose standard deviation is represented by a vector of parameters. We optimize all parameters by maximizing $L_\pi(\theta)$ via Adam. (See the clipping sketch after this table.)
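Regarding the Pseudocode row: the paper's Algorithm 1 estimates the uncertainty of Q-values using an ensemble of learned dynamics models. The snippet below is a deliberately simplified sketch of that ensemble idea, not the paper's exact algorithm; the function names, the Monte Carlo rollout scheme, and the use of return variance as the uncertainty proxy are illustrative assumptions.

```python
import numpy as np

def estimate_q_uncertainty(models, policy, reward_fn, s0, horizon, gamma=0.99):
    """Monte Carlo Q estimate for state s0 and its spread across the ensemble.

    models    : list of learned dynamics models, each a callable f(s, a) -> next state
    policy    : callable pi(s) -> action
    reward_fn : callable r(s, a) -> scalar reward
    """
    returns = []
    for model in models:
        s, ret, discount = s0, 0.0, 1.0
        for _ in range(horizon):
            a = policy(s)
            ret += discount * reward_fn(s, a)
            s = model(s, a)          # imagined transition under this ensemble member
            discount *= gamma
        returns.append(ret)
    returns = np.asarray(returns)
    # Disagreement between ensemble members serves as a rough uncertainty signal.
    return returns.mean(), returns.var()
```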
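Regarding the Dataset Splits row: the paper only states that each ensemble member is trained with its own train-validation split, without giving sizes. Below is a minimal sketch of how such per-model splits could be generated; the 20% validation ratio and the helper name are assumptions for illustration, not values from the paper.

```python
import numpy as np

def make_ensemble_splits(num_samples, num_models, val_ratio=0.2, seed=0):
    """Assign each ensemble member its own random train/validation split of the
    collected transitions, so the members are trained on decorrelated data."""
    rng = np.random.default_rng(seed)
    n_val = int(num_samples * val_ratio)
    splits = []
    for _ in range(num_models):
        perm = rng.permutation(num_samples)          # fresh shuffle per model
        splits.append({"val": perm[:n_val], "train": perm[n_val:]})
    return splits

# Example: 5 dynamics models, 10,000 transitions, 20% held out per model.
splits = make_ensemble_splits(10_000, 5)
```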
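Regarding the Experiment Setup row: the piecewise definition of $\hat{r}_\theta$ is an asymmetric, PPO-style clipping of the importance ratio. The PyTorch sketch below shows only that clipping term; it omits the uncertainty-related terms weighted by α and β that enter the paper's full objective $L_\pi(\theta)$, and ϵ = 0.2 is a placeholder rather than a value reported in the paper.

```python
import torch

def clipped_ratio(ratio, advantage, epsilon=0.2):
    """Asymmetric clipping of the importance ratio, matching the piecewise
    definition quoted above: clip from below at 1 - eps where the old
    advantage is negative, and from above at 1 + eps where it is positive."""
    return torch.where(
        advantage < 0,
        torch.clamp(ratio, min=1.0 - epsilon),
        torch.clamp(ratio, max=1.0 + epsilon),
    )

def surrogate_loss(ratio, advantage, epsilon=0.2):
    # Maximize E[r_hat * A]; the sign is flipped so a minimizer such as Adam can be used.
    return -(clipped_ratio(ratio, advantage, epsilon) * advantage).mean()
```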