Deep Model-Based Reinforcement Learning via Estimated Uncertainty and Conservative Policy Optimization

Authors: Qi Zhou, Houqiang Li, Jie Wang

AAAI 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments show POMBU can outperform existing state-of-the-art policy optimization algorithms in terms of sample efficiency and asymptotic performance. Moreover, the experiments demonstrate the excellent robustness of POMBU compared to previous model-based approaches. In this section, we first evaluate our uncertainty estimation method. Second, we compare POMBU to the state of the art. Then, we show how the estimated uncertainty works through an ablation study. Finally, we analyze the robustness of our method empirically.
Researcher Affiliation | Academia | Qi Zhou, Houqiang Li, Jie Wang; University of Science and Technology of China; zhouqida@mail.ustc.edu.cn, {lihq, jiewangx}@ustc.edu.cn
Pseudocode | Yes | Algorithm 1: Uncertainty Estimation for Q-values; Algorithm 2: POMBU. (A simplified ensemble-based sketch of the uncertainty-estimation idea appears after this table.)
Open Source Code | Yes | The source code and appendix of this work are available at https://github.com/MIRALab-USTC/RL-POMBU.
Open Datasets | No | The paper uses continuous control tasks in MuJoCo environments (Swimmer, HalfCheetah, Ant, Walker2d), from which data is sampled during training. While the MuJoCo environments are publicly available, the paper does not provide a specific link or citation to a pre-collected, static public dataset used for training, which is what 'open dataset' typically refers to.
Dataset Splits | No | The paper mentions that 'Different models are trained with different train-validation split' in Section 5 (Model Ensemble), but does not provide specific details about these splits (e.g., percentages, sample counts, or a reference to a standard split) needed for reproducibility. The data itself is generated on the fly from the environment, not drawn from a fixed dataset with pre-defined splits. (A sketch of per-model splits appears after this table.)
Hardware Specification | Yes | We conduct all experiments with one NVIDIA GTX 2080Ti GPU.
Software Dependencies | No | The paper mentions Adam as the optimizer ('optimize the parameter using Adam (Kingma and Ba 2014)') but does not provide version numbers for any software components, libraries, or frameworks (e.g., TensorFlow, PyTorch, Python).
Experiment Setup | Yes | We evaluate POMBU with α = 0.5 and β = 10 for all tasks. Here, we define $\hat{r}_\theta(s_h, a_h)$ as $\max(1 - \epsilon,\, r_\theta(s_h, a_h))$ if $A^h_{\text{old}}(s_h, a_h) < 0$, and $\min(1 + \epsilon,\, r_\theta(s_h, a_h))$ if $A^h_{\text{old}}(s_h, a_h) > 0$, in which $\epsilon > 0$ is a hyperparameter. We use a Gaussian policy whose mean is computed by a feed-forward neural network and whose standard deviation is represented by a vector of parameters. We optimize all parameters by maximizing $L_\pi(\theta)$ via Adam. (See the clipping sketch after this table.)
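Regarding the Pseudocode row: the paper's Algorithm 1 estimates the uncertainty of Q-values using an ensemble of learned dynamics models. The snippet below is a deliberately simplified sketch of that ensemble idea, not the paper's exact algorithm; the function names, the Monte Carlo rollout scheme, and the use of return variance as the uncertainty proxy are illustrative assumptions.

```python
import numpy as np

def estimate_q_uncertainty(models, policy, reward_fn, s0, horizon, gamma=0.99):
    """Monte Carlo Q estimate for state s0 and its spread across the ensemble.

    models    : list of learned dynamics models, each a callable f(s, a) -> next state
    policy    : callable pi(s) -> action
    reward_fn : callable r(s, a) -> scalar reward
    """
    returns = []
    for model in models:
        s, ret, discount = s0, 0.0, 1.0
        for _ in range(horizon):
            a = policy(s)
            ret += discount * reward_fn(s, a)
            s = model(s, a)          # imagined transition under this ensemble member
            discount *= gamma
        returns.append(ret)
    returns = np.asarray(returns)
    # Disagreement between ensemble members serves as a rough uncertainty signal.
    return returns.mean(), returns.var()
```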
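Regarding the Dataset Splits row: the paper only states that each ensemble member is trained with its own train-validation split, without giving sizes. Below is a minimal sketch of how such per-model splits could be generated; the 20% validation ratio and the helper name are assumptions for illustration, not values from the paper.

```python
import numpy as np

def make_ensemble_splits(num_samples, num_models, val_ratio=0.2, seed=0):
    """Assign each ensemble member its own random train/validation split of the
    collected transitions, so the members are trained on decorrelated data."""
    rng = np.random.default_rng(seed)
    n_val = int(num_samples * val_ratio)
    splits = []
    for _ in range(num_models):
        perm = rng.permutation(num_samples)          # fresh shuffle per model
        splits.append({"val": perm[:n_val], "train": perm[n_val:]})
    return splits

# Example: 5 dynamics models, 10,000 transitions, 20% held out per model.
splits = make_ensemble_splits(10_000, 5)
```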
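Regarding the Experiment Setup row: the piecewise definition of $\hat{r}_\theta$ is an asymmetric, PPO-style clipping of the importance ratio. The PyTorch sketch below shows only that clipping term; it omits the uncertainty-related terms weighted by α and β that enter the paper's full objective $L_\pi(\theta)$, and ϵ = 0.2 is a placeholder rather than a value reported in the paper.

```python
import torch

def clipped_ratio(ratio, advantage, epsilon=0.2):
    """Asymmetric clipping of the importance ratio, matching the piecewise
    definition quoted above: clip from below at 1 - eps where the old
    advantage is negative, and from above at 1 + eps where it is positive."""
    return torch.where(
        advantage < 0,
        torch.clamp(ratio, min=1.0 - epsilon),
        torch.clamp(ratio, max=1.0 + epsilon),
    )

def surrogate_loss(ratio, advantage, epsilon=0.2):
    # Maximize E[r_hat * A]; the sign is flipped so a minimizer such as Adam can be used.
    return -(clipped_ratio(ratio, advantage, epsilon) * advantage).mean()
```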