Bayesian Optimistic Optimization: Optimistic Exploration for Model-based Reinforcement Learning

Authors: Chenyang Wu, Tianci Li, Zongzhang Zhang, Yang Yu

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We also develop techniques for effective optimization and show through some simulation experiments that BOO is competitive with the existing algorithms.
Researcher Affiliation | Collaboration | Chenyang Wu¹, Tianci Li¹, Zongzhang Zhang¹, Yang Yu¹,². ¹National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China; ²Pazhou Lab, Guangzhou, China. {wucy, litc}@lamda.nju.edu.cn, {zzzhang, yuy}@nju.edu.cn
Pseudocode | Yes | Algorithm 1 (OFU RL): 1: for episode k = 1, 2, ... do; 2: construct a confidence set $\mathcal{M}_k$ with $H_k$; 3: compute $\pi_k \leftarrow \arg\max_{\pi} \max_{M_k} V_1^{\pi, M_k}(s_1)$ s.t. $M_k \in \mathcal{M}_k$; 4: execute $\pi_k$ for an episode. (A minimal Python sketch of this loop is given after the table.)
Open Source Code | Yes | 3. If you ran experiments... (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes]
Open Datasets | Yes | "The River Swim is an MDP where states are organized in chains, and the agent can move left or right, as shown in Figure 1(a)." "The chain MDP is a variant of the River Swim, which has Gaussian rewards and relatively deterministic transitions, as shown in Figure 1(b)." "Random MDPs are tabular MDP models randomly generated from a prior distribution and used to test the general performance of the algorithm." (A sketch of a River Swim-style tabular MDP follows the table.)
Dataset Splits | No | The paper conducts experiments in RL environments over episodes and timesteps, using multiple random seeds and trials, but does not specify explicit training/validation/test dataset splits in the traditional sense.
Hardware Specification | No | The paper does not explicitly describe the specific hardware used (e.g., GPU/CPU models, memory) to run its experiments in the main text.
Software Dependencies | Yes | [53] P. Virtanen, R. Gommers, T. E. Oliphant, M. Haberland, T. Reddy, D. Cournapeau, E. Burovski, P. Peterson, W. Weckesser, J. Bright, S. J. van der Walt, M. Brett, J. Wilson, K. J. Millman, N. Mayorov, A. R. J. Nelson, E. Jones, R. Kern, E. Larson, C. J. Carey, I. Polat, Y. Feng, E. W. Moore, J. VanderPlas, D. Laxalde, J. Perktold, R. Cimrman, I. Henriksen, E. A. Quintero, C. R. Harris, A. M. Archibald, A. H. Ribeiro, F. Pedregosa, P. van Mulbregt, and SciPy 1.0 Contributors, "SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python," Nature Methods, vol. 17, pp. 261-272, 2020.
Experiment Setup | Yes | We propose some techniques to improve the optimization efficiency of BOO, and conduct ablation experiments in Section G to verify the effectiveness of our proposed methods. Two of the most effective techniques are described below... Mean Reward Bonus... Entropy Regularization... starts with a high initial entropy regularization that is gradually annealed. The hyperparameter ζ controls the amount of entropy. (A hedged sketch of such an annealing schedule appears after the table.)
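
For concreteness, here is a minimal Python sketch of the generic OFU RL loop quoted in the Pseudocode row. The environment interface (`reset`, `step`, `horizon`) and the `build_confidence_set` / `optimistic_planning` callables are hypothetical placeholders for steps 2 and 3 of Algorithm 1, not the paper's BOO implementation.

```python
def ofu_rl(env, num_episodes, build_confidence_set, optimistic_planning):
    """Skeleton of the OFU RL episode loop (hedged sketch).

    `build_confidence_set` and `optimistic_planning` are placeholder
    callables for steps 2 and 3; BOO instantiates them with Bayesian
    optimistic components not reproduced here.
    """
    history = []  # H_k: all (s, a, r, s') transitions observed so far
    for k in range(num_episodes):
        # Step 2: construct a confidence set M_k from the history H_k.
        confidence_set = build_confidence_set(history)
        # Step 3: pi_k <- argmax_pi max_{M in M_k} V_1^{pi, M}(s_1),
        # i.e. plan optimistically over models in the confidence set.
        policy = optimistic_planning(confidence_set)
        # Step 4: execute pi_k for one episode and record the transitions.
        state = env.reset()
        for t in range(env.horizon):
            action = policy(state, t)
            next_state, reward = env.step(action)
            history.append((state, action, reward, next_state))
            state = next_state
    return history
```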
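
The Open Datasets row describes the River Swim and chain MDPs. The sketch below builds tabular transition and reward arrays for a River Swim-style chain using a common parameterization; the number of states, transition probabilities, and reward values are assumptions and may differ from the paper's exact setup.

```python
import numpy as np

def make_river_swim(n_states=6, p_right=0.6, p_stay=0.35,
                    small_reward=5 / 1000, large_reward=1.0):
    """River Swim-style tabular MDP (common parameterization, not the
    paper's exact values).

    Returns a transition tensor P[a, s, s'] and a reward matrix R[s, a]
    for actions 0 = left, 1 = right.
    """
    P = np.zeros((2, n_states, n_states))
    R = np.zeros((n_states, 2))
    for s in range(n_states):
        # Action 0 (left): deterministic move toward the start of the chain.
        P[0, s, max(s - 1, 0)] = 1.0
        # Action 1 (right): swimming against the current succeeds only with
        # some probability; otherwise the agent stays or drifts back.
        if s < n_states - 1:
            P[1, s, s + 1] = p_right
            P[1, s, s] = p_stay
            P[1, s, max(s - 1, 0)] += 1.0 - p_right - p_stay
        else:
            P[1, s, s] = p_right + p_stay
            P[1, s, s - 1] = 1.0 - p_right - p_stay
    R[0, 0] = small_reward            # small reward for staying at the left end
    R[n_states - 1, 1] = large_reward  # large reward for reaching the right end
    return P, R
```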
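
The Experiment Setup row mentions an entropy-regularization coefficient that starts high and is gradually annealed, controlled by the hyperparameter ζ. The snippet below is a hedged sketch of one such schedule; the exponential decay form, the decay rate, and the way the bonus enters the objective are assumptions, not the paper's exact formulation.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (natural log) of a discrete distribution."""
    p = np.clip(p, 1e-12, 1.0)
    return -np.sum(p * np.log(p))

def annealed_entropy_coef(zeta, episode, decay=0.99):
    """Entropy coefficient that starts high and is gradually annealed.

    `zeta` stands in for the paper's hyperparameter controlling the amount
    of entropy; the exponential schedule is an assumption of this sketch.
    """
    return zeta * decay ** episode

def regularized_objective(expected_return, action_probs, zeta, episode):
    """Planning objective with an annealed entropy bonus (sketch only)."""
    return expected_return + annealed_entropy_coef(zeta, episode) * entropy(action_probs)

# The bonus shrinks as training progresses, so exploration pressure fades out.
probs = np.array([0.25, 0.25, 0.5])
print(regularized_objective(1.0, probs, zeta=0.5, episode=0))    # large bonus early
print(regularized_objective(1.0, probs, zeta=0.5, episode=200))  # nearly no bonus late
```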