Bayesian Optimistic Optimization: Optimistic Exploration for Model-based Reinforcement Learning

Authors: Chenyang Wu, Tianci Li, Zongzhang Zhang, Yang Yu

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We also develop techniques for effective optimization and show through some simulation experiments that BOO is competitive with the existing algorithms.
Researcher Affiliation | Collaboration | Chenyang Wu¹, Tianci Li¹, Zongzhang Zhang¹, Yang Yu¹,². ¹National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China; ²Pazhou Lab, Guangzhou, China. {wucy, litc}@lamda.nju.edu.cn, {zzzhang, yuy}@nju.edu.cn
Pseudocode | Yes | Algorithm 1 (OFU RL): 1: for episode k = 1, 2, ... do; 2: construct a confidence set $\mathcal{M}_k$ with $H_k$; 3: compute $\pi_k \leftarrow \arg\max_{\pi} \max_{M_k} V_1^{\pi, M_k}(s_1)$ s.t. $M_k \in \mathcal{M}_k$; 4: execute $\pi_k$ for an episode. (A minimal Python sketch of this loop is given after the table.)
Open Source Code | Yes | 3. If you ran experiments... (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes]
Open Datasets | Yes | "The River Swim is an MDP where states are organized in chains, and the agent can move left or right, as shown in Figure 1(a)." "The chain MDP is a variant of the River Swim, which has Gaussian rewards and relatively deterministic transitions, as shown in Figure 1(b)." "Random MDPs are tabular MDP models randomly generated from a prior distribution and used to test the general performance of the algorithm." (A sketch of a River Swim-style tabular MDP follows the table.)
Dataset Splits | No | The paper conducts experiments in RL environments over episodes and timesteps, using multiple random seeds and trials, but does not specify explicit training/validation/test dataset splits in the traditional sense.
Hardware Specification | No | The paper does not explicitly describe the specific hardware used (e.g., GPU/CPU models, memory) to run its experiments in the main text.
Software Dependencies | Yes | [53] P. Virtanen, R. Gommers, T. E. Oliphant, M. Haberland, T. Reddy, D. Cournapeau, E. Burovski, P. Peterson, W. Weckesser, J. Bright, S. J. van der Walt, M. Brett, J. Wilson, K. J. Millman, N. Mayorov, A. R. J. Nelson, E. Jones, R. Kern, E. Larson, C. J. Carey, I. Polat, Y. Feng, E. W. Moore, J. VanderPlas, D. Laxalde, J. Perktold, R. Cimrman, I. Henriksen, E. A. Quintero, C. R. Harris, A. M. Archibald, A. H. Ribeiro, F. Pedregosa, P. van Mulbregt, and SciPy 1.0 Contributors, "SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python," Nature Methods, vol. 17, pp. 261-272, 2020.
Experiment Setup | Yes | We propose some techniques to improve the optimization efficiency of BOO, and conduct ablation experiments in Section G to verify the effectiveness of our proposed methods. Two of the most effective techniques are described below... Mean Reward Bonus... Entropy Regularization... starts with a high initial entropy regularization that is gradually annealed. The hyperparameter ζ controls the amount of entropy. (A hedged sketch of such an annealing schedule appears after the table.)
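
For concreteness, here is a minimal Python sketch of the generic OFU RL loop quoted in the Pseudocode row. The environment interface (`reset`, `step`, `horizon`) and the `build_confidence_set` / `optimistic_planning` callables are hypothetical placeholders for steps 2 and 3 of Algorithm 1, not the paper's BOO implementation.

```python
def ofu_rl(env, num_episodes, build_confidence_set, optimistic_planning):
    """Skeleton of the OFU RL episode loop (hedged sketch).

    `build_confidence_set` and `optimistic_planning` are placeholder
    callables for steps 2 and 3; BOO instantiates them with Bayesian
    optimistic components not reproduced here.
    """
    history = []  # H_k: all (s, a, r, s') transitions observed so far
    for k in range(num_episodes):
        # Step 2: construct a confidence set M_k from the history H_k.
        confidence_set = build_confidence_set(history)
        # Step 3: pi_k <- argmax_pi max_{M in M_k} V_1^{pi, M}(s_1),
        # i.e. plan optimistically over models in the confidence set.
        policy = optimistic_planning(confidence_set)
        # Step 4: execute pi_k for one episode and record the transitions.
        state = env.reset()
        for t in range(env.horizon):
            action = policy(state, t)
            next_state, reward = env.step(action)
            history.append((state, action, reward, next_state))
            state = next_state
    return history
```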
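
The Open Datasets row describes the River Swim and chain MDPs. The sketch below builds tabular transition and reward arrays for a River Swim-style chain using a common parameterization; the number of states, transition probabilities, and reward values are assumptions and may differ from the paper's exact setup.

```python
import numpy as np

def make_river_swim(n_states=6, p_right=0.6, p_stay=0.35,
                    small_reward=5 / 1000, large_reward=1.0):
    """River Swim-style tabular MDP (common parameterization, not the
    paper's exact values).

    Returns a transition tensor P[a, s, s'] and a reward matrix R[s, a]
    for actions 0 = left, 1 = right.
    """
    P = np.zeros((2, n_states, n_states))
    R = np.zeros((n_states, 2))
    for s in range(n_states):
        # Action 0 (left): deterministic move toward the start of the chain.
        P[0, s, max(s - 1, 0)] = 1.0
        # Action 1 (right): swimming against the current succeeds only with
        # some probability; otherwise the agent stays or drifts back.
        if s < n_states - 1:
            P[1, s, s + 1] = p_right
            P[1, s, s] = p_stay
            P[1, s, max(s - 1, 0)] += 1.0 - p_right - p_stay
        else:
            P[1, s, s] = p_right + p_stay
            P[1, s, s - 1] = 1.0 - p_right - p_stay
    R[0, 0] = small_reward            # small reward for staying at the left end
    R[n_states - 1, 1] = large_reward  # large reward for reaching the right end
    return P, R
```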
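
The Experiment Setup row mentions an entropy-regularization coefficient that starts high and is gradually annealed, controlled by the hyperparameter ζ. The snippet below is a hedged sketch of one such schedule; the exponential decay form, the decay rate, and the way the bonus enters the objective are assumptions, not the paper's exact formulation.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (natural log) of a discrete distribution."""
    p = np.clip(p, 1e-12, 1.0)
    return -np.sum(p * np.log(p))

def annealed_entropy_coef(zeta, episode, decay=0.99):
    """Entropy coefficient that starts high and is gradually annealed.

    `zeta` stands in for the paper's hyperparameter controlling the amount
    of entropy; the exponential schedule is an assumption of this sketch.
    """
    return zeta * decay ** episode

def regularized_objective(expected_return, action_probs, zeta, episode):
    """Planning objective with an annealed entropy bonus (sketch only)."""
    return expected_return + annealed_entropy_coef(zeta, episode) * entropy(action_probs)

# The bonus shrinks as training progresses, so exploration pressure fades out.
probs = np.array([0.25, 0.25, 0.5])
print(regularized_objective(1.0, probs, zeta=0.5, episode=0))    # large bonus early
print(regularized_objective(1.0, probs, zeta=0.5, episode=200))  # nearly no bonus late
```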