Bayesian Optimistic Optimization: Optimistic Exploration for Model-based Reinforcement Learning
Authors: Chenyang Wu, Tianci Li, Zongzhang Zhang, Yang Yu
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We also develop techniques for effective optimization and show through some simulation experiments that BOO is competitive with the existing algorithms. |
| Researcher Affiliation | Collaboration | Chenyang Wu¹, Tianci Li¹, Zongzhang Zhang¹, Yang Yu¹,² (¹National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China; ²Pazhou Lab, Guangzhou, China). {wucy, litc}@lamda.nju.edu.cn, {zzzhang, yuy}@nju.edu.cn |
| Pseudocode | Yes | Algorithm 1 (OFU RL): for episode k = 1, 2, ... do: (1) construct a confidence set M_k from the history H_k; (2) compute pi_k = argmax_pi max_{M in M_k} V_1^{pi, M}(s_1); (3) execute pi_k for an episode. (Reconstructed from garbled PDF extraction; a runnable sketch follows this table.) |
| Open Source Code | Yes | 3. If you ran experiments... (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] |
| Open Datasets | Yes | "The River Swim is an MDP where states are organized in chains, and the agent can move left or right, as shown in Figure 1(a)." "The chain MDP is a variant of the River Swim, which has Gaussian rewards and relatively deterministic transitions, as shown in Figure 1(b)." "Random MDPs are tabular MDP models randomly generated from a prior distribution and used to test the general performance of the algorithm." (A minimal River Swim sketch follows this table.) |
| Dataset Splits | No | The paper conducts experiments in RL environments over episodes and timesteps, using multiple random seeds and trials, but does not specify explicit training/validation/test dataset splits in the traditional sense, nor does it mention a dedicated 'validation' set. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware used (e.g., GPU/CPU models, memory) to run its experiments in the main text. |
| Software Dependencies | Yes | [53] P. Virtanen, R. Gommers, T. E. Oliphant, M. Haberland, T. Reddy, D. Cournapeau, E. Burovski, P. Peterson, W. Weckesser, J. Bright, S. J. van der Walt, M. Brett, J. Wilson, K. J. Millman, N. Mayorov, A. R. J. Nelson, E. Jones, R. Kern, E. Larson, C. J. Carey, I. Polat, Y. Feng, E. W. Moore, J. VanderPlas, D. Laxalde, J. Perktold, R. Cimrman, I. Henriksen, E. A. Quintero, C. R. Harris, A. M. Archibald, A. H. Ribeiro, F. Pedregosa, P. van Mulbregt, and SciPy 1.0 Contributors, "SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python," Nature Methods, vol. 17, pp. 261–272, 2020. |
| Experiment Setup | Yes | We propose some techniques to improve the optimization efficiency of BOO, and conduct ablation experiments to verify the effectiveness of our proposed methods in Section G. Two of the most effective techniques are described below... Mean Reward Bonus... Entropy Regularization... starts with a high initial entropy regularization and gradually anneals it. The hyperparameter ζ controls the amount of entropy. (A sketch of such an annealing schedule follows this table.) |
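
The OFU RL pseudocode quoted above arrives garbled by PDF extraction, so here is a minimal runnable Python sketch of the same episodic loop. The count-based reward bonus used as the optimism mechanism is an assumption standing in for the paper's confidence-set construction (BOO itself uses a Bayesian formulation); `env` is assumed to be any object with `reset()` and `step(action)` returning `(next_state, reward, done)`.

```python
import numpy as np

def ofu_rl(env, n_states, n_actions, num_episodes, horizon, bonus_scale=1.0):
    """Episodic OFU RL loop (Algorithm 1, reconstructed as a sketch).

    The confidence set is realized implicitly, UCB-style: the empirical
    model plus a count-based reward bonus stands in for the inner
    "max over models in M_k". This realization is an assumption; the
    paper's BOO uses a Bayesian construction instead.
    """
    counts = np.zeros((n_states, n_actions))
    reward_sum = np.zeros((n_states, n_actions))
    trans_counts = np.zeros((n_states, n_actions, n_states))

    for k in range(num_episodes):
        # Steps 2-3: plan optimistically against the empirical model
        # with an exploration bonus (a simple confidence-set surrogate).
        n = np.maximum(counts, 1)
        r_hat = reward_sum / n + bonus_scale / np.sqrt(n)
        p_hat = trans_counts / n[..., None]
        p_hat[counts == 0] = 1.0 / n_states  # uniform guess for unseen (s, a)

        # Finite-horizon value iteration: q[t] holds Q-values at step t.
        q = np.zeros((horizon + 1, n_states, n_actions))
        for t in reversed(range(horizon)):
            v_next = q[t + 1].max(axis=1)
            q[t] = r_hat + p_hat @ v_next

        # Step 4: execute the greedy (optimistic) policy for one episode.
        s = env.reset()
        for t in range(horizon):
            a = int(q[t, s].argmax())
            s_next, r, done = env.step(a)
            counts[s, a] += 1
            reward_sum[s, a] += r
            trans_counts[s, a, s_next] += 1
            s = s_next
            if done:
                break
    return q
```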
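The Open Datasets row describes the River Swim chain MDP only in words. The sketch below writes it out as a small Python environment compatible with the `ofu_rl` loop above; the specific transition probabilities (0.6 forward, 0.35 stay, 0.05 back for the "right" action) and reward scales are illustrative assumptions, not values quoted from the paper.

```python
import numpy as np

class RiverSwim:
    """Minimal River Swim chain MDP (sketch).

    The 6-state chain layout follows the usual River Swim construction;
    the transition probabilities and reward magnitudes below are
    illustrative assumptions rather than the paper's exact parameters.
    """
    def __init__(self, n_states=6, seed=None):
        self.n = n_states
        self.rng = np.random.default_rng(seed)
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):  # action: 0 = left, 1 = right
        s = self.state
        if action == 0:
            # Swimming left (with the current) always succeeds.
            self.state = max(s - 1, 0)
        else:
            # Swimming right (against the current) is stochastic.
            u = self.rng.random()
            if u < 0.6:
                self.state = min(s + 1, self.n - 1)
            elif u < 0.95:
                self.state = s
            else:
                self.state = max(s - 1, 0)
        # Small reward at the left end, large reward at the right end.
        if s == 0 and action == 0:
            reward = 0.005
        elif s == self.n - 1 and action == 1:
            reward = 1.0
        else:
            reward = 0.0
        return self.state, reward, False
```

For instance, `ofu_rl(RiverSwim(), n_states=6, n_actions=2, num_episodes=100, horizon=20)` runs the optimistic loop on this chain.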
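The Experiment Setup row mentions starting with a high entropy regularization and annealing it, with ζ controlling the amount of entropy. Below is a minimal sketch of one such schedule and the resulting regularized objective; the linear decay to a small floor is an assumed form, since the quoted text does not specify the paper's exact annealing rule.

```python
import numpy as np

def entropy_coefficient(zeta, step, total_steps, floor=1e-3):
    """Annealed entropy-regularization weight (sketch).

    `zeta` plays the role of the hyperparameter controlling the amount
    of entropy; the linear decay to `floor` is an assumed schedule,
    not the paper's exact annealing rule.
    """
    frac = min(step / total_steps, 1.0)
    return max(zeta * (1.0 - frac), floor)

def entropy_regularized_objective(logits, values, zeta_t):
    """Expected value under the softmax policy plus zeta_t * entropy."""
    z = logits - logits.max()               # stabilized softmax
    probs = np.exp(z) / np.exp(z).sum()
    entropy = -np.sum(probs * np.log(probs + 1e-12))
    return probs @ values + zeta_t * entropy
```

Starting with a large ζ keeps the policy distribution diffuse early on (favoring exploration of the optimization landscape) and lets it sharpen toward a greedy choice as the coefficient decays.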