Bayesian Bellman Operators
Authors: Mattie Fellows, Kristian Hartikainen, Shimon Whiteson
NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we demonstrate that algorithms derived from the BBO framework have sophisticated deep exploration properties that enable them to solve continuous control tasks at which state-of-the-art regularised actor-critic algorithms fail catastrophically. [...] (Section 6, Experiments) Convergent Nonlinear Policy Evaluation: To confirm our convergence and consistency results under approximation, we evaluate BBO in several nonlinear policy evaluation experiments that are constructed to present a convergence challenge for TD algorithms. We verify the convergence of nonlinear Gaussian BBO in the famous counterexample task of Tsitsiklis and Van Roy [76], in which the TD(0) algorithm is provably divergent. The results are presented in Fig. 3. As expected, TD(0) diverges, while BBO converges to the optimal solution faster than the convergent frequentist nonlinear TDC and GTD2 [12]. We also consider three additional policy evaluation tasks commonly used to test convergence of nonlinear TD using neural network function approximators: 20-Link Pendulum [23], Puddle World [16], and Mountain Car [16]. Results are shown in Fig. 11 of Appendix H.3, from which we conclude that i) by ignoring the posterior's dependence on ω, existing model-free Bayesian approaches are less stable and perform poorly in comparison to the gradient-based MSBBE minimisation approach in Eq. (7), ii) regularisation from a prior can improve performance of policy evaluation by aiding the optimisation landscape [26], and iii) better solutions in terms of mean squared error can be found using BBO instead of the local linearisation approach of nonlinear TDC/GTD2 [12]. |
| Researcher Affiliation | Academia | Matthew Fellows, Kristian Hartikainen, Shimon Whiteson; Department of Computer Science, University of Oxford |
| Pseudocode | Yes | Algorithm 1 RP-BBAC: Initialise the ensemble and behavioural-policy parameters and the replay buffer D ← ∅; sample initial state s ~ P_0; while not converged do: sample policy π uniformly from the ensemble; for n ∈ {1, ..., N_env} do: sample action a ~ π(·\|s), observe next state s' ~ P(·\|s, a), observe reward r = r(s', a, s), D ← D ∪ {s, a, r, s'}; end for; ensemble parameters ← UPDATEPOSTERIOR; UPDATEBEHAVIOURALPOLICY; end while (an illustrative sketch of this loop appears after the table). |
| Open Source Code | Yes | 3. If you ran experiments... (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] Provided in the supplemental material. |
| Open Datasets | Yes | We consider a set of continuous control tasks with sparse rewards as continuous analogues of the discrete experiments used to test BootDQN+Prior [57]: MountainCarContinuous-v0 from the Gym benchmark suite and a slightly modified version of cartpole-swingup_sparse from the DeepMind Control Suite [73]. |
| Dataset Splits | No | The paper uses well-known benchmark environments but does not specify explicit training, validation, and test splits within these environments or for any datasets used; it evaluates directly on the environments instead. |
| Hardware Specification | No | The paper's checklist states: "Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [No]". The only mention of hardware is "The experiments were made possible by a generous equipment grant from NVIDIA.", which is not specific enough. |
| Software Dependencies | No | The paper does not provide version numbers for any software dependencies or libraries used in the experiments. It only notes that algorithmic details are given in Appendix G, which does not specify software versions. |
| Experiment Setup | Yes | Additional details are in Appendix G. [...] We discuss RP-BBAC's sensitivity to randomized prior hyperparameters in Appendix I.2. |
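
For orientation, below is a minimal Python sketch of the interaction loop described by the Algorithm 1 pseudocode quoted above, assuming a classic Gym-style environment API. The `EnsembleMember` class and the two update stubs are placeholders introduced for this sketch; the actual posterior update (gradient-based MSBBE minimisation, Eq. (7)) and behavioural-policy update are specified in the paper and its Appendix G, not here.

```python
# Illustrative sketch of the RP-BBAC loop (Algorithm 1 in the table above).
# Helper names (EnsembleMember, update_posterior, update_behavioural_policy)
# are placeholders for this sketch, not the authors' implementation.
import random
import gym  # assumed classic Gym API: reset() -> state, step() -> (s', r, done, info)


class EnsembleMember:
    """Placeholder for one posterior sample of the actor/critic parameters."""

    def __init__(self, action_space):
        self.action_space = action_space

    def sample_action(self, state):
        # Stand-in for a ~ pi(.|s); a real member would query its own policy network.
        return self.action_space.sample()


def update_posterior(ensemble, replay_buffer):
    """Placeholder for the posterior update (gradient-based MSBBE minimisation in the paper)."""


def update_behavioural_policy(ensemble, replay_buffer):
    """Placeholder for the behavioural-policy update."""


def rp_bbac_loop(env_name="MountainCarContinuous-v0",
                 ensemble_size=10, n_env=1000, n_iterations=100):
    env = gym.make(env_name)
    ensemble = [EnsembleMember(env.action_space) for _ in range(ensemble_size)]
    replay_buffer = []                                      # D <- empty set
    state = env.reset()                                     # s ~ P_0

    for _ in range(n_iterations):                           # "while not converged"
        policy = random.choice(ensemble)                    # pi ~ Unif(ensemble)
        for _ in range(n_env):                              # n in {1, ..., N_env}
            action = policy.sample_action(state)            # a ~ pi(.|s)
            next_state, reward, done, _ = env.step(action)  # s' ~ P(.|s, a), r = r(s', a, s)
            replay_buffer.append((state, action, reward, next_state))
            state = env.reset() if done else next_state
        update_posterior(ensemble, replay_buffer)
        update_behavioural_policy(ensemble, replay_buffer)
```

The key structural point the sketch preserves is that each data-collection phase uses a single policy drawn uniformly at random from the ensemble, which is what gives the method its deep-exploration behaviour; everything inside the two update stubs is left abstract.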