Risk-Averse Bayes-Adaptive Reinforcement Learning

Authors: Marc Rigter, Bruno Lacerda, Nick Hawes

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our experiments demonstrate that our approach significantly outperforms baseline approaches for this problem." "Our empirical results show that our algorithm significantly outperforms baseline methods on two domains."
Researcher Affiliation | Academia | Marc Rigter, Oxford Robotics Institute, University of Oxford (mrigter@robots.ox.ac.uk); Bruno Lacerda, Oxford Robotics Institute, University of Oxford (bruno@robots.ox.ac.uk); Nick Hawes, Oxford Robotics Institute, University of Oxford (nickh@robots.ox.ac.uk)
Pseudocode | Yes | Algorithm 1: Risk-Averse Bayes-Adaptive Monte Carlo Planning
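The paper's Algorithm 1 is not reproduced here. As a loose illustration of the Bayes-adaptive idea it builds on (drawing a model from the posterior for each simulation and evaluating actions under the sampled model), the following toy C++ sketch estimates action values for a two-armed Bernoulli bandit with Beta posteriors. It omits the tree search, progressive widening, and the risk-averse (CVaR) adversary that distinguish RA-BAMCP; the toy problem and all names are hypothetical.

```cpp
// Toy sketch of the posterior-sampling idea behind Bayes-adaptive Monte
// Carlo planning: for each simulation, draw a model from the posterior and
// evaluate each action under that sampled model. This is NOT the paper's
// Algorithm 1 -- no tree search, no progressive widening, no CVaR adversary.
#include <array>
#include <iostream>
#include <random>
#include <utility>

// Draw a Beta(a, b) sample via two Gamma draws.
double sample_beta(double a, double b, std::mt19937& rng) {
    std::gamma_distribution<double> ga(a), gb(b);
    double x = ga(rng), y = gb(rng);
    return x / (x + y);
}

int main() {
    std::mt19937 rng(0);

    // Beta posteriors over the success probability of two Bernoulli arms,
    // e.g. Beta(4, 2) after 3 successes / 1 failure and Beta(2, 2) after
    // 1 success / 1 failure, starting from a uniform prior.
    const std::array<std::pair<double, double>, 2> posterior{{{4.0, 2.0}, {2.0, 2.0}}};

    const int n_sims = 100000;
    std::array<double, 2> value{0.0, 0.0};

    for (int s = 0; s < n_sims; ++s) {
        for (int a = 0; a < 2; ++a) {
            // "Root sampling": draw this arm's success probability from its
            // posterior, then simulate one pull under the sampled model.
            double p = sample_beta(posterior[a].first, posterior[a].second, rng);
            value[a] += std::bernoulli_distribution(p)(rng) ? 1.0 : 0.0;
        }
    }

    for (int a = 0; a < 2; ++a)
        std::cout << "estimated value of arm " << a << ": "
                  << value[a] / n_sims << "\n";
    return 0;
}
```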
Open Source Code | Yes | "Code for the experiments is included in the supplementary material."
Open Datasets | Yes | "Betting Game Domain: We adapt this domain from the literature on conditional value at risk in Markov decision processes [3] to the Bayes-adaptive setting." ... [3] Nicole Bäuerle and Jonathan Ott. Markov decision processes with average-value-at-risk criteria. Mathematical Methods of Operations Research, 74(3):361-379, 2011.
Dataset Splits | No | The paper describes experiments in reinforcement learning environments (Betting Game and Autonomous Car Navigation) where agents interact with a simulated environment. While it mentions 'training' for the policy gradient methods, it does not specify traditional training/test/validation dataset splits as would be found with static datasets. The environments are dynamic and generate data through interaction.
Hardware Specification | Yes | "Computation times are reported for a 3.2 GHz Intel i7 processor with 64 GB of RAM."
Software Dependencies | No | "All algorithms are implemented in C++ and Gurobi is used to solve linear programs for the value iteration (VI) approaches." The paper mentions software tools (C++, Gurobi) but does not provide specific version numbers for them.
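For context on the Gurobi dependency: inner optimisations in CVaR value iteration are commonly written in the dual (risk-envelope) form of CVaR, which is a small linear program. The snippet below is a hypothetical illustration of solving such an LP with Gurobi's C++ API; it is an assumption about how the dependency might be used, not code from the paper's supplementary material.

```cpp
// Sketch of the kind of inner LP a CVaR value-iteration step might hand to
// Gurobi (dual / risk-envelope form of CVaR). Hypothetical illustration only.
#include "gurobi_c++.h"
#include <iostream>
#include <vector>

// CVaR_alpha of successor values V under transition probabilities p:
//   min_xi  sum_i p_i * xi_i * V_i
//   s.t.    0 <= xi_i <= 1/alpha,   sum_i p_i * xi_i = 1
double cvar_lp(const std::vector<double>& p,
               const std::vector<double>& V,
               double alpha) {
    GRBEnv env;
    GRBModel model(env);
    std::vector<GRBVar> xi(p.size());
    GRBLinExpr objective = 0, probMass = 0;
    for (size_t i = 0; i < p.size(); ++i) {
        xi[i] = model.addVar(0.0, 1.0 / alpha, 0.0, GRB_CONTINUOUS);
        objective += p[i] * V[i] * xi[i];
        probMass  += p[i] * xi[i];
    }
    model.addConstr(probMass == 1.0);            // distorted weights remain a distribution
    model.setObjective(objective, GRB_MINIMIZE); // adversary picks the worst-case distortion
    model.optimize();
    return model.get(GRB_DoubleAttr_ObjVal);
}

int main() {
    try {
        // Two successor states with values 0 and 10, each reached with
        // probability 0.5; CVaR at level alpha = 0.25 is 0 (worst 25% tail).
        std::cout << cvar_lp({0.5, 0.5}, {0.0, 10.0}, 0.25) << std::endl;
    } catch (GRBException& e) {
        std::cerr << e.getMessage() << std::endl;
    }
    return 0;
}
```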
Experiment Setup | Yes | "For RA-BAMCP and BAMCP the MCTS exploration parameter was set to c_mcts = 2 based on empirical performance. For these methods, we performed 100,000 simulations for the initial step and 25,000 simulations per step thereafter." ... "For RA-BAMCP, the rollout policy for the adversary was random, the progressive widening parameter was set to τ = 0.2, and the exploration parameter for the GP Bayesian optimisation was also set to c_bo = 2."
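For readability, the reported planning settings can be collected in a single configuration object. The C++ struct below is a hypothetical summary; the field names are illustrative and not taken from the released code.

```cpp
// Hypothetical configuration struct summarising the planning hyperparameters
// reported above (names are illustrative, not from the paper's code).
struct RABAMCPConfig {
    double c_mcts        = 2.0;     // UCB exploration constant for the MCTS tree policy
    int    sims_initial  = 100000;  // simulations for the first decision step
    int    sims_per_step = 25000;   // simulations for every step thereafter
    double tau           = 0.2;     // progressive-widening parameter for the adversary's actions
    double c_bo          = 2.0;     // exploration constant for the GP Bayesian optimisation
    bool   random_adversary_rollout = true; // adversary rollout policy is uniform random
};
```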