Scalable Bayesian Inverse Reinforcement Learning
Authors: Alex James Chan, Mihaela van der Schaar
ICLR 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Applying our method to real medical data alongside classic control simulations, we demonstrate Bayesian reward inference in environments beyond the scope of current methods, as well as task performance competitive with focused offline imitation learning algorithms. |
| Researcher Affiliation | Academia | Alex J. Chan, University of Cambridge, Cambridge, UK (alexjchan@maths.cam.ac.uk); Mihaela van der Schaar, University of Cambridge, Cambridge, UK; University of California, Los Angeles, USA; Cambridge Centre for AI in Medicine, UK; The Alan Turing Institute, London, UK (mv472@cam.ac.uk) |
| Pseudocode | Yes | Algorithm 1: Approximate Variational Reward Imitation Learning (AVRIL). (A hedged sketch of an AVRIL-style objective follows the table.) |
| Open Source Code | Yes | Code for AVRIL and our experiments is made available at https://github.com/XanderJC/scalable-birl and https://github.com/vanderschaarlab/mlforhealthlabpub. |
| Open Datasets | Yes | We evaluate the ability of the methods to learn a medical policy in both the two- and four-action settings: specifically, whether the patient should be placed on a ventilator, and the decision for ventilation in combination with antibiotic treatment. These represent the two most common, and important, clinical interventions recorded in the data. Without a recorded notion of reward, performance is measured with respect to action matching against a held-out test set of demonstrations with cross-validation. [...] Demonstrations are taken from the Medical Information Mart for Intensive Care (MIMIC-III) dataset (Johnson et al., 2016). [...] The standard control problems of Cart Pole, Acrobot, and Lunar Lander. In these settings, given sufficient demonstration data, all benchmarks are very much capable of reaching demonstrator-level performance, so we instead test their ability to handle sample complexity in the low-data regime by adjusting the number of trajectories they are given access to, replicating the setup in Jarrett et al. (2020). With access to a simulation through OpenAI Gym (Brockman et al., 2016), we measure performance by deploying the learnt policies live and calculating their average return over 300 episodes. (See the evaluation sketch after the table.) |
| Dataset Splits | Yes | Without a recorded notion of reward, performance is measured with respect to action matching against a held-out test set of demonstrations with cross-validation. (See the action-matching sketch after the table.) |
| Hardware Specification | No | The paper mentions that 'DQN training for example easily stretches into hours' but does not provide any specific hardware details such as GPU models, CPU types, or cloud instance specifications used for their experiments. |
| Software Dependencies | No | The paper mentions using 'Adam (Kingma & Ba, 2014)', 'Open AI Stable Baselines (Hill et al., 2018)', and 'Open AI gym (Brockman et al., 2016)', and references code for benchmarks like VDICE (Kostrikov et al., 2019), DSFN (Lee et al., 2019), and EDM (Jarrett et al., 2020). However, it does not provide specific version numbers for any of these software dependencies. |
| Experiment Setup | Yes | For aid in comparison, all methods share the same network architecture of two hidden layers of 64 units with ELU activation functions and are trained using Adam (Kingma & Ba, 2014) with individually tuned learning rates. Further details on the experimental setup and the implementation of benchmarks can be found in the appendix. [...] All methods are neural network based and so in experiments they share the same architecture of 2 hidden layers of 64 units each, connected by exponential linear unit (ELU) activation functions. (A minimal sketch of this shared architecture follows the table.) |
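
The Experiment Setup row reports that every method shares two hidden layers of 64 units with ELU activations and is trained with Adam at an individually tuned learning rate. Below is a minimal sketch of that shared setup, not the authors' code; the dimensions and the learning rate are placeholders for illustration.

```python
# Minimal sketch of the shared setup reported in the Experiment Setup row:
# two hidden layers of 64 units with ELU activations, trained with Adam.
# Dimensions and learning rate are placeholders; the paper tunes rates per method.
import torch
import torch.nn as nn

def make_network(in_dim: int, out_dim: int) -> nn.Module:
    """Two 64-unit hidden layers with ELU activations, as shared by all benchmarked methods."""
    return nn.Sequential(
        nn.Linear(in_dim, 64), nn.ELU(),
        nn.Linear(64, 64), nn.ELU(),
        nn.Linear(64, out_dim),
    )

# Example: a Q-network for Cart Pole (4-dimensional observation, 2 actions).
q_net = make_network(4, 2)
optimiser = torch.optim.Adam(q_net.parameters(), lr=1e-3)  # rate tuned per method in the paper
```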
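The Pseudocode row refers to Algorithm 1 (AVRIL). For orientation, here is a hedged sketch of the kind of variational objective the paper describes: a Q-network induces a Boltzmann policy over actions, an encoder parameterises a Gaussian variational posterior over per-state-action rewards, and the loss combines the action log-likelihood, a KL term against a standard-normal prior, and a term tying the temporal-difference residual of the Q-network to the reward posterior. The names `avril_loss`, `q_net`, `enc_net` and the single weight `lam` are illustrative assumptions, not the authors' implementation; the repository linked in the Open Source Code row is the reference. A network like `make_network` above can serve as both `q_net` (one output per action) and `enc_net` (two outputs: reward mean and log-variance).

```python
# Hedged sketch of an AVRIL-style variational objective (illustrative, not the authors' code).
# q_net maps states to one value per action; enc_net maps a (state, one-hot action) pair to
# two values: the mean and log-variance of a Gaussian posterior over the reward at that pair.
import torch
import torch.nn.functional as F

def avril_loss(q_net, enc_net, s, a, s_next, a_next, gamma=0.99, lam=1.0):
    """s, s_next: float tensors (B, state_dim); a, a_next: long tensors (B,) of demonstrated actions."""
    q = q_net(s)                                       # (B, num_actions)
    log_pi = F.log_softmax(q, dim=-1)                  # Boltzmann policy induced by the Q-values
    nll = -log_pi.gather(1, a.unsqueeze(1)).mean()     # likelihood of the demonstrated actions

    # Diagonal-Gaussian variational posterior over the reward at (s, a).
    a_onehot = F.one_hot(a, num_classes=q.shape[-1]).float()
    mu, log_var = enc_net(torch.cat([s, a_onehot], dim=-1)).chunk(2, dim=-1)
    kl = 0.5 * (mu.pow(2) + log_var.exp() - log_var - 1.0).mean()   # KL to a N(0, 1) prior

    # Encourage the temporal-difference residual of Q to be plausible under the reward posterior.
    q_sa = q.gather(1, a.unsqueeze(1))
    q_next = q_net(s_next).gather(1, a_next.unsqueeze(1))
    td = q_sa - gamma * q_next
    log_prob_td = -0.5 * ((td - mu).pow(2) / log_var.exp() + log_var).mean()

    return nll + lam * (kl - log_prob_td)
```

The paper additionally specifies a Boltzmann temperature and separate weightings for the KL and TD terms; this sketch only conveys the overall structure of the objective.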
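For the control tasks, the Open Datasets row states that learnt policies are deployed live in OpenAI Gym and scored by their average return over 300 episodes. A loop of roughly the following shape reproduces that protocol; the greedy action selection and the classic pre-0.26 Gym reset/step API are assumptions.

```python
# Sketch of the control-task evaluation: deploy the learnt policy live and average
# its return over 300 episodes. Assumes the classic Gym API (reset() -> obs,
# step() -> (obs, reward, done, info)); newer gym/gymnasium versions differ.
import gym
import numpy as np
import torch

def evaluate(q_net, env_name="CartPole-v1", episodes=300):
    env = gym.make(env_name)
    returns = []
    for _ in range(episodes):
        obs, done, total = env.reset(), False, 0.0
        while not done:
            with torch.no_grad():
                q = q_net(torch.as_tensor(obs, dtype=torch.float32))
            obs, reward, done, _ = env.step(int(q.argmax().item()))  # greedy action from the learnt Q
            total += reward
        returns.append(total)
    return float(np.mean(returns))
```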
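For the MIMIC-III setting there is no simulator, so the Dataset Splits row notes that performance is action-matching accuracy against held-out demonstrations under cross-validation. A hedged sketch of that metric follows; the array names are placeholders for the demonstration data.

```python
# Sketch of the action-matching metric used for the MIMIC-III demonstrations:
# the fraction of held-out state-action pairs on which the greedy learnt policy
# agrees with the clinician's recorded action.
import numpy as np
import torch

def action_matching(q_net, states, actions):
    """states: (N, state_dim) array of held-out states; actions: (N,) array of recorded action indices."""
    with torch.no_grad():
        q = q_net(torch.as_tensor(states, dtype=torch.float32))  # (N, num_actions)
    predicted = q.argmax(dim=-1).numpy()
    return float(np.mean(predicted == np.asarray(actions)))
```

Under cross-validation, this accuracy would be averaged over the held-out folds.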