Strictly Batch Imitation Learning by Energy-based Distribution Matching

Authors: Daniel Jarrett, Ioana Bica, Mihaela van der Schaar

NeurIPS 2020

Reproducibility Variable Result LLM Response
Research Type Experimental Through experiments with application to control and healthcare settings, we illustrate consistent performance gains over existing algorithms for strictly batch imitation learning.
Researcher Affiliation Academia Daniel Jarrett (University of Cambridge; daniel.jarrett@maths.cam.ac.uk), Ioana Bica (University of Oxford; The Alan Turing Institute; ioana.bica@eng.ox.ac.uk), Mihaela van der Schaar (University of Cambridge; University of California, Los Angeles; The Alan Turing Institute; mv472@cam.ac.uk)
Pseudocode Yes Algorithm 1: Energy-based Distribution Matching for Strictly Batch Imitation Learning
Open Source Code Yes Algorithm 1 is implemented using the source code for joint EBMs [47] publicly available at [66], which already contains an implementation of SGLD.
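The SGLD sampler referenced above can be illustrated in a few lines. This is a generic sketch, not the paper's implementation: the quadratic toy energy and all hyperparameter values are placeholders chosen for demonstration.

```python
import numpy as np

def sgld_sample(grad_energy, x0, n_steps=500, step=0.05, sigma=0.05):
    """Stochastic Gradient Langevin Dynamics: gradient descent on the
    energy with injected Gaussian noise, so iterates approach samples
    from p(x) proportional to exp(-E(x))."""
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        x = x - step * grad_energy(x) + sigma * np.random.randn(*x.shape)
    return x

# Toy example: E(x) = 0.5 * ||x||^2, so grad E(x) = x and samples
# concentrate near the origin regardless of the starting point.
np.random.seed(0)
sample = sgld_sample(lambda x: x, x0=5.0 * np.ones(4))
```

The noise term is what distinguishes SGLD from plain gradient descent: without it the chain would collapse to the energy minimum instead of sampling around it.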
Open Datasets Yes For the former, we use OpenAI gym environments [56] of varying complexity from standard RL literature: Cart Pole, which balances a pendulum on a frictionless track [57], Acrobot, which swings a system of joints up to a given height [58], Beam Rider, which controls an Atari 2600 arcade space shooter [59], as well as Lunar Lander, which optimizes a rocket trajectory for successful landing [60]. ... For the healthcare application, we use MIMIC-III, a real-world medical dataset consisting of patients treated in intensive care units from the Medical Information Mart for Intensive Care [63]
Dataset Splits Yes Demonstrations D are sampled for use as input to train all algorithms, which are then evaluated using 300 live episodes (for OpenAI gym environments) or using a held-out test set (for MIMIC-III). This process is then repeated for a total of 50 times (using different D and randomly initialized networks)... For the MIMIC-III dataset, policies are trained and tested on demonstrations by way of cross-validation
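The evaluation protocol quoted above (train on sampled demonstrations, score over live episodes, repeat with fresh data and seeds) can be sketched as an outer loop. The sampler, trainer, and rollout functions below are hypothetical stand-ins, not the paper's code:

```python
import statistics

def evaluate_protocol(sample_demos, train, run_episode,
                      n_repeats=50, n_episodes=300):
    """Repeat: sample demonstrations D, train a fresh policy on D,
    then score it by its mean return over live episodes."""
    scores = []
    for seed in range(n_repeats):
        demos = sample_demos(seed)          # different D each repetition
        policy = train(demos, seed=seed)    # randomly initialized networks
        returns = [run_episode(policy) for _ in range(n_episodes)]
        scores.append(statistics.mean(returns))
    return statistics.mean(scores), statistics.stdev(scores)

# Stub components just to exercise the loop shape.
mean, sd = evaluate_protocol(
    sample_demos=lambda s: [("state", "action")],
    train=lambda demos, seed: None,
    run_episode=lambda policy: 1.0,
    n_repeats=5, n_episodes=10)
```

Reporting the mean and spread across repetitions, rather than a single run, is what makes the comparison across algorithms meaningful.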
Hardware Specification No The paper mentions that policies share the same network architecture and refers to using source code for various algorithms, but it does not specify any hardware details like CPU, GPU models, or memory for running the experiments.
Software Dependencies No The paper refers to 'OpenAI gym environments [56]', 'RL Baselines Zoo [61] in Stable OpenAI Baselines [62]', and mentions that 'Algorithm 1 is implemented using the source code for joint EBMs [47] publicly available at [66]'. It also mentions using DSFN [64] and VDICE [65]. However, specific version numbers for these software dependencies are not provided in the text.
Experiment Setup Yes Policies trained by all algorithms share the same network architecture: two hidden layers of 64 units each with ELU activation (or, for Atari, three convolutional layers with ReLU activation). ... VDICE is originally designed for Gaussian actions, so we replace the output layer of the actor with a Gumbel-softmax parameterization; offline learning is enabled by setting the replay regularization coefficient to zero. Algorithm 1 details the EDM optimization procedure, with a buffer B of fixed size, reinitialization frequency δ, and a set number of iterations, where s0 ∼ ρ0(s) is sampled uniformly. ... Algorithm 1: Input: SGLD hyperparameters (step size and σ), PCD hyperparameters (buffer size, reinitialization frequency δ), and mini-batch size N
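To make the shared architecture and the VDICE output-layer swap concrete, here is a hedged NumPy sketch: a two-hidden-layer, 64-unit ELU network producing action logits, plus a Gumbel-softmax relaxation over those logits. The initialization scheme, temperature, and dimensions are illustrative assumptions, not the paper's values:

```python
import numpy as np

def elu(x):
    # ELU activation: identity for x > 0, exp(x) - 1 otherwise.
    return np.where(x > 0, x, np.exp(np.minimum(x, 0)) - 1.0)

def init_mlp(obs_dim, n_actions, hidden=64, seed=0):
    # Two hidden layers of 64 units, matching the shared architecture.
    rng = np.random.default_rng(seed)
    dims = [obs_dim, hidden, hidden, n_actions]
    return [(rng.standard_normal((i, o)) / np.sqrt(i), np.zeros(o))
            for i, o in zip(dims[:-1], dims[1:])]

def logits(params, s):
    # Forward pass: hidden layers use ELU, output layer is linear.
    h = np.asarray(s, dtype=float)
    for W, b in params[:-1]:
        h = elu(h @ W + b)
    W, b = params[-1]
    return h @ W + b

def gumbel_softmax(logit, tau=1.0, rng=None):
    # Differentiable relaxation of sampling a discrete action:
    # add Gumbel noise to the logits, then apply a tempered softmax.
    rng = rng or np.random.default_rng()
    g = -np.log(-np.log(rng.uniform(size=logit.shape)))
    y = (logit + g) / tau
    y = np.exp(y - y.max())
    return y / y.sum()

params = init_mlp(obs_dim=4, n_actions=2)   # CartPole-sized toy dimensions
probs = gumbel_softmax(logits(params, np.ones(4)),
                       rng=np.random.default_rng(1))
```

The Gumbel-softmax output is a probability vector over discrete actions, which is what lets a Gaussian-action method like VDICE be adapted to the discrete-action environments used here.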