Strictly Batch Imitation Learning by Energy-based Distribution Matching
Authors: Daniel Jarrett, Ioana Bica, Mihaela van der Schaar
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through experiments with application to control and healthcare settings, we illustrate consistent performance gains over existing algorithms for strictly batch imitation learning. |
| Researcher Affiliation | Academia | Daniel Jarrett, University of Cambridge, daniel.jarrett@maths.cam.ac.uk; Ioana Bica, University of Oxford & The Alan Turing Institute, ioana.bica@eng.ox.ac.uk; Mihaela van der Schaar, University of Cambridge, University of California, Los Angeles & The Alan Turing Institute, mv472@cam.ac.uk |
| Pseudocode | Yes | Algorithm 1: Energy-based Distribution Matching for Strictly Batch Imitation Learning |
| Open Source Code | Yes | Algorithm 1 is implemented using the source code for joint EBMs [47] publicly available at [66], which already contains an implementation of SGLD. |
| Open Datasets | Yes | For the former, we use OpenAI gym environments [56] of varying complexity from standard RL literature: CartPole, which balances a pendulum on a frictionless track [57], Acrobot, which swings a system of joints up to a given height [58], BeamRider, which controls an Atari 2600 arcade space shooter [59], as well as LunarLander, which optimizes a rocket trajectory for successful landing [60]. ... For the healthcare application, we use MIMIC-III, a real-world medical dataset consisting of patients treated in intensive care units, from the Medical Information Mart for Intensive Care [63] |
| Dataset Splits | Yes | Demonstrations D are sampled for use as input to train all algorithms, which are then evaluated using 300 live episodes (for OpenAI gym environments) or using a held-out test set (for MIMIC-III). This process is then repeated for a total of 50 times (using different D and randomly initialized networks)... For the MIMIC-III dataset, policies are trained and tested on demonstrations by way of cross-validation |
| Hardware Specification | No | The paper mentions that policies share the same network architecture and refers to using source code for various algorithms, but it does not specify any hardware details like CPU, GPU models, or memory for running the experiments. |
| Software Dependencies | No | The paper refers to 'OpenAI gym environments [56]', 'RL Baselines Zoo [61] in Stable OpenAI Baselines [62]', and mentions that 'Algorithm 1 is implemented using the source code for joint EBMs [47] publicly available at [66]'. It also mentions using DSFN [64] and VDICE [65]. However, specific version numbers for these software dependencies are not provided in the text. |
| Experiment Setup | Yes | Policies trained by all algorithms share the same network architecture: two hidden layers of 64 units each with ELU activation (or, for Atari, three convolutional layers with ReLU activation). ... VDICE is originally designed for Gaussian actions, so we replace the output layer of the actor with a Gumbel-softmax parameterization; offline learning is enabled by setting the replay regularization coefficient to zero. Algorithm 1 details the EDM optimization procedure, with a buffer B of size , reinitialization frequency δ, and number of iterations , where s0 ∼ ρ0(s) is sampled uniformly. ... Algorithm 1: Input: SGLD hyperparameters , σ, PCD hyperparameters , , δ, and mini-batch size N |
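The shared policy architecture quoted in the Experiment Setup row (two hidden layers of 64 units each with ELU activation, producing a distribution over discrete actions) can be sketched as follows. This is an illustrative NumPy reconstruction, not the authors' code: the function names, weight-initialization scale, and softmax output head are assumptions.

```python
import numpy as np

def elu(x, alpha=1.0):
    # ELU activation, as used in the paper's hidden layers.
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def init_policy(state_dim, n_actions, hidden=64, seed=0):
    # Two hidden layers of 64 units each, matching the described setup;
    # the Gaussian init scale (0.1) is an arbitrary illustrative choice.
    rng = np.random.default_rng(seed)
    dims = [state_dim, hidden, hidden, n_actions]
    return [(rng.normal(0.0, 0.1, (d_in, d_out)), np.zeros(d_out))
            for d_in, d_out in zip(dims[:-1], dims[1:])]

def policy_probs(params, states):
    # Forward pass: ELU hidden layers, softmax over discrete actions.
    h = states
    for W, b in params[:-1]:
        h = elu(h @ W + b)
    W, b = params[-1]
    return softmax(h @ W + b)
```

For CartPole, for example, `state_dim=4` and `n_actions=2`; the Atari variant in the paper instead uses three convolutional layers with ReLU activation.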
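The setup row also notes that Algorithm 1 draws samples with SGLD (stochastic gradient Langevin dynamics), reusing the joint-EBM codebase's sampler. A minimal, hypothetical sketch of one SGLD chain on a generic energy function is below; the function name, step size, noise scale, and chain length are all assumed defaults, not values from the paper.

```python
import numpy as np

def sgld_sample(grad_energy, x0, step=0.01, sigma=0.01, n_steps=20, seed=0):
    # SGLD update: x <- x - (step/2) * dE/dx + sigma * Gaussian noise.
    # The noise term keeps the chain exploring rather than just descending.
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        x = x - 0.5 * step * grad_energy(x) + sigma * rng.normal(size=x.shape)
    return x
```

In the paper's PCD (persistent contrastive divergence) setup, chains would be warm-started from a replay buffer B and periodically reinitialized from the initial-state distribution, rather than restarted from scratch each call.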