Few-Shot Bayesian Imitation Learning with Logical Program Policies

Authors: Tom Silver, Kelsey R. Allen, Alex K. Lew, Leslie Pack Kaelbling, Josh Tenenbaum

AAAI 2020, pp. 10251-10258

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In experiments, we study six strategy games played on a 2D grid with one shared DSL. After a few demonstrations of each game, the inferred policies generalize to new game instances that differ substantially from the demonstrations. Our policy learning is 20-1,000x more data efficient than convolutional and fully convolutional policy learning and many orders of magnitude more computationally efficient than vanilla program induction. We argue that the proposed method is an apt choice for tasks that have scarce training data and feature significant, structured variation between task instances.
Researcher Affiliation | Academia | Tom Silver, Kelsey R. Allen, Alex K. Lew, Leslie Kaelbling, Josh Tenenbaum, Massachusetts Institute of Technology, {tslvr, krallen, alexlew, lpk, jbt}@mit.edu
Pseudocode | Yes | Algorithm 1: LPP imitation learning
input: Demos D, ensemble size K, max iters L
Create anti-demos D' = {(s, a') : (s, a) ∈ D, a' ≠ a};
Set labels y[(s, a)] = 1 if (s, a) ∈ D else 0;
Initialize approximate posterior q;
for i in 1, ..., L do
    f_i = generate_next_feature();
    X = {(f_1(s, a), ..., f_i(s, a))^T : (s, a) ∈ D ∪ D'};
    μ_i, w_i = logical_inference(X, y, p(f), K);
    update_posterior(q, μ_i, w_i);
end
return q;
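A minimal Python sketch of this loop is given below, assuming the subroutines named in the pseudocode (feature enumeration, decision-tree-based logical inference, and the posterior update) are supplied as callables; these names are placeholders mirroring the pseudocode, not the paper's actual implementation.

def lpp_imitation_learning(demos, actions, feature_prior, K, max_iters,
                           generate_next_feature, logical_inference,
                           update_posterior, q):
    """Sketch of Algorithm 1. `demos` is a list of (state, action) pairs,
    `actions` is the shared action set, `feature_prior` is p(f), K is the
    ensemble size, and `q` is an initialized approximate posterior. The
    last four arguments stand in for the paper's subroutines."""
    # Anti-demonstrations: in each demonstrated state, every action the
    # demonstrator did not take becomes a negative example.
    anti_demos = [(s, a2) for (s, a) in demos for a2 in actions if a2 != a]
    data = demos + anti_demos
    labels = [1] * len(demos) + [0] * len(anti_demos)

    features = []  # f_1, ..., f_i enumerated so far
    for _ in range(max_iters):
        features.append(generate_next_feature())
        # One row per (state, action) pair, one column per enumerated feature.
        X = [[f(s, a) for f in features] for (s, a) in data]
        # Ensemble of K candidate logical formulas (mu) with weights (w).
        mus, ws = logical_inference(X, labels, feature_prior, K)
        update_posterior(q, mus, ws)
    return q

Passing the subroutines as arguments keeps the sketch self-contained while leaving their implementations unspecified, as in the paper's pseudocode.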
Open Source Code | No | The paper does not provide any concrete access to source code for the described methodology. It does not mention a repository link or explicitly state that the code is being released.
Open Datasets | No | The paper describes using 'six strategy games' and states 'Instances of Nim, Checkmate Tactic, and Reach for the Star are procedurally generated; instances of Stop the Fall, Chase, and Fence In are manually generated'. However, it does not provide any links, DOIs, or citations to publicly available datasets or repositories for these game instances.
Dataset Splits | Yes | For each number of demonstrations, we run leave-one-out cross validation: 10 trials, each featuring a distinct set of demonstrations drawn from the overall pool of 11 training demonstrations.
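As a rough illustration of that split scheme, the sketch below draws 10 distinct demonstration subsets of a given size from an 11-demonstration pool; whether the subsets must be disjoint or merely distinct is not stated in the quote, so treating them as merely distinct is an assumption.

import random

def make_trials(demo_pool, n_demos, n_trials=10, seed=0):
    """Draw `n_trials` distinct subsets of size `n_demos` from `demo_pool`
    (11 demonstrations in the paper). Sampling details are assumptions."""
    rng = random.Random(seed)
    seen, trials = set(), []
    while len(trials) < n_trials:
        idx = tuple(sorted(rng.sample(range(len(demo_pool)), n_demos)))
        if idx not in seen:  # keep only subsets not drawn before
            seen.add(idx)
            trials.append([demo_pool[i] for i in idx])
    return trials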
Hardware Specification | Yes | All experiments were performed on a single laptop running macOS Mojave with a 2.9 GHz Intel Core i9 processor and 32 GB of memory.
Software Dependencies | No | The paper mentions using 'an off-the-shelf stochastic greedy decision-tree learner (Pedregosa et al. 2011)', which refers to scikit-learn, but it does not specify version numbers for scikit-learn or any other software libraries or programming languages used.
Experiment Setup | Yes | LPP learning is run for 10,000 iterations for each task. The network has 8 convolutional layers with kernel size 3, stride 1, padding 1, 4 channels (8 in the input layer), and ReLU nonlinearities. The architecture is: 64-channel convolution; max pooling; 64-channel fully-connected layer; |A|-channel fully-connected layer. All kernels have size 3 and all strides and paddings are 1.
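For concreteness, here is a PyTorch sketch of the second architecture quoted above (64-channel convolution, max pooling, 64-unit fully-connected layer, |A|-unit output layer, all kernels of size 3 with stride and padding 1). The framework, input channel count, and grid size are assumptions; the quoted setup does not state them.

import torch
import torch.nn as nn

class CNNPolicyBaseline(nn.Module):
    """Sketch of the CNN baseline: conv(64) -> max pool -> FC(64) -> FC(|A|).
    `in_channels` and `grid_size` are assumed hyperparameters."""
    def __init__(self, in_channels, grid_size, num_actions):
        super().__init__()
        # Kernel size 3 with stride 1 and padding 1 preserves spatial size.
        self.conv = nn.Conv2d(in_channels, 64, kernel_size=3, stride=1, padding=1)
        self.pool = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)
        self.fc1 = nn.Linear(64 * grid_size * grid_size, 64)
        self.fc2 = nn.Linear(64, num_actions)

    def forward(self, x):
        x = torch.relu(self.conv(x))
        x = self.pool(x)
        x = torch.relu(self.fc1(x.flatten(start_dim=1)))
        return self.fc2(x)  # one logit per action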