The MAGICAL Benchmark for Robust Imitation

Authors: Sam Toyer, Rohin Shah, Andrew Critch, Stuart Russell

NeurIPS 2020

Reproducibility Variable | Result | LLM Response

Research Type | Experimental
    Our experiments in Section 4 demonstrate the brittleness of standard IL algorithms, particularly under large shifts in object position or colour.

Researcher Affiliation | Academia
    Sam Toyer, Rohin Shah, Andrew Critch, Stuart Russell. Department of Electrical Engineering and Computer Sciences, University of California, Berkeley. {sdt,rohinmshah,critch,russell}@berkeley.edu

Pseudocode | No
    The paper describes algorithms such as Behavioral Cloning (BC) and Generative Adversarial IL (GAIL) textually and with mathematical equations, but it does not include any structured pseudocode or algorithm blocks.

Open Source Code | Yes
    Code and data for the MAGICAL suite are available at https://github.com/qxcv/magical/. The IL algorithm implementations used to generate these results are available on GitHub,[1] as is the MAGICAL benchmark suite and all demonstration data.[2]
    [1] Multi-task imitation learning algorithms: https://github.com/qxcv/mtil/
    [2] Benchmark suite and links to data: https://github.com/qxcv/magical/

Open Datasets | Yes
    Code and data for the MAGICAL suite are available at https://github.com/qxcv/magical/. The IL algorithm implementations used to generate these results are available on GitHub, as is the MAGICAL benchmark suite and all demonstration data.

Dataset Splits | No
    The paper states that "in each run, the training dataset for each task consisted of 10 trajectories from the demo variant" and describes "test variants" for evaluation, but it does not specify a separate validation split or its characteristics.

Hardware Specification | No
    The paper does not provide specific hardware details such as GPU models, CPU types, or memory used for the experiments.

Software Dependencies | No
    The paper mentions using the rlpyt framework [36] and a Gym environment [8], but it does not give version numbers for general software dependencies such as Python, PyTorch, or CUDA.

Experiment Setup | Yes
    Observations were preprocessed by stacking four temporally adjacent RGB frames and resizing them to 96×96 pixels. For multi-task experiments, task-specific weights were used for the final fully-connected layer of each policy/value/discriminator network, but the weights of all preceding layers were shared. The BC policy and GAIL discriminator both used translation, rotation, colour jitter, and Gaussian noise augmentations by default. The GAIL policy and value function did not use augmented data, which we found made training unstable. Complete hyperparameters and data-collection details are listed in Appendix B.
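The frame-stacking and resizing step described in the Experiment Setup row can be sketched in plain NumPy. This is a minimal illustration, not the authors' implementation: the paper does not specify the interpolation method or the stacking axis, so this sketch assumes nearest-neighbour resizing and channel-wise concatenation (four RGB frames → 12 channels), both common conventions in pixel-based RL.

```python
import numpy as np

def resize_nearest(frame, size=96):
    """Nearest-neighbour resize of an (H, W, 3) uint8 frame to (size, size, 3).

    Interpolation method is an assumption; the paper only says frames were
    resized to 96x96 pixels.
    """
    h, w, _ = frame.shape
    rows = np.arange(size) * h // size  # source row index for each output row
    cols = np.arange(size) * w // size  # source column index for each output column
    return frame[rows][:, cols]

def preprocess(frames, size=96):
    """Stack four temporally adjacent RGB frames and resize them to 96x96.

    Returns a (size, size, 12) uint8 array; concatenating along the channel
    axis is an assumed convention, not stated in the paper.
    """
    assert len(frames) == 4, "expected exactly four temporally adjacent frames"
    resized = [resize_nearest(f, size) for f in frames]
    return np.concatenate(resized, axis=-1)

# Example with four dummy 192x256 RGB frames:
frames = [np.random.randint(0, 256, (192, 256, 3), dtype=np.uint8)
          for _ in range(4)]
obs = preprocess(frames)
print(obs.shape)  # (96, 96, 12)
```

In practice the same stacking could feed a shared convolutional trunk with task-specific output heads, matching the weight-sharing scheme the row describes.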