The MAGICAL Benchmark for Robust Imitation

Authors: Sam Toyer, Rohin Shah, Andrew Critch, Stuart Russell

NeurIPS 2020

Reproducibility Variable | Result | LLM Response

Research Type | Experimental
    Our experiments in Section 4 demonstrate the brittleness of standard IL algorithms, particularly under large shifts in object position or colour.

Researcher Affiliation | Academia
    Sam Toyer, Rohin Shah, Andrew Critch, Stuart Russell. Department of Electrical Engineering and Computer Sciences, University of California, Berkeley. {sdt,rohinmshah,critch,russell}@berkeley.edu

Pseudocode | No
    The paper describes algorithms such as Behavioral Cloning (BC) and Generative Adversarial IL (GAIL) textually and with mathematical equations, but it does not include any structured pseudocode or algorithm blocks.

Open Source Code | Yes
    Code and data for the MAGICAL suite are available at https://github.com/qxcv/magical/. The IL algorithm implementations used to generate these results are available on GitHub,[1] as is the MAGICAL benchmark suite and all demonstration data.[2]
    [1] Multi-task imitation learning algorithms: https://github.com/qxcv/mtil/
    [2] Benchmark suite and links to data: https://github.com/qxcv/magical/

Open Datasets | Yes
    Code and data for the MAGICAL suite are available at https://github.com/qxcv/magical/. The IL algorithm implementations used to generate these results are available on GitHub, as is the MAGICAL benchmark suite and all demonstration data.

Dataset Splits | No
    The paper states that "in each run, the training dataset for each task consisted of 10 trajectories from the demo variant" and describes "test variants" for evaluation, but it does not specify a separate validation split or its characteristics.

Hardware Specification | No
    The paper does not provide specific hardware details such as GPU models, CPU types, or memory used for the experiments.

Software Dependencies | No
    The paper mentions using the rlpyt framework [36] and a Gym environment [8], but it does not give version numbers for general software dependencies such as Python, PyTorch, or CUDA.

Experiment Setup | Yes
    Observations were preprocessed by stacking four temporally adjacent RGB frames and resizing them to 96×96 pixels. For multi-task experiments, task-specific weights were used for the final fully-connected layer of each policy/value/discriminator network, but the weights of all preceding layers were shared. The BC policy and GAIL discriminator both used translation, rotation, colour jitter, and Gaussian noise augmentations by default. The GAIL policy and value function did not use augmented data, which we found made training unstable. Complete hyperparameters and data-collection details are listed in Appendix B.
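The frame-stacking and resizing step described in the Experiment Setup row can be sketched in plain NumPy. This is a minimal illustration, not the authors' implementation: the paper does not specify the interpolation method or the stacking axis, so this sketch assumes nearest-neighbour resizing and channel-wise concatenation (four RGB frames → 12 channels), both common conventions in pixel-based RL.

```python
import numpy as np

def resize_nearest(frame, size=96):
    """Nearest-neighbour resize of an (H, W, 3) uint8 frame to (size, size, 3).

    Interpolation method is an assumption; the paper only says frames were
    resized to 96x96 pixels.
    """
    h, w, _ = frame.shape
    rows = np.arange(size) * h // size  # source row index for each output row
    cols = np.arange(size) * w // size  # source column index for each output column
    return frame[rows][:, cols]

def preprocess(frames, size=96):
    """Stack four temporally adjacent RGB frames and resize them to 96x96.

    Returns a (size, size, 12) uint8 array; concatenating along the channel
    axis is an assumed convention, not stated in the paper.
    """
    assert len(frames) == 4, "expected exactly four temporally adjacent frames"
    resized = [resize_nearest(f, size) for f in frames]
    return np.concatenate(resized, axis=-1)

# Example with four dummy 192x256 RGB frames:
frames = [np.random.randint(0, 256, (192, 256, 3), dtype=np.uint8)
          for _ in range(4)]
obs = preprocess(frames)
print(obs.shape)  # (96, 96, 12)
```

In practice the same stacking could feed a shared convolutional trunk with task-specific output heads, matching the weight-sharing scheme the row describes.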