Impossibly Good Experts and How to Follow Them

Authors: Aaron Walsman, Muru Zhang, Sanjiban Choudhury, Dieter Fox, Ali Farhadi

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show that this algorithm performs better than a variety of strong baselines on a challenging suite of Minigrid and Vizdoom environments. In order to evaluate ELF Distill, we compare it against the baselines in Section 5 on several challenging Minigrid (Chevalier-Boisvert et al., 2018) and Vizdoom (Wydmuch et al., 2019) environments with partial information.
Researcher Affiliation | Collaboration | Aaron Walsman (1), Muru Zhang (1), Sanjiban Choudhury (2), Ali Farhadi (1), Dieter Fox (1, 3); (1) Computer Science and Engineering, University of Washington; (2) Computer Science, Cornell University; (3) NVIDIA
Pseudocode | Yes | Pseudocode is shown in Algorithm 1.
Open Source Code | Yes | Code for these experiments can be found at https://github.com/aaronwalsman/impossibly-good/.
Open Datasets | Yes | In order to study these problems, we have constructed a suite of Minigrid (Chevalier-Boisvert et al., 2018) and Vizdoom (Wydmuch et al., 2019) environments that clearly demonstrate the challenges of learning from impossibly good experts.
Dataset Splits | No | The paper describes training frames and test time evaluation, but it does not specify explicit training/validation/test dataset splits or mention a separate validation set.
Hardware Specification | No | The paper does not provide any specific details regarding the hardware used to run the experiments (e.g., CPU, GPU models, or memory).
Software Dependencies | No | The paper mentions using PPO but does not provide specific version numbers for any software dependencies or libraries used in the experiments.
Experiment Setup | Yes | Each method was trained on 2^20 (~1M) frames. When training ELF Distill, both the follower and explorer were trained on 2^19 (~500K) frames, so that the total frames observed during training was equal to the other methods. All methods were trained with 10 different random seeds in each environment. PPO (Schulman et al., 2017) was used as the loss function to maximize reward in all algorithms with an RL term. All Minigrid methods train the same small model that takes a 3x3 grid with two channels representing object type (wall, door, etc.) and a single color.
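
The Open Datasets row above refers to a suite of partially observed Minigrid and Vizdoom environments. As a minimal sketch of how such a Minigrid environment is instantiated, the snippet below uses a standard task from the public gym-minigrid package and its original gym-style API as a stand-in; the paper's custom environments live in its own repository and are not assumed here.

```python
import gym
import gym_minigrid.envs  # importing this module registers the MiniGrid-* environments

# Stand-in task from the public registry; the paper's custom partial-information
# environments are defined in https://github.com/aaronwalsman/impossibly-good/.
env = gym.make("MiniGrid-Empty-8x8-v0")

obs = env.reset()
done = False
while not done:
    # Random policy, only to show the interaction loop and observation format.
    obs, reward, done, info = env.step(env.action_space.sample())

# obs["image"] is the agent-centric partial view (a small grid with object type,
# color, and state channels); obs["mission"] is the goal described as text.
```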
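
The Experiment Setup row reports a matched frame budget: baseline methods observe 2^20 frames, while ELF Distill splits the same budget between its explorer and follower phases, with 10 random seeds per environment. The sketch below only illustrates that bookkeeping; the variable and function names are hypothetical and are not taken from the paper's released code.

```python
# Frame budgets as reported in the Experiment Setup row (hypothetical names).
TOTAL_FRAMES = 2 ** 20            # ~1.05M frames per baseline method
ELF_PHASE_FRAMES = 2 ** 19        # ~524K frames for each ELF Distill phase
NUM_SEEDS = 10                    # random seeds per environment

# ELF Distill trains an explorer and then a follower, each on half the budget,
# so its total observed frames match the single-phase baselines.
assert 2 * ELF_PHASE_FRAMES == TOTAL_FRAMES

def frame_schedule(method: str) -> list[tuple[str, int]]:
    """Return (phase, frame budget) pairs for one training run of `method`."""
    if method == "elf_distill":
        return [("explorer", ELF_PHASE_FRAMES), ("follower", ELF_PHASE_FRAMES)]
    return [("policy", TOTAL_FRAMES)]  # baselines: one PPO run on the full budget

for seed in range(NUM_SEEDS):
    for method in ("elf_distill", "ppo_baseline"):
        for phase, frames in frame_schedule(method):
            # A real run would invoke a PPO trainer here with this seed and budget.
            print(f"seed={seed} method={method} phase={phase} frames={frames}")
```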