Impossibly Good Experts and How to Follow Them
Authors: Aaron Walsman, Muru Zhang, Sanjiban Choudhury, Dieter Fox, Ali Farhadi
ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show that this algorithm performs better than a variety of strong baselines on a challenging suite of Minigrid and Vizdoom environments. In order to evaluate ELF Distill, we compare it against the baselines in Section 5 on several challenging Minigrid (Chevalier-Boisvert et al., 2018) and Vizdoom (Wydmuch et al., 2019) environments with partial information. |
| Researcher Affiliation | Collaboration | Aaron Walsman¹, Muru Zhang¹, Sanjiban Choudhury², Ali Farhadi¹, Dieter Fox¹,³. ¹Computer Science and Engineering, University of Washington; ²Computer Science, Cornell University; ³NVIDIA |
| Pseudocode | Yes | Pseudocode is shown in Algorithm 1. |
| Open Source Code | Yes | Code for these experiments can be found at https://github.com/aaronwalsman/impossibly-good/. |
| Open Datasets | Yes | In order to study these problems, we have constructed a suite of Minigrid (Chevalier-Boisvert et al., 2018) and Vizdoom (Wydmuch et al., 2019) environments that clearly demonstrate the challenges of learning from impossibly good experts. |
| Dataset Splits | No | The paper describes training frames and test time evaluation, but it does not specify explicit training/validation/test dataset splits or mention a separate validation set. |
| Hardware Specification | No | The paper does not provide any specific details regarding the hardware used to run the experiments (e.g., CPU, GPU models, or memory). |
| Software Dependencies | No | The paper mentions using PPO but does not provide specific version numbers for any software dependencies or libraries used in the experiments. |
| Experiment Setup | Yes | Each method was trained on 2^20 (≈1M) frames. When training ELF Distill, both the follower and explorer were trained on 2^19 (≈500K) frames each, so that the total number of frames observed during training was equal to the other methods. All methods were trained with 10 different random seeds in each environment. PPO (Schulman et al., 2017) was used as the loss function to maximize reward in all algorithms with a bri term. All Minigrid methods train the same small model that takes a 3x3 grid with two channels representing object type (wall, door, etc.) and a single color. |
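
The training budget in the Experiment Setup row can be summarized with a short sketch. This is an illustrative outline only, not the authors' released code: the method list, environment name, and `train_method` helper are hypothetical placeholders, while the frame and seed counts mirror the reported setup (2^20 frames per method, split into 2^19 + 2^19 for ELF Distill's explorer and follower phases, 10 seeds per environment).

```python
# Illustrative sketch of the reported training budget; not the authors' implementation.

TOTAL_FRAMES = 2 ** 20      # ~1M frames per baseline method
ELF_PHASE_FRAMES = 2 ** 19  # ~500K frames each for ELF Distill's explorer and follower
NUM_SEEDS = 10              # random seeds per environment

# The two ELF Distill phases together match the single-phase budget of the baselines.
assert 2 * ELF_PHASE_FRAMES == TOTAL_FRAMES


def train_method(method, env_name, frames, seed):
    """Hypothetical stand-in for one PPO training run; the real loop lives in the released repo."""
    print(f"train {method} on {env_name}: {frames} frames, seed {seed}")


for seed in range(NUM_SEEDS):
    for method in ["ppo_baseline", "elf_distill"]:  # method names are placeholders
        if method == "elf_distill":
            # Explorer and follower are trained in sequence, 2^19 frames each.
            train_method("elf_explorer", "MiniGrid-Example-v0", ELF_PHASE_FRAMES, seed)
            train_method("elf_follower", "MiniGrid-Example-v0", ELF_PHASE_FRAMES, seed)
        else:
            train_method(method, "MiniGrid-Example-v0", TOTAL_FRAMES, seed)
```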