Explore to Generalize in Zero-Shot RL
Authors: Ev Zisselman, Itai Lavie, Daniel Soudry, Aviv Tamar
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show that our approach is the state-of-the-art on tasks of the ProcGen challenge that have thus far eluded effective generalization, yielding a success rate of 83% on the Maze task and 74% on Heist with 200 training levels. Our experimental setup follows ProcGen's easy configuration, wherein agents are trained on 200 levels for 25M steps and subsequently tested on random levels [Cobbe et al., 2020]. (See the environment-setup sketch below the table.) |
| Researcher Affiliation | Academia | Ev Zisselman, Itai Lavie, Daniel Soudry, Aviv Tamar. Technion - Israel Institute of Technology. Correspondence e-mail: ev_zis@campus.technion.ac.il |
| Pseudocode | Yes | Algorithm 1: Explore to Generalize (ExpGen) |
| Open Source Code | Yes | Code available at https://github.com/EvZissel/expgen. |
| Open Datasets | Yes | A standard evaluation suite for ZSG-RL is the ProcGen benchmark [Cobbe et al., 2020], containing 16 games, each with levels that are procedurally generated to vary in visual properties... |
| Dataset Splits | No | The paper mentions training and testing phases but does not explicitly specify a separate 'validation' dataset split or its size/proportion. It focuses on 'train and test return scores'. |
| Hardware Specification | No | The paper discusses software architectures (like IMPALA) and training steps, but does not specify any particular hardware components such as GPU models, CPU types, or memory sizes. |
| Software Dependencies | No | All agents are implemented using the IMPALA convolutional architecture [Espeholt et al., 2018], and trained using PPO [Schulman et al., 2017] or IDAAC [Raileanu and Fergus, 2021]. For the maximum entropy agent π_H we incorporate a single GRU [Cho et al., 2014]... Throughout our experiments, we train our networks using the Adam optimizer [Kingma and Ba, 2014]. No specific version numbers are provided for these software components. (See the architecture sketch below the table.) |
| Experiment Setup | Yes | For all games, we use the same parameter α = 0.5 of the Geometric distribution and form an ensemble of 10 networks. For the PPO hyperparameters we use the values found in [Cobbe et al., 2020], as detailed in Table 6, which lists γ = .999, λ = .95, learning rate 5e-4, total timesteps 25M, etc. (See the hyperparameter sketch below the table.) |
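
For reference, here is a minimal sketch of the ProcGen "easy" protocol quoted in the Research Type row, assuming the `procgen` pip package and classic `gym`; the Maze task and the 200-level train / full-distribution test split follow the quotes above:

```python
import gym  # the "procgen:" prefix below tells gym to import the procgen package

# Train on a fixed set of 200 procedurally generated levels ("easy" mode)...
train_env = gym.make(
    "procgen:procgen-maze-v0",
    num_levels=200,            # fixed training set of 200 levels
    start_level=0,
    distribution_mode="easy",
)

# ...then test zero-shot on random levels from the full distribution.
test_env = gym.make(
    "procgen:procgen-maze-v0",
    num_levels=0,              # 0 = sample from the full level distribution
    start_level=0,
    distribution_mode="easy",
)
```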
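The Software Dependencies row names the components but gives no versions; the sketch below shows how they could fit together in PyTorch. It is a rough illustration, not the paper's implementation: the small conv stack stands in for the IMPALA encoder, and the layer sizes and class name are assumptions.

```python
import torch
import torch.nn as nn

class MaxEntAgent(nn.Module):
    """Sketch of the maximum-entropy agent pi_H: a conv encoder
    (stand-in for the IMPALA architecture) followed by a single GRU,
    with policy and value heads. Layer sizes are illustrative."""

    def __init__(self, n_actions: int, hidden: int = 256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2), nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(hidden), nn.ReLU(),
        )
        self.gru = nn.GRU(hidden, hidden, batch_first=True)
        self.policy_head = nn.Linear(hidden, n_actions)
        self.value_head = nn.Linear(hidden, 1)

    def forward(self, obs_seq, h0=None):
        # obs_seq: (batch, time, channels, height, width)
        b, t = obs_seq.shape[:2]
        z = self.encoder(obs_seq.flatten(0, 1)).view(b, t, -1)
        z, h = self.gru(z, h0)
        return self.policy_head(z), self.value_head(z), h

agent = MaxEntAgent(n_actions=15)      # ProcGen uses 15 discrete actions
agent(torch.zeros(1, 1, 3, 64, 64))    # dummy pass materializes the LazyLinear
optimizer = torch.optim.Adam(agent.parameters(), lr=5e-4)  # Adam, as in the paper
```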
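And the Experiment Setup row rendered as a plain config dict: the values are the ones quoted from Table 6 and the row above, while the key names are illustrative.

```python
# PPO hyperparameters quoted from Table 6 (key names are assumptions).
ppo_config = {
    "gamma": 0.999,                  # discount factor γ
    "gae_lambda": 0.95,              # GAE parameter λ
    "learning_rate": 5e-4,           # Adam step size
    "total_timesteps": 25_000_000,   # 25M environment steps
}

# ExpGen-specific settings quoted in the same row.
expgen_config = {
    "geometric_alpha": 0.5,  # parameter α of the Geometric distribution
    "ensemble_size": 10,     # networks in the ensemble
}
```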