Explore to Generalize in Zero-Shot RL

Authors: Ev Zisselman, Itai Lavie, Daniel Soudry, Aviv Tamar

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We show that our approach is state-of-the-art on tasks of the ProcGen challenge that have thus far eluded effective generalization, yielding a success rate of 83% on the Maze task and 74% on Heist with 200 training levels. Our experimental setup follows ProcGen's easy configuration, wherein agents are trained on 200 levels for 25M steps and subsequently tested on random levels [Cobbe et al., 2020]."
Researcher Affiliation | Academia | Ev Zisselman, Itai Lavie, Daniel Soudry, Aviv Tamar; Technion - Israel Institute of Technology. Correspondence e-mail: ev_zis@campus.technion.ac.il
Pseudocode | Yes | Algorithm 1: Explore to Generalize (ExpGen)
Open Source Code | Yes | Code available at https://github.com/EvZissel/expgen.
Open Datasets | Yes | "A standard evaluation suite for ZSG-RL is the ProcGen benchmark [Cobbe et al., 2020], containing 16 games, each with levels that are procedurally generated to vary in visual properties..."
Dataset Splits | No | The paper describes training and testing phases but does not specify a separate validation split or its size/proportion; it reports only train and test return scores.
Hardware Specification | No | The paper discusses software architectures (e.g., IMPALA) and training steps, but does not specify hardware such as GPU models, CPU types, or memory sizes.
Software Dependencies | No | "All agents are implemented using the IMPALA convolutional architecture [Espeholt et al., 2018], and trained using PPO [Schulman et al., 2017] or IDAAC [Raileanu and Fergus, 2021]. For the maximum entropy agent πH we incorporate a single GRU [Cho et al., 2014]... Throughout our experiments, we train our networks using the Adam optimizer [Kingma and Ba, 2014]." No version numbers are provided for these components.
Experiment Setup | Yes | "For all games, we use the same parameter α = 0.5 of the Geometric distribution and form an ensemble of 10 networks. For the PPO hyperparameters we use the hyperparameters found in [Cobbe et al., 2020] as detailed in Table 6." Table 6 lists parameters such as γ = .999, λ = .95, learning rate 5e-4, and total timesteps 25M.
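The pseudocode row refers to Algorithm 1 (ExpGen). A minimal sketch of its test-time decision rule, assuming the paper's high-level description (act greedily when the policy ensemble agrees on an action, otherwise hand control to the maximum-entropy exploration policy); the helper names and the majority-vote agreement measure here are illustrative, not the paper's exact formulation:

```python
import numpy as np

def expgen_act(obs, ensemble_policies, entropy_policy, agree_threshold=1.0):
    """Illustrative ExpGen-style action selection.

    ensemble_policies: list of callables obs -> action logits.
    entropy_policy: maximum-entropy exploration policy, obs -> action logits.
    Returns (action, mode) where mode is "exploit" or "explore".
    """
    # Greedy action of each ensemble member.
    actions = [int(np.argmax(pi(obs))) for pi in ensemble_policies]
    counts = np.bincount(actions)
    # Fraction of ensemble members voting for the modal action.
    agreement = counts.max() / len(actions)
    if agreement >= agree_threshold:
        return int(counts.argmax()), "exploit"
    # Disagreement signals an unfamiliar state: defer to the exploration policy.
    return int(np.argmax(entropy_policy(obs))), "explore"
```

With `agree_threshold=1.0` the agent exploits only under unanimous agreement, which matches the intuition that ensemble disagreement flags out-of-distribution states.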
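The datasets row cites the ProcGen benchmark, and the experimental protocol above (200 training levels, "easy" distribution mode) maps directly onto ProcGen's gym environment keyword arguments. A sketch, assuming the `procgen` package's gym registration; the helper name is hypothetical:

```python
def procgen_easy_kwargs(env_name, train=True):
    """Keyword arguments for the 200-level 'easy' ProcGen protocol.

    env_name: e.g. "maze" or "heist".
    At test time, num_levels=0 means levels are sampled from the
    full (unbounded) procedurally generated distribution.
    """
    return {
        "id": f"procgen:procgen-{env_name}-v0",
        "num_levels": 200 if train else 0,
        "start_level": 0,
        "distribution_mode": "easy",
    }

# Usage (requires the `procgen` package to be installed):
# import gym
# env = gym.make(**procgen_easy_kwargs("maze"))
</imports>
```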
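The experiment-setup row can be collected into a single configuration sketch. The values are the ones quoted above (Table 6 plus the ExpGen-specific parameters); the key names are illustrative, and the sampler shows one way to draw an exploration-phase length from Geometric(α):

```python
import random

# Hyperparameters quoted in the paper (ProcGen "easy" PPO settings from
# Cobbe et al., 2020, plus ExpGen-specific ones); key names are illustrative.
EXPGEN_CONFIG = {
    "gamma": 0.999,            # discount factor
    "gae_lambda": 0.95,        # GAE lambda
    "learning_rate": 5e-4,     # Adam step size
    "total_timesteps": 25_000_000,
    "ensemble_size": 10,       # number of policy networks in the ensemble
    "geometric_alpha": 0.5,    # parameter of the Geometric distribution
}

def sample_explore_steps(alpha=0.5, rng=random):
    """Draw T ~ Geometric(alpha): number of exploration steps before
    the agent re-checks ensemble agreement (support starts at 1)."""
    t = 1
    while rng.random() > alpha:
        t += 1
    return t
```

With α = 0.5, exploration phases last 2 steps in expectation (mean of a Geometric(α) variable is 1/α).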