Explore to Generalize in Zero-Shot RL
Authors: Ev Zisselman, Itai Lavie, Daniel Soudry, Aviv Tamar
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show that our approach is the state-of-the-art on tasks of the ProcGen challenge that have thus far eluded effective generalization, yielding a success rate of 83% on the Maze task and 74% on Heist with 200 training levels. Our experimental setup follows ProcGen's easy configuration, wherein agents are trained on 200 levels for 25M steps and subsequently tested on random levels [Cobbe et al., 2020]. (See the environment-setup sketch below the table.) |
| Researcher Affiliation | Academia | Ev Zisselman, Itai Lavie, Daniel Soudry, Aviv Tamar. Technion - Israel Institute of Technology. Correspondence e-mail: ev_zis@campus.technion.ac.il |
| Pseudocode | Yes | Algorithm 1: Explore to Generalize (ExpGen) |
| Open Source Code | Yes | Code available at https://github.com/EvZissel/expgen. |
| Open Datasets | Yes | A standard evaluation suite for ZSG-RL is the ProcGen benchmark [Cobbe et al., 2020], containing 16 games, each with levels that are procedurally generated to vary in visual properties... |
| Dataset Splits | No | The paper mentions training and testing phases but does not explicitly specify a separate 'validation' dataset split or its size/proportion. It focuses on 'train and test return scores'. |
| Hardware Specification | No | The paper discusses software architectures (like IMPALA) and training steps, but does not specify any particular hardware components such as GPU models, CPU types, or memory sizes. |
| Software Dependencies | No | All agents are implemented using the IMPALA convolutional architecture [Espeholt et al., 2018], and trained using PPO [Schulman et al., 2017] or IDAAC [Raileanu and Fergus, 2021]. For the maximum entropy agent π_H we incorporate a single GRU [Cho et al., 2014]... Throughout our experiments, we train our networks using the Adam optimizer [Kingma and Ba, 2014]. No specific version numbers are provided for these software components. (See the architecture sketch below the table.) |
| Experiment Setup | Yes | For all games, we use the same parameter α = 0.5 of the Geometric distribution and form an ensemble of 10 networks. For the PPO hyperparameters we use the values found in [Cobbe et al., 2020], as detailed in Table 6, which lists γ = .999, λ = .95, learning rate 5e-4, total timesteps 25M, etc. (See the hyperparameter sketch below the table.) |
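
For reference, here is a minimal sketch of the ProcGen "easy" protocol quoted in the Research Type row, assuming the `procgen` pip package and classic `gym`; the Maze task and the 200-level train / full-distribution test split follow the quotes above:

```python
import gym  # the "procgen:" prefix below tells gym to import the procgen package

# Train on a fixed set of 200 procedurally generated levels ("easy" mode)...
train_env = gym.make(
    "procgen:procgen-maze-v0",
    num_levels=200,            # fixed training set of 200 levels
    start_level=0,
    distribution_mode="easy",
)

# ...then test zero-shot on random levels from the full distribution.
test_env = gym.make(
    "procgen:procgen-maze-v0",
    num_levels=0,              # 0 = sample from the full level distribution
    start_level=0,
    distribution_mode="easy",
)
```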
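The Software Dependencies row names the components but gives no versions; the sketch below shows how they could fit together in PyTorch. It is a rough illustration, not the paper's implementation: the small conv stack stands in for the IMPALA encoder, and the layer sizes and class name are assumptions.

```python
import torch
import torch.nn as nn

class MaxEntAgent(nn.Module):
    """Sketch of the maximum-entropy agent pi_H: a conv encoder
    (stand-in for the IMPALA architecture) followed by a single GRU,
    with policy and value heads. Layer sizes are illustrative."""

    def __init__(self, n_actions: int, hidden: int = 256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2), nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(hidden), nn.ReLU(),
        )
        self.gru = nn.GRU(hidden, hidden, batch_first=True)
        self.policy_head = nn.Linear(hidden, n_actions)
        self.value_head = nn.Linear(hidden, 1)

    def forward(self, obs_seq, h0=None):
        # obs_seq: (batch, time, channels, height, width)
        b, t = obs_seq.shape[:2]
        z = self.encoder(obs_seq.flatten(0, 1)).view(b, t, -1)
        z, h = self.gru(z, h0)
        return self.policy_head(z), self.value_head(z), h

agent = MaxEntAgent(n_actions=15)      # ProcGen uses 15 discrete actions
agent(torch.zeros(1, 1, 3, 64, 64))    # dummy pass materializes the LazyLinear
optimizer = torch.optim.Adam(agent.parameters(), lr=5e-4)  # Adam, as in the paper
```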
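And the Experiment Setup row rendered as a plain config dict: the values are the ones quoted from Table 6 and the row above, while the key names are illustrative.

```python
# PPO hyperparameters quoted from Table 6 (key names are assumptions).
ppo_config = {
    "gamma": 0.999,                  # discount factor γ
    "gae_lambda": 0.95,              # GAE parameter λ
    "learning_rate": 5e-4,           # Adam step size
    "total_timesteps": 25_000_000,   # 25M environment steps
}

# ExpGen-specific settings quoted in the same row.
expgen_config = {
    "geometric_alpha": 0.5,  # parameter α of the Geometric distribution
    "ensemble_size": 10,     # networks in the ensemble
}
```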