Environment Generation for Zero-Shot Compositional Reinforcement Learning

Authors: Izzeddin Gur, Natasha Jaques, Yingjie Miao, Jongwook Choi, Manoj Tiwari, Honglak Lee, Aleksandra Faust

NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our results assess learning and generalization across multiple compositional tasks, including the real-world problem of learning to navigate and interact with web pages. We learn to generate environments composed of multiple pages or rooms, and train RL agents capable of completing a wide range of complex tasks in those environments. We contribute two new benchmark frameworks for generating compositional tasks, compositional MiniGrid and gMiniWoB for web navigation. CoDE yields a 4x higher success rate than the strongest baseline, and demonstrates strong performance on real websites learned from 3500 primitive tasks.
Researcher Affiliation | Collaboration | Google Research, Brain Team, Mountain View, California, 94043 {izzeddin, natashajaques, yingjiemiao, mjtiwari, sandrafaust}@google.com; {jwook, honglak}@umich.edu
Pseudocode | Yes | Algorithm 1: CoDE: Joint training of the generator and learner agents.
Open Source Code | Yes | The implementation of CoDE and the gMiniWoB framework are available in open source at https://github.com/google-research/google-research/tree/master/compositional_rl.
Open Datasets | Yes | We contribute two new benchmark frameworks for generating compositional tasks, compositional MiniGrid and gMiniWoB for web navigation... The implementation of CoDE and the gMiniWoB framework are available in open source at https://github.com/google-research/google-research/tree/master/compositional_rl. cMiniGrid is a compositional extension of MiniGrid navigation environments [6]... Web navigation evaluation: We evaluate our models on MiniWoB [36]
Dataset Splits | No | The paper mentions training and testing on "unseen" environments, but does not specify explicit training/validation/test dataset split percentages, sample counts, or a detailed splitting methodology for reproducibility beyond implying a test set.
Hardware Specification | No | The paper states "The training is done on a single CPU, requiring about a week of training." but does not provide specific CPU model numbers, memory, or other detailed hardware specifications for reproducibility.
Software Dependencies | No | The paper states "CoDE is implemented using ACME [18] with TensorFlow [2] open-source libraries." but does not provide specific version numbers for these software dependencies.
Experiment Setup | Yes | Both the generator and learner agents are trained using RL, specifically A2C [26] with entropy regularization. For every training step, the generator constructs an environment E, and the agents p ∈ P collect M trajectories within E. The learner agents are trained using the standard task-related reward. To train the generator, we use the following multi-objective loss function that encourages the adversary to control the complexity of the environment by presenting "just-the-right challenge" for the agents in P, where α is a hyperparameter: J(E, P) = (1 − α) J_POPREGRET(E, P) + α J_DIFFICULTY(E, P) (4). The results reported are averaged over 5 seeds. We compute the state-value by using the marginal distribution of elements as attention weights over element encodings and passing the context vector through a feed-forward network.
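
To make Eq. (4) concrete, below is a minimal sketch of how the generator's multi-objective signal could be computed from learner rollout returns. This is not the authors' implementation (the released code linked above is built on ACME and TensorFlow); the helper names (population_regret, difficulty_objective), the regret estimator, and the toy difficulty term are illustrative assumptions. Only the convex combination (1 − α) · J_POPREGRET + α · J_DIFFICULTY follows directly from Eq. (4).

```python
# Illustrative sketch (NOT the paper's implementation) of the CoDE generator
# objective in Eq. (4): J(E, P) = (1 - alpha) * J_POPREGRET + alpha * J_DIFFICULTY.
# The specific forms of the regret and difficulty terms below are assumptions.

import numpy as np


def population_regret(returns_per_agent: np.ndarray) -> float:
    """Assumed population-regret estimate: gap between the best agent's mean
    return and the population's mean return in the generated environment E.

    returns_per_agent: shape (num_agents, num_trajectories) of episode returns
    from the M trajectories each agent p in P collects in E.
    """
    mean_returns = returns_per_agent.mean(axis=1)  # per-agent average return
    return float(mean_returns.max() - mean_returns.mean())


def difficulty_objective(returns_per_agent: np.ndarray, budget: float) -> float:
    """Toy stand-in for the difficulty term: reward the generator when agents
    succeed, but penalize environments that are too easy (returns above budget)."""
    mean_return = float(returns_per_agent.mean())
    return min(mean_return, budget) - max(mean_return - budget, 0.0)


def generator_objective(returns_per_agent: np.ndarray,
                        alpha: float = 0.5,
                        budget: float = 1.0) -> float:
    """Eq. (4): convex combination of population regret and difficulty."""
    j_popregret = population_regret(returns_per_agent)
    j_difficulty = difficulty_objective(returns_per_agent, budget)
    return (1.0 - alpha) * j_popregret + alpha * j_difficulty


if __name__ == "__main__":
    # Fake returns for a population of 3 learner agents, M = 4 trajectories each.
    rng = np.random.default_rng(0)
    returns = rng.uniform(0.0, 1.0, size=(3, 4))
    print("generator objective:", generator_objective(returns, alpha=0.3))
```

In this reading of Eq. (4), setting α closer to 1 weights the difficulty term more heavily, letting the generator trade off regret maximization against keeping the generated environments at "just-the-right" level of challenge for the agent population.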