Environment Generation for Zero-Shot Compositional Reinforcement Learning
Authors: Izzeddin Gur, Natasha Jaques, Yingjie Miao, Jongwook Choi, Manoj Tiwari, Honglak Lee, Aleksandra Faust
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our results assess learning and generalization across multiple compositional tasks, including the real-world problem of learning to navigate and interact with web pages. We learn to generate environments composed of multiple pages or rooms, and train RL agents capable of completing a wide range of complex tasks in those environments. We contribute two new benchmark frameworks for generating compositional tasks, compositional MiniGrid and gMiniWoB for web navigation. CoDE yields a 4x higher success rate than the strongest baseline, and demonstrates strong performance on real websites, learned on 3500 primitive tasks. |
| Researcher Affiliation | Collaboration | Google Research, Brain Team, Mountain View, California, 94043 {izzeddin, natashajaques, yingjiemiao, mjtiwari, sandrafaust}@google.com; {jwook, honglak}@umich.edu |
| Pseudocode | Yes | Algorithm 1: CoDE: Joint training of the generator and learner agents (a hedged sketch of this joint training loop appears below the table). |
| Open Source Code | Yes | The implementation of CoDE and the gMiniWoB framework are available in open source at https://github.com/google-research/google-research/tree/master/compositional_rl. |
| Open Datasets | Yes | We contribute two new benchmark frameworks for generating compositional tasks, compositional MiniGrid and gMiniWoB for web navigation... The implementation of CoDE and the gMiniWoB framework are available in open source at https://github.com/google-research/google-research/tree/master/compositional_rl. cMiniGrid is a compositional extension of MiniGrid navigation environments [6]... Web navigation evaluation: We evaluate our models on MiniWoB [36] |
| Dataset Splits | No | The paper mentions training and testing on "unseen" environments, but does not specify explicit training/validation/test dataset split percentages, sample counts, or a detailed splitting methodology for reproducibility beyond implying a test set. |
| Hardware Specification | No | The paper states "The training is done on a single CPU, requiring about a week of training." but does not provide specific CPU model numbers, memory, or other detailed hardware specifications for reproducibility. |
| Software Dependencies | No | The paper states "CoDE is implemented using ACME [18] with TensorFlow [2] open-source libraries." but does not provide specific version numbers for these software dependencies. |
| Experiment Setup | Yes | Both the generator and learner agents are trained using RL, specifically A2C [26] with entropy regularization. For every training step, the generator constructs an environment E, and the agents p ∈ P collect M trajectories within E. The learner agents are trained using the standard task-related reward. To train the generator, we use the following multi-objective loss function that encourages the adversary to control the complexity of the environment by presenting a "just-the-right challenge" for the agents in P, where α is a hyperparameter: J(E, P) = (1 − α) J_POPREGRET(E, P) + α J_DIFFICULTY(E, P) (4). The results reported are averaged over 5 seeds. We compute the state-value by using the marginal distribution of elements as attention weights over element encodings and passing the context vector through a feed-forward network. Hedged sketches of the training loop, the generator objective, and the state-value computation appear below the table. |
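
The joint training procedure cited in the "Pseudocode" row (Algorithm 1) and quoted in the "Experiment Setup" row can be summarized at pseudocode level as the loop below. This is a minimal sketch, not the paper's ACME/TensorFlow implementation; the `generator`, `learners`, and `collect_episode` interfaces are hypothetical names assumed only for illustration.

```python
from typing import Callable, Sequence


def code_training_step(generator, learners: Sequence, collect_episode: Callable,
                       num_trajectories: int = 3) -> None:
    """One hedged sketch of a CoDE-style joint training step (Algorithm 1).

    All objects passed in are hypothetical stand-ins: `generator` is assumed
    to expose `build_environment()` and `update(...)`, each learner to expose
    `update(...)`, and `collect_episode(env, learner)` to return a list of
    (state, action, reward) transitions.
    """
    # The generator constructs an environment E (e.g. a composition of
    # design primitives forming a web page or grid world).
    env = generator.build_environment()

    # Each agent p in the population P collects M trajectories in E and is
    # trained on the standard task reward (A2C with entropy regularization
    # in the paper).
    per_agent_returns = []
    for learner in learners:
        returns = []
        for _ in range(num_trajectories):
            episode = collect_episode(env, learner)
            learner.update(episode)
            returns.append(sum(reward for (_, _, reward) in episode))
        per_agent_returns.append(returns)

    # The generator is trained on the multi-objective loss of Eq. (4);
    # a stand-alone computation of that objective is sketched below.
    generator.update(env, per_agent_returns)
```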
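For the generator objective in Eq. (4), a self-contained numerical sketch is given below. The paper defines J_POPREGRET and J_DIFFICULTY precisely; here the population regret is estimated as the gap between the best and the average agent return, which is a common flexible-regret estimator and should be read as an assumption, while the difficulty term is passed in rather than re-derived.

```python
import numpy as np


def generator_objective(per_agent_returns, j_difficulty, alpha=0.5):
    """J(E, P) = (1 - alpha) * J_POPREGRET(E, P) + alpha * J_DIFFICULTY(E, P).

    per_agent_returns: mean episode return of each agent in the population P
        on the generated environment E.
    j_difficulty: the difficulty objective for E (its exact form is defined
        in the paper and not reproduced here).
    alpha: the hyperparameter trading off the two terms.

    The regret estimate below (best agent minus population average) is an
    illustrative assumption, not necessarily the paper's estimator.
    """
    returns = np.asarray(per_agent_returns, dtype=np.float64)
    j_popregret = returns.max() - returns.mean()
    return (1.0 - alpha) * j_popregret + alpha * j_difficulty


# Example: three learner agents, one strong and two weak, on a moderately
# difficult environment.
print(generator_objective([0.9, 0.2, 0.1], j_difficulty=0.4, alpha=0.5))
```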
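The "Experiment Setup" row also quotes the state-value computation: the marginal distribution over elements serves as attention weights over element encodings, and the resulting context vector is passed through a feed-forward network. A minimal NumPy sketch follows; the dimensions, two-layer value head, and random parameters are assumptions made only to keep the example runnable.

```python
import numpy as np


def state_value(element_encodings: np.ndarray, element_marginals: np.ndarray,
                w1: np.ndarray, b1: np.ndarray,
                w2: np.ndarray, b2: float) -> float:
    """Attention-weighted state value, as described in the quoted setup.

    element_encodings: (num_elements, d) encodings of page/grid elements.
    element_marginals: (num_elements,) marginal probabilities used as
        attention weights (assumed to sum to 1).
    w1, b1, w2, b2: parameters of a small feed-forward value head
        (a two-layer head is an assumption, not the paper's exact network).
    """
    # Context vector: probability-weighted sum of element encodings.
    context = element_marginals @ element_encodings          # shape (d,)
    # Feed-forward value head with a ReLU hidden layer.
    hidden = np.maximum(0.0, context @ w1 + b1)
    return float(hidden @ w2 + b2)


# Usage with toy data: 5 elements, 8-dim encodings, 16 hidden units.
rng = np.random.default_rng(0)
enc = rng.normal(size=(5, 8))
probs = rng.dirichlet(np.ones(5))
w1, b1 = rng.normal(size=(8, 16)), np.zeros(16)
w2, b2 = rng.normal(size=16), 0.0
print(state_value(enc, probs, w1, b1, w2, b2))
```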