Learning to Generalize from Sparse and Underspecified Rewards
Authors: Rishabh Agarwal, Chen Liang, Dale Schuurmans, Mohammad Norouzi
ICML 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our overall approach (see Figure 3 for an overview) on two real-world weakly-supervised semantic parsing benchmarks (Pasupat & Liang, 2015; Zhong et al., 2017) (Figure 1) and a simple instruction following environment (Figure 2). In all of the experiments, we observe a significant benefit from the proposed Meta Reward Learning (MeRL) approach, even when the exploration problem is synthetically mitigated. |
| Researcher Affiliation | Collaboration | Google Research, Brain Team; University of Alberta. |
| Pseudocode | Yes | Algorithm 1 Meta Reward-Learning (MeRL) and Algorithm 2 Bayesian Optimization Reward-Learning (BoRL) are provided; a minimal illustrative sketch of the MeRL bi-level update appears below the table. |
| Open Source Code | Yes | Our open-source implementation can be found at https://github.com/google-research/google-research/tree/master/meta_reward_learning. |
| Open Datasets | Yes | We evaluate our approach on two weakly-supervised semantic parsing benchmarks, WIKITABLEQUESTIONS (Pasupat & Liang, 2015) and WIKISQL (Zhong et al., 2017). |
| Dataset Splits | Yes | We use a set of 300 randomly generated environments with (N, K) = (7, 14) with training and validation splits of 80% and 20% respectively. (A sketch of such a split appears below the table.) |
| Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments, such as GPU models, CPU types, or memory specifications. |
| Software Dependencies | No | The paper mentions building on 'open source code of MAPO' and using 'Batched Gaussian Process Bandits' with specific kernels and functions, but does not provide specific version numbers for these software components or other general dependencies. |
| Experiment Setup | No | The paper mentions training details such as using a 'fixed replay buffer' and '5 runs with identical hyperparameters', but does not provide specific hyperparameter values (e.g., learning rate, batch size, number of epochs) or other system-level training settings in the main text. |
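
The pseudocode row above cites Algorithm 1 (MeRL). The snippet below is a minimal, self-contained sketch of MeRL's bi-level structure on a toy contextual-bandit problem, assuming a linear auxiliary reward and a finite-difference meta-gradient; the toy task, step sizes, and helper names are illustrative assumptions, not the paper's semantic-parsing implementation (see the open-source repository linked above for the real one).

```python
import numpy as np

# Toy sketch of the MeRL bi-level update (Algorithm 1): the policy is trained
# on a learned auxiliary reward, and the auxiliary-reward parameters are
# updated so that the trained policy does well on the *true* sparse reward
# of a held-out objective. All quantities here are illustrative assumptions.

rng = np.random.default_rng(0)
n_actions, n_features = 4, 3
feats = rng.normal(size=(n_actions, n_features))   # per-action features
true_reward = np.zeros(n_actions)
true_reward[2] = 1.0                               # sparse: one "correct" action

def policy(theta):
    """Softmax policy over actions, parameterized by theta."""
    z = feats @ theta
    e = np.exp(z - z.max())
    return e / e.sum()

def inner_update(theta, phi, lr=0.5):
    """One policy-gradient step on the learned auxiliary reward R_phi."""
    p = policy(theta)
    aux_r = feats @ phi                            # auxiliary reward per action
    # Exact gradient of E_p[aux_r] for this softmax policy.
    grad = feats.T @ (p * (aux_r - p @ aux_r))
    return theta + lr * grad

def val_objective(theta):
    """Expected true sparse reward, standing in for validation accuracy."""
    return policy(theta) @ true_reward

theta = np.zeros(n_features)
phi = rng.normal(scale=0.1, size=n_features)

# Outer loop: adapt phi so that training on R_phi improves validation reward.
# The meta-gradient is approximated with finite differences for clarity.
for _ in range(200):
    theta = inner_update(theta, phi)
    eps = 1e-3
    meta_grad = np.zeros_like(phi)
    for i in range(n_features):
        phi_hi, phi_lo = phi.copy(), phi.copy()
        phi_hi[i] += eps
        phi_lo[i] -= eps
        meta_grad[i] = (val_objective(inner_update(theta, phi_hi))
                        - val_objective(inner_update(theta, phi_lo))) / (2 * eps)
    phi += 0.1 * meta_grad

print("expected true (validation) reward:", round(float(val_objective(theta)), 3))
```

The structural point this sketch tries to capture is that the auxiliary-reward parameters are updated only through their effect on the policy's performance under the true sparse reward, mirroring MeRL's goal of learning a reward that generalizes rather than one that merely fits the training trajectories.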
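
For the dataset-splits row, the following is a hypothetical reconstruction of an 80%/20% split over the 300 randomly generated environments; the seed and variable names are assumptions.

```python
import numpy as np

# Illustrative 80/20 train/validation split over 300 environment IDs.
rng = np.random.default_rng(42)        # seed is an assumption
env_ids = rng.permutation(300)
n_train = int(0.8 * len(env_ids))      # 240 training environments
train_ids, val_ids = env_ids[:n_train], env_ids[n_train:]
assert len(train_ids) == 240 and len(val_ids) == 60
```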