Learning to Generalize from Sparse and Underspecified Rewards

Authors: Rishabh Agarwal, Chen Liang, Dale Schuurmans, Mohammad Norouzi

ICML 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our overall approach (see Figure 3 for an overview) on two real-world weakly-supervised semantic parsing benchmarks (Pasupat & Liang, 2015; Zhong et al., 2017) (Figure 1) and a simple instruction-following environment (Figure 2). In all of the experiments, we observe a significant benefit from the proposed Meta Reward Learning (MeRL) approach, even when the exploration problem is synthetically mitigated.
Researcher Affiliation | Collaboration | Google Research, Brain Team; University of Alberta.
Pseudocode | Yes | Algorithm 1 Meta Reward-Learning (MeRL) and Algorithm 2 Bayesian Optimization Reward-Learning (BoRL) are provided. A minimal sketch of the MeRL meta-gradient step appears after the table.
Open Source Code | Yes | Our open-source implementation can be found at https://github.com/google-research/google-research/tree/master/meta_reward_learning.
Open Datasets | Yes | We evaluate our approach on two weakly-supervised semantic parsing benchmarks, WikiTableQuestions (Pasupat & Liang, 2015) and WikiSQL (Zhong et al., 2017).
Dataset Splits | Yes | We use a set of 300 randomly generated environments with (N, K) = (7, 14), with training and validation splits of 80% and 20%, respectively. The split arithmetic is illustrated after the table.
Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments, such as GPU models, CPU types, or memory specifications.
Software Dependencies | No | The paper mentions building on the 'open source code of MAPO' and using 'Batched Gaussian Process Bandits' with specific kernels and functions, but does not provide specific version numbers for these software components or other general dependencies.
Experiment Setup | No | The paper mentions training details like using a 'fixed replay buffer' and '5 runs with identical hyperparameters', but does not provide specific hyperparameter values (e.g., learning rate, batch size, number of epochs) or other system-level training settings in the main text.
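
MeRL optimizes the parameters of an auxiliary reward function so that a policy trained with that reward generalizes to held-out data. The sketch below is a minimal illustration of this meta-gradient idea only, not the authors' implementation (which builds on MAPO): the linear policy and reward parameterizations, the REINFORCE-style surrogate, the toy shapes, and the learning rates are all assumptions made for the example.

```python
# Hypothetical MeRL-style meta-gradient sketch (assumed functions and shapes).
import jax
import jax.numpy as jnp

def policy_logp(theta, trajectories):
    # Stand-in log-probability of trajectories under a linear policy (assumption).
    return trajectories @ theta

def aux_reward(phi, trajectories):
    # Learned auxiliary reward over simple trajectory features (assumption).
    return trajectories @ phi

def inner_update(theta, phi, train_traj, lr=0.1):
    # One REINFORCE-style policy step that uses the auxiliary reward.
    surrogate = lambda t: -jnp.mean(policy_logp(t, train_traj) * aux_reward(phi, train_traj))
    return theta - lr * jax.grad(surrogate)(theta)

def meta_loss(phi, theta, train_traj, val_traj, val_return):
    # Surrogate for the validation return of the *updated* policy; differentiating
    # this w.r.t. phi backpropagates through the inner policy update.
    theta_new = inner_update(theta, phi, train_traj)
    return -jnp.mean(policy_logp(theta_new, val_traj) * val_return)

theta = jnp.zeros(8)                      # toy policy parameters
phi = jnp.zeros(8)                        # toy auxiliary-reward parameters
train_traj = jnp.ones((16, 8))            # placeholder training trajectory features
val_traj, val_return = jnp.ones((4, 8)), jnp.ones(4)

# Outer step: update the reward parameters with the meta-gradient.
phi = phi - 0.01 * jax.grad(meta_loss)(phi, theta, train_traj, val_traj, val_return)
```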
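
For the instruction-following environments, the reported split is 300 randomly generated environments partitioned 80% / 20% into training and validation sets. The snippet below only illustrates that arithmetic; the environment IDs and random seed are placeholders, not details from the paper.

```python
# Hypothetical illustration of the reported 80/20 split over 300 environments.
import random

environments = list(range(300))         # placeholder IDs for the generated environments
random.seed(0)                          # assumed seed; the paper does not specify one
random.shuffle(environments)
n_train = int(0.8 * len(environments))  # 80% of 300 = 240 training environments
train_envs, val_envs = environments[:n_train], environments[n_train:]
assert (len(train_envs), len(val_envs)) == (240, 60)
```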