Learning to Generalize from Sparse and Underspecified Rewards
Authors: Rishabh Agarwal, Chen Liang, Dale Schuurmans, Mohammad Norouzi
ICML 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our overall approach (see Figure 3 for an overview) on two real-world weakly-supervised semantic parsing benchmarks (Pasupat & Liang, 2015; Zhong et al., 2017) (Figure 1) and a simple instruction following environment (Figure 2). In all of the experiments, we observe a significant benefit from the proposed Meta Reward Learning (MeRL) approach, even when the exploration problem is synthetically mitigated. |
| Researcher Affiliation | Collaboration | Google Research, Brain Team; University of Alberta. |
| Pseudocode | Yes | Algorithm 1 Meta Reward-Learning (MeRL) and Algorithm 2 Bayesian Optimization Reward-Learning (BoRL) are provided; a minimal illustrative sketch of the MeRL bi-level update appears below the table. |
| Open Source Code | Yes | Our open-source implementation can be found at https://github.com/google-research/google-research/tree/master/meta_reward_learning. |
| Open Datasets | Yes | We evaluate our approach on two weakly-supervised semantic parsing benchmarks, WIKITABLEQUESTIONS (Pasupat & Liang, 2015) and WIKISQL (Zhong et al., 2017). |
| Dataset Splits | Yes | We use a set of 300 randomly generated environments with (N, K) = (7, 14) with training and validation splits of 80% and 20% respectively. (A sketch of such a split appears below the table.) |
| Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments, such as GPU models, CPU types, or memory specifications. |
| Software Dependencies | No | The paper mentions building on 'open source code of MAPO' and using 'Batched Gaussian Process Bandits' with specific kernels and functions, but does not provide specific version numbers for these software components or other general dependencies. |
| Experiment Setup | No | The paper mentions training details such as using a 'fixed replay buffer' and '5 runs with identical hyperparameters', but does not provide specific hyperparameter values (e.g., learning rate, batch size, number of epochs) or other system-level training settings in the main text. |
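
The pseudocode row above cites Algorithm 1 (MeRL). The snippet below is a minimal, self-contained sketch of MeRL's bi-level structure on a toy contextual-bandit problem, assuming a linear auxiliary reward and a finite-difference meta-gradient; the toy task, step sizes, and helper names are illustrative assumptions, not the paper's semantic-parsing implementation (see the open-source repository linked above for the real one).

```python
import numpy as np

# Toy sketch of the MeRL bi-level update (Algorithm 1): the policy is trained
# on a learned auxiliary reward, and the auxiliary-reward parameters are
# updated so that the trained policy does well on the *true* sparse reward
# of a held-out objective. All quantities here are illustrative assumptions.

rng = np.random.default_rng(0)
n_actions, n_features = 4, 3
feats = rng.normal(size=(n_actions, n_features))   # per-action features
true_reward = np.zeros(n_actions)
true_reward[2] = 1.0                               # sparse: one "correct" action

def policy(theta):
    """Softmax policy over actions, parameterized by theta."""
    z = feats @ theta
    e = np.exp(z - z.max())
    return e / e.sum()

def inner_update(theta, phi, lr=0.5):
    """One policy-gradient step on the learned auxiliary reward R_phi."""
    p = policy(theta)
    aux_r = feats @ phi                            # auxiliary reward per action
    # Exact gradient of E_p[aux_r] for this softmax policy.
    grad = feats.T @ (p * (aux_r - p @ aux_r))
    return theta + lr * grad

def val_objective(theta):
    """Expected true sparse reward, standing in for validation accuracy."""
    return policy(theta) @ true_reward

theta = np.zeros(n_features)
phi = rng.normal(scale=0.1, size=n_features)

# Outer loop: adapt phi so that training on R_phi improves validation reward.
# The meta-gradient is approximated with finite differences for clarity.
for _ in range(200):
    theta = inner_update(theta, phi)
    eps = 1e-3
    meta_grad = np.zeros_like(phi)
    for i in range(n_features):
        phi_hi, phi_lo = phi.copy(), phi.copy()
        phi_hi[i] += eps
        phi_lo[i] -= eps
        meta_grad[i] = (val_objective(inner_update(theta, phi_hi))
                        - val_objective(inner_update(theta, phi_lo))) / (2 * eps)
    phi += 0.1 * meta_grad

print("expected true (validation) reward:", round(float(val_objective(theta)), 3))
```

The structural point this sketch tries to capture is that the auxiliary-reward parameters are updated only through their effect on the policy's performance under the true sparse reward, mirroring MeRL's goal of learning a reward that generalizes rather than one that merely fits the training trajectories.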
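
For the dataset-splits row, the following is a hypothetical reconstruction of an 80%/20% split over the 300 randomly generated environments; the seed and variable names are assumptions.

```python
import numpy as np

# Illustrative 80/20 train/validation split over 300 environment IDs.
rng = np.random.default_rng(42)        # seed is an assumption
env_ids = rng.permutation(300)
n_train = int(0.8 * len(env_ids))      # 240 training environments
train_ids, val_ids = env_ids[:n_train], env_ids[n_train:]
assert len(train_ids) == 240 and len(val_ids) == 60
```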