Learning to Understand Goal Specifications by Modelling Reward
Authors: Dzmitry Bahdanau, Felix Hill, Jan Leike, Edward Hughes, Arian Hosseini, Pushmeet Kohli, Edward Grefenstette
ICLR 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We first verify that our method works in settings where a comparison between AGILE-trained policies with policies trained from environment reward is possible, to which end we implement instruction-conditional reward functions. In this setting, we show that the learning speed and performance of A3C agents trained with AGILE reward models is superior to A3C agents trained against environment reward, and comparable to that of true-reward A3C agents supplemented by auxiliary unsupervised reward prediction objectives. To simulate an instruction-learning setting in which implementing a reward function would be problematic, we construct a dataset of instructions and goal-states for the task of building colored orientation-invariant arrangements of blocks. On this task, without us ever having to implement the reward function, the agent trained within AGILE learns to construct arrangements as instructed. Finally, we study how well AGILE's reward model generalises beyond the examples on which it was trained. Our experiments show it can be reused to allow the policy to adapt to changes in the environment. |
| Researcher Affiliation | Collaboration | Dzmitry Bahdanau (Mila, Université de Montréal), Felix Hill (DeepMind), Jan Leike (DeepMind), Edward Hughes (DeepMind), Arian Hosseini (Mila, Université de Montréal), Pushmeet Kohli (DeepMind), Edward Grefenstette (DeepMind, egrefen@fb.com). Work done during an internship at DeepMind. Now at Facebook AI Research. |
| Pseudocode | Yes | Appendix A (AGILE PSEUDOCODE): Algorithm 1, AGILE Discriminator Training; Algorithm 2, AGILE Policy Training. A hedged code sketch of these two loops is given after this table. |
| Open Source Code | No | The paper does not provide a direct link to a code repository or explicitly state that the source code for their method is available. |
| Open Datasets | No | We experiment with AGILE in a grid world environment that we call GridLU, short for Grid Language Understanding and after the famous SHRDLU world (Winograd, 1972). ... To get training data, we built a generator to produce random instantiations (i.e. any translation, rotation, reflection or color mapping of the illustrated forms) of these goal-state classes, as positive examples for the reward model. |
| Dataset Splits | No | The paper mentions training and test sets (e.g., "held out 10% of the instructions as the test set and used the rest 90% as the training set"), but does not explicitly describe a validation set split or its methodology. |
| Hardware Specification | No | The paper mentions "Compute Canada" in the acknowledgements as a support provider, but does not specify any particular hardware components (e.g., GPU/CPU models, memory) used for the experiments. |
| Software Dependencies | No | We trained the policy πθ and the discriminator Dφ concurrently using RMSProp as the optimizer and Asynchronous Advantage Actor-Critic (A3C) (Mnih et al., 2016) as the RL method. ... We used the standard initialisation methods from the Sonnet library. |
| Experiment Setup | Yes | For the purpose of training the policy networks both within AGILE, and for our baseline trained from ground-truth reward rt instead of the modelled reward r̂t, we used the Asynchronous Advantage Actor-Critic (A3C; Mnih et al., 2016). ... The A3C's hyperparameters γ and λ were set to 0.99 and 0 respectively, i.e. we did not use temporal difference learning for the baseline network. The length of an episode was 30, but we trained the agent on advantage estimation rollouts of length 15. ... Full experimental details can be found in Appendix D. Table 1 (hyperparameters for the policy and the discriminator on the GridLU-Relations task), reported as policy / discriminator where both values are given: learning rate 0.0003 / 0.0005; decay 0.99 / 0.9; ϵ 0.1 / 10⁻¹⁰; grad. norm threshold 40 / 25; batch size 1 / 256; rollout length 15; episode length 30; discount 0.99; reward scale 0.1; baseline cost 1.0; reward prediction cost (when used) 1.0; reward prediction batch size 4; num. workers training πθ 15; AGILE: size of replay buffer B 100000, num. workers training Dφ 1; regularization: entropy weight α 0.01, max. column norm 1. A hedged configuration sketch of these settings is given after this table. |
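
The appendix pseudocode referenced in the table describes two interleaved loops: a discriminator (reward-model) update that contrasts annotated goal states with states the agent visits, and a policy update driven by the modelled reward r̂t = [Dφ(c, st) > 0.5]. Below is a minimal Python sketch of those loops, assuming illustrative names (`reward_model`, `policy`, `replay_buffer`, `env`) and a placeholder false-negative fraction `rho`; none of this is the authors' code.

```python
import random

def train_discriminator_step(reward_model, dataset, replay_buffer,
                             batch_size=256, rho=0.25):
    """One update of the reward model D_phi (cf. Algorithm 1).

    Positives are (instruction, goal_state) pairs from the annotated dataset;
    negatives are (instruction, state) pairs visited by the agent, with the
    top-rho fraction by D_phi score discarded as likely false negatives.
    The value of rho here is a placeholder, not taken from the paper.
    """
    positives = random.sample(dataset, batch_size)
    negatives = random.sample(replay_buffer, batch_size)
    negatives.sort(key=lambda cs: reward_model.score(*cs), reverse=True)
    negatives = negatives[int(rho * len(negatives)):]
    reward_model.update(positives, negatives)  # e.g. a cross-entropy step

def run_policy_episode(policy, reward_model, env, replay_buffer,
                       episode_length=30):
    """Collect one episode with the modelled reward (cf. Algorithm 2)."""
    instruction, state = env.reset()
    trajectory = []
    for _ in range(episode_length):
        action = policy.act(instruction, state)
        state = env.step(action)
        # The agent is rewarded by the reward model, not the environment.
        r_hat = float(reward_model.score(instruction, state) > 0.5)
        trajectory.append((instruction, state, action, r_hat))
        replay_buffer.append((instruction, state))
    policy.update(trajectory)  # e.g. an A3C / actor-critic update
```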
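
The Table 1 values quoted in the Experiment Setup row amount to two RMSProp configurations plus A3C and AGILE settings. The sketch below collects them into plain Python dictionaries; the grouping and key names are assumptions, and only the numeric values come from the paper.

```python
# Key names and grouping are illustrative; values are the quoted Table 1 numbers.
POLICY_RMSPROP = {"learning_rate": 3e-4, "decay": 0.99, "epsilon": 0.1,
                  "grad_norm_threshold": 40, "batch_size": 1}
DISCRIMINATOR_RMSPROP = {"learning_rate": 5e-4, "decay": 0.9, "epsilon": 1e-10,
                         "grad_norm_threshold": 25, "batch_size": 256}
A3C_SETTINGS = {
    "rollout_length": 15,
    "episode_length": 30,
    "discount": 0.99,
    "reward_scale": 0.1,
    "baseline_cost": 1.0,
    "reward_prediction_cost": 1.0,        # only when the auxiliary task is used
    "reward_prediction_batch_size": 4,
    "num_workers_training_policy": 15,
}
AGILE_SETTINGS = {
    "replay_buffer_size": 100_000,
    "num_workers_training_discriminator": 1,
}
REGULARIZATION = {"entropy_weight": 0.01, "max_column_norm": 1}
```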