Using Natural Language for Reward Shaping in Reinforcement Learning

Authors: Prasoon Goyal, Scott Niekum, Raymond J. Mooney

Venue: IJCAI 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We experiment with Montezuma's Revenge from the Atari Learning Environment, a popular benchmark in RL. Our experiments on a diverse set of 15 tasks demonstrate that, for the same number of interactions with the environment, language-based rewards lead to successful completion of the task 60% more often on average, compared to learning without language.
Researcher Affiliation | Academia | Prasoon Goyal, Scott Niekum and Raymond J. Mooney, The University of Texas at Austin, {pgoyal, sniekum, mooney}@cs.utexas.edu
Pseudocode | No | The paper describes the neural network architecture and data-processing steps in text and diagrams, but does not provide structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide a repository link or an explicit statement about releasing the source code for its method. It references a third-party PyTorch implementation of RL algorithms, but this is not the authors' own code release.
Open Datasets | Yes | In our experiments, we used 20 trajectories from the Atari Grand Challenge dataset [Kurin et al., 2017], which contains hundreds of crowd-sourced trajectories of human gameplays on 5 Atari games, including Montezuma's Revenge.
Dataset Splits | Yes | The (trajectory, language) pairs were split into training and validation sets, such that there is no overlap between the frames in the training set and the validation set. In particular, Level 1 of Montezuma's revenge consists of 24 rooms, of which we use 14 for training, and the remaining 10 for validation and testing. ... We create a training dataset with 160,000 (action-frequency vector, language) pairs from the training set, and a validation dataset with 40,000 pairs from the validation set, which were used to train LEARN. (A room-disjoint split of this kind is sketched below the table.)
Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU/CPU models, memory, cloud instances) used for running its experiments.
Software Dependencies | No | The paper mentions software components and models such as InferSent, GloVe word embeddings, a GRU encoder, the Adam optimizer, and Proximal Policy Optimization, but does not specify version numbers for any of these libraries, frameworks (e.g., PyTorch, TensorFlow), or specific implementations used. (An illustrative GloVe + GRU encoder is sketched below the table.)
Experiment Setup | Yes | D1, D2 and D3 were tuned using validation data. We used backpropagation with an Adam optimizer [Kingma and Ba, 2014] to train the above neural network for 50 epochs to minimize cross-entropy loss. ... We train the policy for 500,000 timesteps for all our experiments. ... The type of language encoder and the hyperparameter λ are selected using validation as follows. (The training configuration and the λ-weighted reward combination are sketched below the table.)
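
To make the quoted room-disjoint split concrete, here is a minimal sketch in Python, assuming each annotated (action-frequency vector, language) pair is tagged with the Montezuma's Revenge room it came from; the field names, room numbering, and resampling scheme are illustrative assumptions, not the authors' code.

import random

TRAIN_ROOMS = set(range(14))       # 14 of the 24 Level-1 rooms for training
VAL_ROOMS = set(range(14, 24))     # remaining 10 rooms for validation/testing

def split_pairs(annotated_pairs):
    """Split (action-frequency vector, language) pairs by room so that
    training and validation frames never overlap."""
    train = [p for p in annotated_pairs if p["room"] in TRAIN_ROOMS]
    val = [p for p in annotated_pairs if p["room"] in VAL_ROOMS]
    return train, val

def sample_pairs(pairs, n, seed=0):
    """Resample pairs to a fixed dataset size (e.g. 160,000 train / 40,000 val)."""
    rng = random.Random(seed)
    return [rng.choice(pairs) for _ in range(n)]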
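
The Software Dependencies row lists the building blocks of the language module but no versions. A minimal PyTorch sketch of the GloVe + GRU encoder variant could look like the following; the embedding initialization, dimensions, and the two-way relatedness head are assumptions rather than the paper's implementation.

import torch
import torch.nn as nn

class GRULanguageEncoder(nn.Module):
    """Encodes an instruction with a GRU over word embeddings and scores it
    against an action-frequency vector (related vs. unrelated)."""

    def __init__(self, vocab_size, embed_dim=300, hidden_dim=128, num_actions=18):
        super().__init__()
        # In the paper's setup the embeddings would be initialized from GloVe
        # vectors; random initialization keeps this sketch self-contained.
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(hidden_dim + num_actions, 64),
            nn.ReLU(),
            nn.Linear(64, 2),
        )

    def forward(self, token_ids, action_freq):
        embedded = self.embedding(token_ids)   # (batch, seq_len, embed_dim)
        _, hidden = self.gru(embedded)         # (1, batch, hidden_dim)
        sentence = hidden.squeeze(0)
        return self.head(torch.cat([sentence, action_freq], dim=-1))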
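
For the Experiment Setup row, the sketch below covers the quoted supervised training configuration (Adam optimizer, cross-entropy loss, 50 epochs) and a λ-weighted combination of environment and language-based rewards. The learning rate, the value of λ, and the additive form of the combination are assumptions (the paper tunes these on validation data), and the PPO loop itself is omitted.

import torch
import torch.nn as nn

def train_learn_module(model, train_loader, epochs=50, lr=1e-3, device="cpu"):
    """Supervised training of the language-reward network with Adam and
    cross-entropy loss, as quoted above (hyperparameters illustrative)."""
    model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for token_ids, action_freq, labels in train_loader:
            logits = model(token_ids.to(device), action_freq.to(device))
            loss = loss_fn(logits, labels.to(device))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model

def shaped_reward(env_reward, language_reward, lam=0.1):
    """Combine the extrinsic reward with the language-based reward; the shaped
    signal is what the PPO policy would be trained on for 500,000 timesteps.
    The additive form and the value of lam are assumptions."""
    return env_reward + lam * language_reward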