Using Natural Language for Reward Shaping in Reinforcement Learning

Authors: Prasoon Goyal, Scott Niekum, Raymond J. Mooney

Venue: IJCAI 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We experiment with Montezuma's Revenge from the Atari Learning Environment, a popular benchmark in RL. Our experiments on a diverse set of 15 tasks demonstrate that, for the same number of interactions with the environment, language-based rewards lead to successful completion of the task 60% more often on average, compared to learning without language.
Researcher Affiliation | Academia | Prasoon Goyal, Scott Niekum and Raymond J. Mooney, The University of Texas at Austin, {pgoyal, sniekum, mooney}@cs.utexas.edu
Pseudocode | No | The paper describes the neural network architecture and data-processing steps in text and diagrams, but does not provide structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide a repository link or an explicit statement about releasing the source code for its method. It references a third-party PyTorch implementation of RL algorithms, but this is not the authors' own code release.
Open Datasets | Yes | In our experiments, we used 20 trajectories from the Atari Grand Challenge dataset [Kurin et al., 2017], which contains hundreds of crowd-sourced trajectories of human gameplays on 5 Atari games, including Montezuma's Revenge.
Dataset Splits | Yes | The (trajectory, language) pairs were split into training and validation sets, such that there is no overlap between the frames in the training set and the validation set. In particular, Level 1 of Montezuma's revenge consists of 24 rooms, of which we use 14 for training, and the remaining 10 for validation and testing. ... We create a training dataset with 160,000 (action-frequency vector, language) pairs from the training set, and a validation dataset with 40,000 pairs from the validation set, which were used to train LEARN. (A room-disjoint split of this kind is sketched below the table.)
Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU/CPU models, memory, cloud instances) used for running its experiments.
Software Dependencies | No | The paper mentions software components and models such as InferSent, GloVe word embeddings, a GRU encoder, the Adam optimizer, and Proximal Policy Optimization, but does not specify version numbers for any of these libraries, frameworks (e.g., PyTorch, TensorFlow), or specific implementations used. (An illustrative GloVe + GRU encoder is sketched below the table.)
Experiment Setup | Yes | D1, D2 and D3 were tuned using validation data. We used backpropagation with an Adam optimizer [Kingma and Ba, 2014] to train the above neural network for 50 epochs to minimize cross-entropy loss. ... We train the policy for 500,000 timesteps for all our experiments. ... The type of language encoder and the hyperparameter λ are selected using validation as follows. (The training configuration and the λ-weighted reward combination are sketched below the table.)
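
To make the quoted room-disjoint split concrete, here is a minimal sketch in Python, assuming each annotated (action-frequency vector, language) pair is tagged with the Montezuma's Revenge room it came from; the field names, room numbering, and resampling scheme are illustrative assumptions, not the authors' code.

import random

TRAIN_ROOMS = set(range(14))       # 14 of the 24 Level-1 rooms for training
VAL_ROOMS = set(range(14, 24))     # remaining 10 rooms for validation/testing

def split_pairs(annotated_pairs):
    """Split (action-frequency vector, language) pairs by room so that
    training and validation frames never overlap."""
    train = [p for p in annotated_pairs if p["room"] in TRAIN_ROOMS]
    val = [p for p in annotated_pairs if p["room"] in VAL_ROOMS]
    return train, val

def sample_pairs(pairs, n, seed=0):
    """Resample pairs to a fixed dataset size (e.g. 160,000 train / 40,000 val)."""
    rng = random.Random(seed)
    return [rng.choice(pairs) for _ in range(n)]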
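
The Software Dependencies row lists the building blocks of the language module but no versions. A minimal PyTorch sketch of the GloVe + GRU encoder variant could look like the following; the embedding initialization, dimensions, and the two-way relatedness head are assumptions rather than the paper's implementation.

import torch
import torch.nn as nn

class GRULanguageEncoder(nn.Module):
    """Encodes an instruction with a GRU over word embeddings and scores it
    against an action-frequency vector (related vs. unrelated)."""

    def __init__(self, vocab_size, embed_dim=300, hidden_dim=128, num_actions=18):
        super().__init__()
        # In the paper's setup the embeddings would be initialized from GloVe
        # vectors; random initialization keeps this sketch self-contained.
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(hidden_dim + num_actions, 64),
            nn.ReLU(),
            nn.Linear(64, 2),
        )

    def forward(self, token_ids, action_freq):
        embedded = self.embedding(token_ids)   # (batch, seq_len, embed_dim)
        _, hidden = self.gru(embedded)         # (1, batch, hidden_dim)
        sentence = hidden.squeeze(0)
        return self.head(torch.cat([sentence, action_freq], dim=-1))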
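
For the Experiment Setup row, the sketch below covers the quoted supervised training configuration (Adam optimizer, cross-entropy loss, 50 epochs) and a λ-weighted combination of environment and language-based rewards. The learning rate, the value of λ, and the additive form of the combination are assumptions (the paper tunes these on validation data), and the PPO loop itself is omitted.

import torch
import torch.nn as nn

def train_learn_module(model, train_loader, epochs=50, lr=1e-3, device="cpu"):
    """Supervised training of the language-reward network with Adam and
    cross-entropy loss, as quoted above (hyperparameters illustrative)."""
    model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for token_ids, action_freq, labels in train_loader:
            logits = model(token_ids.to(device), action_freq.to(device))
            loss = loss_fn(logits, labels.to(device))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model

def shaped_reward(env_reward, language_reward, lam=0.1):
    """Combine the extrinsic reward with the language-based reward; the shaped
    signal is what the PPO policy would be trained on for 500,000 timesteps.
    The additive form and the value of lam are assumptions."""
    return env_reward + lam * language_reward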