Hybrid Reward Architecture for Reinforcement Learning

Authors: Harm van Seijen, Mehdi Fatemi, Joshua Romoff, Romain Laroche, Tavian Barnes, Jeffrey Tsang

NeurIPS 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate HRA on a toy-problem and the Atari game Ms. Pac-Man, where HRA achieves above-human performance. We test our approach on two domains: a toy-problem, where an agent has to eat 5 randomly located fruits, and Ms. Pac-Man, one of the hard games from the ALE benchmark set (Bellemare et al., 2013). Section 4 is titled 'Experiments'.
Researcher Affiliation | Collaboration | Microsoft Maluuba, Montreal, Canada; McGill University, Montreal, Canada
Pseudocode | No | The paper includes equations and an architectural diagram, but no structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide an explicit statement about the release of source code or a link to a code repository. The footnote link is to a YouTube video of the game, not code.
Open Datasets | Yes | We test our approach on two domains: a toy-problem... and Ms. Pac-Man, one of the hard games from the ALE benchmark set (Bellemare et al., 2013). The Arcade Learning Environment (ALE) is a well-known, publicly available benchmark.
Dataset Splits | No | The paper discusses training and evaluation metrics but does not explicitly provide training/validation/test dataset splits.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments.
Software Dependencies | No | The paper mentions methods like DQN and A3C, but does not provide specific software dependencies or library version numbers (e.g., 'PyTorch 1.9', 'Python 3.8').
Experiment Setup | Yes | The network consists of a binary input layer of length 110, encoding the agent's position and whether there is a fruit on each location. This is followed by a fully connected hidden layer of length 250. This layer is connected to 10 heads consisting of 4 linear nodes each, representing the action-values of the 4 actions under the different reward functions. We optimised the step-size and the discount factor for each method separately. We train A3C for 800 million frames. Because HRA learns fast, we train it only for 5,000 episodes, corresponding with about 150 million frames.
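For concreteness, the quoted Experiment Setup can be turned into a minimal PyTorch sketch. The layer sizes (a 110-unit binary input, a 250-unit fully connected hidden layer, and 10 heads of 4 linear outputs each) come directly from the quote; the ReLU activation, the summation of head values for action selection, and the class name are assumptions not stated in the quote.

```python
import torch
import torch.nn as nn

class HRAToyNetwork(nn.Module):
    """Sketch of the toy-problem network described in the paper's quote:
    110-dim binary input -> 250-unit hidden layer -> 10 heads x 4 actions."""

    def __init__(self, n_inputs=110, n_hidden=250, n_heads=10, n_actions=4):
        super().__init__()
        # Hidden layer; ReLU is an assumption, the activation is not quoted.
        self.hidden = nn.Sequential(nn.Linear(n_inputs, n_hidden), nn.ReLU())
        # One linear head per reward component; each outputs the action-values
        # of the 4 actions under that component's reward function.
        self.heads = nn.ModuleList(
            [nn.Linear(n_hidden, n_actions) for _ in range(n_heads)]
        )

    def forward(self, x):
        h = self.hidden(x)
        # Shape: (batch, n_heads, n_actions)
        return torch.stack([head(h) for head in self.heads], dim=1)


net = HRAToyNetwork()
obs = torch.zeros(1, 110)           # binary observation: agent position + fruit flags
q_per_head = net(obs)               # per-head action-values
q_combined = q_per_head.sum(dim=1)  # aggregating heads by summation is an assumption
action = q_combined.argmax(dim=1)
```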