Hybrid Reward Architecture for Reinforcement Learning

Authors: Harm van Seijen, Mehdi Fatemi, Joshua Romoff, Romain Laroche, Tavian Barnes, Jeffrey Tsang

NeurIPS 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate HRA on a toy-problem and the Atari game Ms. Pac-Man, where HRA achieves above-human performance. We test our approach on two domains: a toy-problem, where an agent has to eat 5 randomly located fruits, and Ms. Pac-Man, one of the hard games from the ALE benchmark set (Bellemare et al., 2013). Section 4 is titled 'Experiments'.
Researcher Affiliation | Collaboration | Microsoft Maluuba, Montreal, Canada; McGill University, Montreal, Canada
Pseudocode | No | The paper includes equations and an architectural diagram, but no structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide an explicit statement about the release of source code or a link to a code repository. The footnote link is to a YouTube video of the game, not code.
Open Datasets | Yes | We test our approach on two domains: a toy-problem... and Ms. Pac-Man, one of the hard games from the ALE benchmark set (Bellemare et al., 2013). The Arcade Learning Environment (ALE) is a well-known, publicly available benchmark.
Dataset Splits | No | The paper discusses training and evaluation metrics but does not explicitly provide training/validation/test dataset splits.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments.
Software Dependencies | No | The paper mentions methods like DQN and A3C, but does not provide specific software dependencies or library version numbers (e.g., 'PyTorch 1.9', 'Python 3.8').
Experiment Setup | Yes | The network consists of a binary input layer of length 110, encoding the agent's position and whether there is a fruit on each location. This is followed by a fully connected hidden layer of length 250. This layer is connected to 10 heads consisting of 4 linear nodes each, representing the action-values of the 4 actions under the different reward functions. We optimised the step-size and the discount factor for each method separately. We train A3C for 800 million frames. Because HRA learns fast, we train it only for 5,000 episodes, corresponding with about 150 million frames.
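For concreteness, the quoted Experiment Setup can be turned into a minimal PyTorch sketch. The layer sizes (a 110-unit binary input, a 250-unit fully connected hidden layer, and 10 heads of 4 linear outputs each) come directly from the quote; the ReLU activation, the summation of head values for action selection, and the class name are assumptions not stated in the quote.

```python
import torch
import torch.nn as nn

class HRAToyNetwork(nn.Module):
    """Sketch of the toy-problem network described in the paper's quote:
    110-dim binary input -> 250-unit hidden layer -> 10 heads x 4 actions."""

    def __init__(self, n_inputs=110, n_hidden=250, n_heads=10, n_actions=4):
        super().__init__()
        # Hidden layer; ReLU is an assumption, the activation is not quoted.
        self.hidden = nn.Sequential(nn.Linear(n_inputs, n_hidden), nn.ReLU())
        # One linear head per reward component; each outputs the action-values
        # of the 4 actions under that component's reward function.
        self.heads = nn.ModuleList(
            [nn.Linear(n_hidden, n_actions) for _ in range(n_heads)]
        )

    def forward(self, x):
        h = self.hidden(x)
        # Shape: (batch, n_heads, n_actions)
        return torch.stack([head(h) for head in self.heads], dim=1)


net = HRAToyNetwork()
obs = torch.zeros(1, 110)           # binary observation: agent position + fruit flags
q_per_head = net(obs)               # per-head action-values
q_combined = q_per_head.sum(dim=1)  # aggregating heads by summation is an assumption
action = q_combined.argmax(dim=1)
```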