Deriving Subgoals Autonomously to Accelerate Learning in Sparse Reward Domains

Authors: Michael Dann, Fabio Zambetta, John Thangarajah

AAAI 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We apply this approach to three Atari games with sparse rewards (Venture, Freeway and Montezuma's Revenge), achieving similar performance to state-of-the-art methods based on visual density models (Bellemare et al. 2016a; Ostrovski et al. 2017). We benchmarked our method against a configuration that was identical in every respect except that pellet rewards were turned off. For each game, we conducted 5 training runs per agent. To make it easier to see the effect of pellet rewards on learning progress, we have plotted all training curves in Figures 1 and 2 from the point where Q-learning commenced."
Researcher Affiliation | Academia | "Michael Dann, Fabio Zambetta, John Thangarajah, Computer Science, RMIT University, Australia, {michael.dann, fabio.zambetta, john.thangarajah}@rmit.edu.au"
Pseudocode | Yes | Algorithm 1: Q-learning with Pellet Rewards (a hedged sketch of this update appears after the table)
Open Source Code | Yes | "All other settings can be found in our source code" (https://bitbucket.org/mchldann/aaai2019)
Open Datasets | Yes | "We used version 0.6 of the Arcade Learning Environment (ALE) (Bellemare et al. 2013)"
Dataset Splits | No | The paper mentions training durations in frames and episodes but does not specify explicit training, validation, or test dataset splits (e.g., percentages or counts).
Hardware Specification | No | The paper describes model architecture and training parameters but does not provide specific details about the hardware used for experiments, such as GPU or CPU models.
Software Dependencies | Yes | "We used version 0.6 of the Arcade Learning Environment (ALE) (Bellemare et al. 2013)" (an ALE usage sketch appears after the table)
Experiment Setup | Yes | "Exploration Effort Training Parameters. The auxiliary reward scale factor, κ, was set to 1. The time separation constant, m, was set to 100. Both the EE and Q-function were trained via mixed Monte Carlo updates with η = 0.1. Prior to commencing Q-learning, the EE function was trained for 8 million frames (2 million samples) on experience generated via a uniform random policy. Partition / Pellet Configuration. We set the pellet reward scale factor, β, to 1, but clipped bonuses to a maximum of 0.1. The time between partition additions was initially set to 80,000 frames, then increased by 20% with each addition. Experience collection for Q-learning commenced once there were 5 partitions in existence." (these settings are gathered into a configuration sketch after the table)
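
The Pseudocode row above cites Algorithm 1, Q-learning with Pellet Rewards. Below is a minimal sketch of how a clipped pellet bonus could be folded into a mixed Monte Carlo Q-update, using the values quoted in the table (η = 0.1, β = 1, bonus clip 0.1). The environment interface, the learning rate, the discount factor and the `pellet_bonus` helper are assumptions for illustration; this is not the authors' implementation.

```python
import random
from collections import defaultdict

# Hedged sketch: tabular Q-learning with a clipped "pellet" bonus and a mixed
# Monte Carlo target. ETA, BETA and BONUS_CLIP come from the quoted settings;
# GAMMA, ALPHA and the environment interface are assumptions.

GAMMA = 0.99       # assumed discount factor (not given in this excerpt)
ALPHA = 0.00025    # assumed learning rate (not given in this excerpt)
ETA = 0.1          # mixed Monte Carlo weight (from the quoted settings)
BETA = 1.0         # pellet reward scale factor (from the quoted settings)
BONUS_CLIP = 0.1   # maximum pellet bonus (from the quoted settings)

Q = defaultdict(float)  # Q[(state, action)] -> value; states assumed hashable


def pellet_bonus(state):
    """Hypothetical placeholder for the paper's partition-based pellet bonus."""
    return 0.0


def run_episode(env, actions, epsilon=0.05):
    """Collect one episode, then apply mixed Monte Carlo Q-updates.

    `env` is an assumed interface: reset() -> state, step(a) -> (state, reward, done).
    """
    trajectory = []
    state, done = env.reset(), False
    while not done:
        action = (random.choice(actions) if random.random() < epsilon
                  else max(actions, key=lambda a: Q[(state, a)]))
        next_state, reward, done = env.step(action)
        # Shape the environment reward with the clipped pellet bonus.
        shaped = reward + min(BETA * pellet_bonus(next_state), BONUS_CLIP)
        trajectory.append((state, action, shaped, next_state, done))
        state = next_state

    # Backward pass: mix the one-step bootstrap target with the Monte Carlo return.
    mc_return = 0.0
    for state, action, shaped, next_state, done in reversed(trajectory):
        mc_return = shaped + GAMMA * mc_return
        one_step = shaped if done else shaped + GAMMA * max(
            Q[(next_state, a)] for a in actions)
        target = (1.0 - ETA) * one_step + ETA * mc_return
        Q[(state, action)] += ALPHA * (target - Q[(state, action)])
```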
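
The hyperparameters quoted in the Experiment Setup row can be gathered into a single configuration object, which makes the reported settings easier to reuse. The field names below are assumptions; only the values come from the quoted text.

```python
from dataclasses import dataclass

# Hedged sketch: the settings quoted in the "Experiment Setup" row, collected
# in one place. Field names are assumptions; values are from the quoted text.

@dataclass
class PelletRewardConfig:
    # Exploration Effort (EE) training parameters
    kappa: float = 1.0                        # auxiliary reward scale factor
    m: int = 100                              # time separation constant
    eta: float = 0.1                          # mixed Monte Carlo weight (EE and Q-function)
    ee_pretrain_frames: int = 8_000_000       # random-policy EE pretraining (2M samples)

    # Partition / pellet configuration
    beta: float = 1.0                         # pellet reward scale factor
    pellet_bonus_clip: float = 0.1            # bonuses clipped to a maximum of 0.1
    initial_partition_interval: int = 80_000  # frames between partition additions
    partition_interval_growth: float = 1.2    # interval increased by 20% per addition
    min_partitions_before_q_learning: int = 5 # Q-learning starts at 5 partitions


config = PelletRewardConfig()
```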
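
The paper reports ALE 0.6 as its environment and pretrains the EE function on experience from a uniform random policy. A minimal sketch of generating such experience with the classic ALE Python binding is shown below; the module name (`ale_python_interface`), the byte-string arguments, the frame-skip value and the ROM filename are assumptions and may differ by ALE version and installation.

```python
import random

# Hedged sketch: uniform-random-policy experience generation for EE
# pretraining. Binding name, byte-string keys and ROM path are assumptions.
from ale_python_interface import ALEInterface

FRAME_SKIP = 4           # assumed, consistent with "8 million frames (2 million samples)"
TOTAL_FRAMES = 8_000_000

ale = ALEInterface()
ale.setInt(b"random_seed", 123)
ale.setInt(b"frame_skip", FRAME_SKIP)
ale.loadROM(b"montezuma_revenge.bin")   # ROM filename is an assumption
actions = ale.getMinimalActionSet()

frames = 0
while frames < TOTAL_FRAMES:
    if ale.game_over():
        ale.reset_game()
    reward = ale.act(random.choice(actions))  # uniform random policy
    frames += FRAME_SKIP
    # In a real pipeline, the observation (e.g. ale.getScreenRGB()) and the
    # chosen action would be stored here as a training sample for the EE function.
```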