Deriving Subgoals Autonomously to Accelerate Learning in Sparse Reward Domains
Authors: Michael Dann, Fabio Zambetta, John Thangarajah (pp. 881-889)
AAAI 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We apply this approach to three Atari games with sparse rewards (Venture, Freeway and Montezuma's Revenge), achieving similar performance to state-of-the-art methods based on visual density models (Bellemare et al. 2016a; Ostrovski et al. 2017). We benchmarked our method against a configuration that was identical in every respect except that pellet rewards were turned off. For each game, we conducted 5 training runs per agent. To make it easier to see the effect of pellet rewards on learning progress, we have plotted all training curves in Figures 1 and 2 from the point where Q-learning commenced. |
| Researcher Affiliation | Academia | Michael Dann, Fabio Zambetta, John Thangarajah Computer Science RMIT University, Australia {michael.dann, fabio.zambetta, john.thangarajah}@rmit.edu.au |
| Pseudocode | Yes | Algorithm 1 Q-learning with Pellet Rewards |
| Open Source Code | Yes | All other settings can be found in our source code: https://bitbucket.org/mchldann/aaai2019 |
| Open Datasets | Yes | We used version 0.6 of the Arcade Learning Environment (ALE) (Bellemare et al. 2013) |
| Dataset Splits | No | The paper mentions training durations in frames and episodes but does not specify explicit training, validation, or test dataset splits (e.g., percentages or counts). |
| Hardware Specification | No | The paper describes model architecture and training parameters but does not provide specific details about the hardware used for experiments, such as GPU or CPU models. |
| Software Dependencies | Yes | We used version 0.6 of the Arcade Learning Environment (ALE) (Bellemare et al. 2013) |
| Experiment Setup | Yes | Exploration Effort Training Parameters. The auxiliary reward scale factor, κ, was set to 1. The time separation constant, m, was set to 100. Both the EE and Q-function were trained via mixed Monte Carlo updates with η = 0.1. Prior to commencing Q-learning, the EE function was trained for 8 million frames (2 million samples) on experience generated via a uniform random policy. Partition / Pellet Configuration. We set the pellet reward scale factor, β, to 1, but clipped bonuses to a maximum of 0.1. The time between partition additions was initially set to 80,000 frames, then increased by 20% with each addition. Experience collection for Q-learning commenced once there were 5 partitions in existence. |
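
The experiment-setup quote above amounts to a small set of hyperparameters plus a shaped update rule. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation: the names (`PELLET_CONFIG`, `clipped_pellet_bonus`, `mixed_monte_carlo_target`, `partition_addition_frames`) and the `raw_bonus` input are hypothetical, while the constants (κ = 1, m = 100, η = 0.1, β = 1, a 0.1 bonus clip, an initial 80,000-frame partition interval growing by 20% per addition, and 5 partitions before Q-learning) are taken from the reported settings.

```python
# Hedged sketch of the reported training configuration and shaped update.
# Constants come from the paper's "Experiment Setup" quote; all function
# and variable names are illustrative, not the authors' code.

PELLET_CONFIG = {
    "kappa": 1.0,          # auxiliary (exploration effort) reward scale factor
    "m": 100,              # time separation constant for the EE function
    "eta": 0.1,            # mixing coefficient for mixed Monte Carlo updates
    "beta": 1.0,           # pellet reward scale factor
    "bonus_clip": 0.1,     # maximum pellet bonus per step
    "ee_pretrain_frames": 8_000_000,       # EE pretraining on a uniform random policy
    "initial_partition_interval": 80_000,  # frames between partition additions
    "partition_interval_growth": 1.2,      # interval grows 20% with each addition
    "min_partitions_before_qlearning": 5,
}


def clipped_pellet_bonus(raw_bonus: float, beta: float = 1.0, clip: float = 0.1) -> float:
    """Scale a pellet bonus by beta and clip it to the reported maximum of 0.1."""
    return min(beta * raw_bonus, clip)


def mixed_monte_carlo_target(reward: float, gamma: float, max_next_q: float,
                             mc_return: float, eta: float = 0.1) -> float:
    """Blend the one-step Q-learning target with the Monte Carlo return,
    in the spirit of the mixed Monte Carlo updates of Bellemare et al. (2016)."""
    one_step = reward + gamma * max_next_q
    return (1.0 - eta) * one_step + eta * mc_return


def partition_addition_frames(num_additions: int,
                              start: int = 80_000,
                              growth: float = 1.2) -> list[int]:
    """Frame counts at which successive partitions would be added, assuming the
    interval starts at 80,000 frames and grows by 20% after each addition."""
    frames, interval, t = [], float(start), 0.0
    for _ in range(num_additions):
        t += interval
        frames.append(int(t))
        interval *= growth
    return frames
```

Under these assumptions, the shaped per-step reward would look like `r_env + clipped_pellet_bonus(raw_bonus)`, and `partition_addition_frames(5)` ends near 595,000 frames, which is roughly where experience collection for Q-learning would begin under one reading of the partition schedule.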