Scaling All-Goals Updates in Reinforcement Learning Using Convolutional Neural Networks
Authors: Fabio Pardo, Vitaly Levdik, Petar Kormushev (pp. 5355-5362)
AAAI 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the accuracy and generalization qualities of the proposed method on randomly generated mazes and Sokoban puzzles. In the case of on-screen goal coordinates, the resulting mapping from frames to distance maps directly informs the agent about which places are reachable and in how many steps. As an example of application we show that replacing the random actions in ε-greedy exploration by several actions towards feasible goals generates better exploratory trajectories on Montezuma's Revenge and Super Mario All-Stars games. (See the goal-sampling sketch after the table.) |
| Researcher Affiliation | Academia | Fabio Pardo, Vitaly Levdik, Petar Kormushev Robot Intelligence Lab, Imperial College London, United Kingdom {f.pardo, v.levdik, p.kormushev}@imperial.ac.uk |
| Pseudocode | No | The paper describes the training process and model usage but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The source code and videos are available on the website: https://sites.google.com/view/q-map-rl. |
| Open Datasets | No | The paper uses generated maze environments and levels from Gym-Sokoban and OpenAI Retro, but does not provide concrete access information (e.g., a link or citation to a specific dataset repository) for the generated training data or collected transitions used in their experiments. It mentions 'Gym-Sokoban (Schrader 2018)' and 'Open AI. 2018. Open AI Retro. https://github.com/openai/retro', which are environments/frameworks, but not the specific datasets of transitions they created and used. |
| Dataset Splits | No | For the random mazes, the paper mentions a 'training set' and a 'testing set' (1,000,220 transitions and 1,075 observations, respectively), but does not explicitly state a validation set or other specific dataset splits for reproduction across all experiments. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models used for running its experiments. |
| Software Dependencies | No | The paper mentions software components like Gym-Sokoban and OpenAI Retro, but does not provide specific version numbers for these or any other software dependencies (e.g., Python, deep learning frameworks, or libraries) used in the experiments. |
| Experiment Setup | Yes | In all of the experiments we use γ = 0.9 for the goal-reaching Q-functions and the neural networks are described using the notations: conv(filters, kernel sizes, strides) for convolutions (with 'same' padding unless stated otherwise), deconv2d for transposed convolutions and dense(units) for dense layers. ELU activation functions are used for every layer except for the output ones. For the baseline model we use a green pixel to represent the goal. The agents are trained with batches of 50 random transitions from the training set. All the models are trained on the same batch size of 100. A random goal is chosen within 15 to 30 predicted steps from the agent's current position. An individual goal-directed trajectory terminates upon either reaching the goal or exceeding 150% of the original predicted number of steps. Furthermore, there is a chance to take a random action, decayed linearly from 0.1 to 0.05. The rewards from the game are divided by 100, with no bonus for moving to the right or penalty for game overs. The exploratory schedule linearly decreases from 100% to 5%. For the proposed agent, the random action probability is decreased from 10% to 5% over the course of the training. Furthermore, to focus the exploration towards the task-learner policy, a 50% chance to select goals whose first greedy action is identical to the one from the task-learner is introduced. (Hedged sketches of the goal sampling and exploratory rollout described here follow the table.) |
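The table above describes how goals are drawn within 15 to 30 predicted steps of the agent using the frame-to-distance-map prediction. Below is a minimal sketch of that goal sampling, assuming the goal-reaching Q-values approximate γ^(d−1) for a goal d steps away so that distances can be recovered with a logarithm; the array shapes and the `predicted_steps` / `sample_goal` helpers are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

GAMMA = 0.9  # discount used by the goal-reaching Q-functions in the paper

def predicted_steps(q_map):
    """Recover predicted step counts from a (H, W, n_actions) Q-map,
    assuming Q ~= GAMMA ** (steps - 1) for the best action per goal cell."""
    best_q = np.clip(q_map.max(axis=-1), 1e-6, 1.0)
    return 1.0 + np.log(best_q) / np.log(GAMMA)

def sample_goal(q_map, min_steps=15, max_steps=30, rng=np.random):
    """Pick a random on-screen goal whose predicted distance lies in
    [min_steps, max_steps], mirroring the 15-30 step range in the setup."""
    steps = predicted_steps(q_map)
    candidates = np.argwhere((steps >= min_steps) & (steps <= max_steps))
    if len(candidates) == 0:
        return None  # no feasible goal in range; caller can fall back to a random action
    y, x = candidates[rng.randint(len(candidates))]
    return int(y), int(x)
```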
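A companion sketch of the goal-directed exploratory rollout from the setup: act greedily towards the sampled goal, terminate when the goal is reached or when 150% of the predicted number of steps is exceeded, and take a random action with a probability linearly decayed from 10% to 5%. The `q_map_fn` callable, the classic Gym-style `env.step` return values, and the `agent_pos` info field are assumptions made for illustration.

```python
import numpy as np

def linear_decay(start, end, step, total_steps):
    """Linearly anneal a value from `start` to `end` over `total_steps` environment steps."""
    frac = min(step / float(total_steps), 1.0)
    return start + frac * (end - start)

def goal_directed_rollout(env, obs, q_map_fn, goal, predicted, step, total_steps,
                          rng=np.random):
    """Act greedily towards `goal` until it is reached or the step budget
    (150% of the predicted distance) runs out; occasionally act randomly."""
    budget = int(np.ceil(1.5 * predicted))
    for _ in range(budget):
        eps = linear_decay(0.10, 0.05, step, total_steps)   # random-action chance
        q_map = q_map_fn(obs)                                # assumed shape (H, W, n_actions)
        if rng.rand() < eps:
            action = rng.randint(q_map.shape[-1])
        else:
            action = int(np.argmax(q_map[goal]))             # greedy action for the goal cell
        obs, _, done, info = env.step(action)
        step += 1
        if done or info.get("agent_pos") == goal:            # 'agent_pos' is hypothetical
            break
    return obs, step
```

The proposed agent additionally biases goal selection, with a 50% chance, towards goals whose first greedy action matches the task-learner's, which this sketch does not model.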