Composing Value Functions in Reinforcement Learning
Authors: Benjamin Van Niekerk, Steven James, Adam Earle, Benjamin Rosman
ICML 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To demonstrate composition, we perform a series of experiments in a high-dimensional video game (Figure 1b). Results show that an agent is able to compose existing policies learned from high-dimensional pixel input to generate new, optimal behaviours. |
| Researcher Affiliation | Academia | School of Computer Science and Applied Mathematics, University of the Witwatersrand, Johannesburg, South Africa; Council for Scientific and Industrial Research, Pretoria, South Africa. |
| Pseudocode | Yes | Algorithm 1 Soft Value Iteration, Algorithm 2 Soft Policy Iteration (a hedged value-iteration sketch follows the table) |
| Open Source Code | No | The paper does not provide a link to source code or explicitly state that source code for the methodology is being released. |
| Open Datasets | No | The paper describes custom tasks within a video game domain developed for the experiments, but does not provide access information for a publicly available dataset. It states 'We construct a number of different tasks based on the objects that the agent must collect'. |
| Dataset Splits | No | The paper describes training and evaluation on a custom video game environment (e.g., 'Each network is trained for 1.5m timesteps', 'Returns from 50k episodes'), but does not specify explicit training, validation, or test dataset splits. |
| Hardware Specification | No | No specific hardware details (such as GPU or CPU models, or cloud computing specifications) used for running experiments were mentioned. |
| Software Dependencies | No | The paper mentions '(soft) deep Q-learning' but does not specify any software names with version numbers (e.g., programming languages, libraries, or frameworks). |
| Experiment Setup | Yes | Each network is trained for 1.5m timesteps to ensure near-optimal convergence. The input to our network is a single RGB frame of size 84×84, which is passed through three convolutional layers and two fully-connected layers before outputting the predicted Q-values for the given state. (A hedged architecture sketch follows the table.) |
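
The Pseudocode row above points to Algorithm 1 (Soft Value Iteration) in the paper. As a rough, non-authoritative illustration, the following is a minimal NumPy sketch of a tabular soft value iteration loop in the maximum-entropy setting; the function name, temperature parameter `tau`, dense transition and reward arrays, and convergence tolerance are assumptions, not the paper's implementation.

```python
import numpy as np

def soft_value_iteration(P, R, gamma=0.95, tau=1.0, n_iters=500, tol=1e-6):
    """Tabular soft (maximum-entropy) value iteration sketch.

    P: transition tensor, shape (S, A, S), P[s, a, s'] = Pr(s' | s, a)  (assumed representation)
    R: reward matrix, shape (S, A)
    gamma: discount factor; tau: entropy temperature (assumed hyperparameters)
    Returns the soft-optimal Q-values and the Boltzmann (softmax) policy.
    """
    S, A = R.shape
    Q = np.zeros((S, A))
    for _ in range(n_iters):
        # Soft state value: V(s) = tau * log sum_a exp(Q(s, a) / tau)
        V = tau * np.log(np.exp(Q / tau).sum(axis=1))
        # Soft Bellman backup: Q(s, a) = R(s, a) + gamma * E_{s'}[V(s')]
        Q_new = R + gamma * (P @ V)
        if np.max(np.abs(Q_new - Q)) < tol:
            Q = Q_new
            break
        Q = Q_new
    # The maximum-entropy optimal policy is the Boltzmann distribution over Q
    policy = np.exp((Q - Q.max(axis=1, keepdims=True)) / tau)
    policy /= policy.sum(axis=1, keepdims=True)
    return Q, policy
```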
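The Experiment Setup row describes the Q-network only at a high level. The PyTorch sketch below is one plausible reading of that description; the class name, DQN-style kernel sizes, strides, channel counts, and the 512-unit hidden layer are assumptions introduced for illustration and are not stated in the quoted text.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Sketch of the described architecture: a single 84x84 RGB frame passes
    through three convolutional layers and two fully-connected layers,
    outputting one predicted Q-value per action."""

    def __init__(self, n_actions):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),   # 84x84 -> 20x20 (assumed sizes)
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),  # 20x20 -> 9x9
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),  # 9x9 -> 7x7
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),  # assumed hidden width
            nn.Linear(512, n_actions),              # predicted Q-values
        )

    def forward(self, x):
        # x: batch of RGB frames, shape (N, 3, 84, 84), values scaled to [0, 1]
        return self.head(self.conv(x))
```

For example, `QNetwork(n_actions=4)(torch.zeros(1, 3, 84, 84))` yields a `(1, 4)` tensor of predicted Q-values.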