Composing Value Functions in Reinforcement Learning

Authors: Benjamin Van Niekerk, Steven James, Adam Earle, Benjamin Rosman

ICML 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To demonstrate composition, we perform a series of experiments in a high-dimensional video game (Figure 1b). Results show that an agent is able to compose existing policies learned from high-dimensional pixel input to generate new, optimal behaviours.
Researcher Affiliation | Academia | (1) School of Computer Science and Applied Mathematics, University of the Witwatersrand, Johannesburg, South Africa; (2) Council for Scientific and Industrial Research, Pretoria, South Africa.
Pseudocode | Yes | Algorithm 1 Soft Value Iteration, Algorithm 2 Soft Policy Iteration (a generic sketch of soft value iteration is given after this table).
Open Source Code | No | The paper does not provide a link to source code or explicitly state that source code for the methodology is being released.
Open Datasets | No | The paper describes custom tasks within a video game domain developed for the experiments, but does not provide access information for a publicly available dataset. It states: 'We construct a number of different tasks based on the objects that the agent must collect'.
Dataset Splits | No | The paper describes training and evaluation on a custom video game environment (e.g., 'Each network is trained for 1.5m timesteps', 'Returns from 50k episodes'), but does not specify explicit training, validation, or test dataset splits.
Hardware Specification | No | The paper does not mention any specific hardware (such as GPU or CPU models, or cloud computing specifications) used to run the experiments.
Software Dependencies | No | The paper mentions '(soft) deep Q-learning' but does not specify any software names with version numbers (e.g., programming languages, libraries, or frameworks).
Experiment Setup | Yes | Each network is trained for 1.5m timesteps to ensure near-optimal convergence. The input to our network is a single RGB frame of size 84 × 84, which is passed through three convolutional layers and two fully-connected layers before outputting the predicted Q-values for the given state.
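As a concrete illustration of the architecture described in the Experiment Setup row, here is a minimal PyTorch sketch of such a Q-network. The paper does not report kernel sizes, strides, channel counts, or hidden widths; the values below follow the common DQN defaults (Mnih et al., 2015), and the class name PixelQNetwork is hypothetical.

```python
import torch
import torch.nn as nn

class PixelQNetwork(nn.Module):
    """Hypothetical Q-network matching the described setup: a single 84x84 RGB frame,
    three convolutional layers, two fully-connected layers, one Q-value per action."""

    def __init__(self, num_actions: int):
        super().__init__()
        self.features = nn.Sequential(
            # Kernel sizes and strides are assumed (standard DQN); the paper does not state them.
            nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),   # 84x84 -> 20x20
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),  # 20x20 -> 9x9
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),  # 9x9 -> 7x7
            nn.Flatten(),
        )
        self.head = nn.Sequential(
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),   # fully-connected layer 1
            nn.Linear(512, num_actions),             # fully-connected layer 2: Q-values
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: batch of RGB frames, shape (batch, 3, 84, 84), values scaled to [0, 1]
        return self.head(self.features(x))


# Usage example with an assumed 4-action game:
q_values = PixelQNetwork(num_actions=4)(torch.rand(2, 3, 84, 84))
print(q_values.shape)  # torch.Size([2, 4])
```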
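For readers unfamiliar with the soft (entropy-regularized) Bellman backup behind the pseudocode named in the table, the following is a minimal tabular sketch of soft value iteration under an assumed known transition model. It is an illustrative reconstruction, not the paper's exact Algorithm 1; the function name, array shapes, and hyperparameter values are placeholders.

```python
import numpy as np

def soft_value_iteration(P, r, gamma=0.95, tau=1.0, n_iters=1000):
    """Tabular soft value iteration (illustrative sketch).

    P: transition probabilities, shape (S, A, S); r: rewards, shape (S, A).
    Returns the soft-optimal Q (S, A) and V (S,).
    """
    num_states, _ = r.shape
    V = np.zeros(num_states)
    for _ in range(n_iters):
        # Soft Bellman backup: Q(s, a) = r(s, a) + gamma * E_{s' ~ P(.|s, a)}[V(s')]
        Q = r + gamma * (P @ V)
        # Log-sum-exp ("soft max") replaces the hard max of standard value iteration;
        # as tau -> 0 this recovers ordinary value iteration.
        # (scipy.special.logsumexp could be used here for numerical stability.)
        V = tau * np.log(np.sum(np.exp(Q / tau), axis=1))
    return Q, V
```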