Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Composing Value Functions in Reinforcement Learning
Authors: Benjamin Van Niekerk, Steven James, Adam Earle, Benjamin Rosman
ICML 2019 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To demonstrate composition, we perform a series of experiments in a high-dimensional video game (Figure 1b). Results show that an agent is able to compose existing policies learned from high-dimensional pixel input to generate new, optimal behaviours. |
| Researcher Affiliation | Academia | 1School of Computer Science and Applied Mathematics, University of the Witwatersrand, Johannesburg, South Africa 2Council for Scientific and Industrial Research, Pretoria, South Africa. |
| Pseudocode | Yes | Algorithm 1 Soft Value Iteration, Algorithm 2 Soft Policy Iteration |
| Open Source Code | No | The paper does not provide a link to source code or explicitly state that source code for the methodology is being released. |
| Open Datasets | No | The paper describes custom tasks within a video game domain developed for the experiments, but does not provide access information for a publicly available dataset. It states 'We construct a number of different tasks based on the objects that the agent must collect'. |
| Dataset Splits | No | The paper describes training and evaluation on a custom video game environment (e.g., 'Each network is trained for 1.5m timesteps', 'Returns from 50k episodes'), but does not specify explicit training, validation, or test dataset splits. |
| Hardware Specification | No | No specific hardware details (such as GPU or CPU models, or cloud computing specifications) used for running experiments were mentioned. |
| Software Dependencies | No | The paper mentions '(soft) deep Q-learning' but does not specify any software names with version numbers (e.g., programming languages, libraries, or frameworks). |
| Experiment Setup | Yes | Each network is trained for 1.5m timesteps to ensure near-optimal convergence. The input to our network is a single RGB frame of size 84 × 84, which is passed through three convolutional layers and two fully-connected layers before outputting the predicted Q-values for the given state. |
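
The Experiment Setup row gives the input size (a single 84 × 84 RGB frame) and the layer counts (three convolutional, two fully-connected), but not the kernel sizes, strides, or channel widths. The sketch below shows how the spatial size shrinks through three unpadded convolutions and how many features would reach the first fully-connected layer. The layer parameters in `ASSUMED_CONV_LAYERS` are an assumption borrowed from the commonly used DQN architecture, not values reported in the paper, and the function names are hypothetical.

```python
def conv_out(size, kernel, stride):
    """Output width/height of an unpadded convolution on a square input."""
    return (size - kernel) // stride + 1


# ASSUMPTION: (out_channels, kernel, stride) per layer, taken from the
# widely used DQN layout. The summarized paper does not state these values.
ASSUMED_CONV_LAYERS = [(32, 8, 4), (64, 4, 2), (64, 3, 1)]


def flattened_features(input_size=84, layers=ASSUMED_CONV_LAYERS):
    """Number of features fed to the first fully-connected layer."""
    size = input_size
    channels = 3  # RGB input, as stated in the Experiment Setup row
    for channels, kernel, stride in layers:
        size = conv_out(size, kernel, stride)
    return channels * size * size


if __name__ == "__main__":
    print(flattened_features())
```

Under these assumed parameters the spatial size goes 84, 20, 9, 7, so the first fully-connected layer would see 64 × 7 × 7 features. A different choice of kernels or strides changes that number, which is why the missing details matter for reproducing the setup.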