A Distributional Perspective on Reinforcement Learning

Authors: Marc G. Bellemare, Will Dabney, Rémi Munos

ICML 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our algorithm using the suite of games from the Arcade Learning Environment. We obtain both state-of-the-art results and anecdotal evidence demonstrating the importance of the value distribution in approximate reinforcement learning.
Researcher Affiliation | Industry | DeepMind, London, UK. Correspondence to: Marc G. Bellemare <bellemare@google.com>.
Pseudocode | Yes | Algorithm 1 Categorical Algorithm (a hedged code sketch of this projection step follows the table).
Open Source Code | No | The paper mentions "our TensorFlow implementation" and provides a video link, but no explicit statement of open-source code availability or a repository link for the described methodology.
Open Datasets | Yes | We applied the categorical algorithm to games from the Arcade Learning Environment (ALE; Bellemare et al., 2013). While the ALE is deterministic, stochasticity does occur in a number of guises: 1) from state aliasing, 2) learning from a nonstationary policy, and 3) from approximation errors. We used five training games (Fig. 3) and 52 testing games.
Dataset Splits | No | The paper mentions "five training games" and "52 testing games" but does not provide specific numerical splits for train/validation/test sets.
Hardware Specification | No | The paper mentions "our TensorFlow implementation" but does not provide any specific hardware details like GPU/CPU models or memory specifications.
Software Dependencies | No | The paper mentions "our TensorFlow implementation" but does not specify version numbers for TensorFlow or other software dependencies.
Experiment Setup | Yes | For our study, we use the DQN architecture (Mnih et al., 2015), but output the atom probabilities p_i(x, a) instead of action-values, and chose V_MAX = -V_MIN = 10 from preliminary experiments over the training games. We call the resulting architecture Categorical DQN. We replace the squared loss (r + γQ(x', π(x')) - Q(x, a))² by L_{x,a}(θ) and train the network to minimize this loss. As in DQN, we use a simple ϵ-greedy policy over the expected action-values; we leave as future work the many ways in which an agent could select actions on the basis of the full distribution. The rest of our training regime matches Mnih et al.'s, including the use of a target network for θ. ... For this experiment, we set ϵ = 0.05. ... Specifically, we set ϵ = 0.01 (instead of 0.05); furthermore, every 1 million frames, we evaluate our agent's performance with ϵ = 0.001.
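
Below is a minimal NumPy sketch, not the authors' TensorFlow implementation, of the projection step described by the paper's Algorithm 1 (Categorical Algorithm) together with the setup quoted in the Experiment Setup row (V_MAX = -V_MIN = 10, cross-entropy loss L_{x,a}(θ)). The atom count of 51 corresponds to the paper's C51 agent but is not quoted in the table above; all function and variable names are illustrative, not taken from the paper's code.

```python
import numpy as np

# Support of the value distribution: 51 atoms (the paper's "C51" setting,
# assumed here) spanning [V_MIN, V_MAX] with V_MAX = -V_MIN = 10 as quoted
# in the Experiment Setup row.
N_ATOMS = 51
V_MIN, V_MAX = -10.0, 10.0
DELTA_Z = (V_MAX - V_MIN) / (N_ATOMS - 1)
SUPPORT = np.linspace(V_MIN, V_MAX, N_ATOMS)   # atoms z_0, ..., z_{N-1}


def categorical_target(p_next, reward, gamma=0.99, terminal=False):
    """Project the distributional Bellman target onto the fixed support.

    p_next : (num_actions, N_ATOMS) atom probabilities p_i(x', a) at the
             next state, as produced by a (hypothetical) Categorical DQN head.
    Returns the target distribution m over SUPPORT for the sampled (x, a).
    """
    # Greedy next action under the *expected* action-values,
    # Q(x', a) = sum_i z_i p_i(x', a), as in Algorithm 1.
    a_star = int(np.argmax(p_next @ SUPPORT))

    m = np.zeros(N_ATOMS)
    for j in range(N_ATOMS):
        # Bellman-update atom z_j (gamma = 0 on terminal transitions),
        # clipping to [V_MIN, V_MAX].
        tz_j = np.clip(reward + (0.0 if terminal else gamma * SUPPORT[j]),
                       V_MIN, V_MAX)
        b_j = (tz_j - V_MIN) / DELTA_Z          # fractional index in [0, N-1]
        l, u = int(np.floor(b_j)), int(np.ceil(b_j))
        if l == u:
            # b_j landed exactly on an atom; give it all the mass
            # (an implementation detail left implicit in the pseudocode).
            m[l] += p_next[a_star, j]
        else:
            # Distribute the probability of the updated atom to its neighbours.
            m[l] += p_next[a_star, j] * (u - b_j)
            m[u] += p_next[a_star, j] * (b_j - l)
    return m


def cross_entropy_loss(m, p_pred):
    """Per-sample loss L_{x,a}(theta) = -sum_i m_i log p_i(x, a)."""
    return float(-np.sum(m * np.log(p_pred + 1e-12)))
```

At action-selection time, the agent described above acts ϵ-greedily with respect to the expected action-values Q(x, a) = Σ_i z_i p_i(x, a), using the ϵ schedules quoted in the Experiment Setup row (e.g. ϵ = 0.001 for the periodic evaluations every 1 million frames).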