Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Metrics and Continuity in Reinforcement Learning

Authors: Charline Le Lan, Marc G. Bellemare, Pablo Samuel Castro | Pages: 8261-8269

AAAI 2021 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We complement our theoretical results with empirical evaluations showcasing the differences between the metrics considered.
Researcher Affiliation	Collaboration	Charline Le Lan*¹, Marc G. Bellemare², Pablo Samuel Castro²; ¹University of Oxford, ²Google Research, Brain Team; EMAIL, EMAIL
Pseudocode	No	The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code	Yes	The code used to produce all these experiments is open-sourced. Footnote 2: Code available at https://github.com/google-research/google-research/tree/master/rl_metrics_aaai2021
Open Datasets	No	We conduct our experiments on Garnet MDPs, which are a class of randomly generated MDPs (Archibald, McKinnon, and Thomas 1995; Piot, Geist, and Pietquin 2014). Specifically, a Garnet MDP Garnet(n_S, n_A) is parameterized by two values: the number of states n_S and the number of actions n_A, and is generated as follows: 1. The branching factor b_{s,a} of each transition P_s^a is sampled uniformly from [1 : n_S]. 2. b_{s,a} states are picked uniformly randomly from S and assigned a random value in [0, 1]; these values are then normalized to produce a proper distribution P_s^a. 3. Each R_s^a is sampled uniformly in [0, 1].
Dataset Splits	No	The paper describes subsampling of states for evaluation but does not provide specific training/test/validation dataset splits needed for reproducibility in the traditional sense.
Hardware Specification	No	The paper does not provide specific hardware details (exact GPU/CPU models, processor types, or memory amounts) used for running its experiments.
Software Dependencies	No	The paper mentions software packages such as NumPy, TensorFlow, SciPy, Matplotlib, and Gin-Config, but does not provide specific version numbers for them.
Experiment Setup	Yes	We conduct our experiments on Garnet MDPs... Averaged over 100 Garnet MDPs with 200 states and 5 actions, with 50 independent runs for each... For each metric, we perform 10 different aggregations using a k-median algorithm, ranging from one aggregate state to 200 aggregate states. Specifically, given a subsampling fraction f ∈ [0, 1], we sample |S| · f states and call this set κ.
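
The three-step Garnet MDP generation quoted in the Open Datasets row can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' implementation (their code is in the repository linked in the Open Source Code row). The function name `make_garnet`, the `seed` parameter, the nested-list layout, and the choice to pick the next states without replacement are all assumptions made for this sketch.

```python
import random


def make_garnet(n_states, n_actions, seed=None):
    """Sample a Garnet(n_states, n_actions) MDP (hypothetical helper).

    Returns (P, R), where P[a][s][s_next] is a transition probability
    and R[a][s] is a reward in [0, 1].
    """
    rng = random.Random(seed)
    P = [[[0.0] * n_states for _ in range(n_states)] for _ in range(n_actions)]
    R = [[0.0] * n_states for _ in range(n_actions)]
    for a in range(n_actions):
        for s in range(n_states):
            # Step 1: branching factor sampled uniformly from {1, ..., n_states}.
            b = rng.randint(1, n_states)
            # Step 2: pick b next states (assumed distinct), give each a
            # random weight, then normalize into a probability distribution.
            next_states = rng.sample(range(n_states), b)
            weights = [rng.random() for _ in next_states]
            total = sum(weights)
            for s_next, w in zip(next_states, weights):
                P[a][s][s_next] = w / total
            # Step 3: reward sampled uniformly from [0, 1).
            R[a][s] = rng.random()
    return P, R


if __name__ == "__main__":
    # Matches the sizes quoted in the Experiment Setup row.
    P, R = make_garnet(n_states=200, n_actions=5, seed=0)
    print(len(P), len(P[0]), len(P[0][0]))
```

Fixing `seed` makes a sampled MDP repeatable, which is one way to cover the fact that a generated environment, unlike a fixed dataset, has no canonical copy to download.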