Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Risk-Constrained Reinforcement Learning with Percentile Risk Criteria
Authors: Yinlam Chow, Mohammad Ghavamzadeh, Lucas Janson, Marco Pavone
JMLR 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the effectiveness of our algorithms in an optimal stopping problem as well as in a realistic personalized advertisement recommendation (ad recommendation) problem (see Derfer et al. (2007) for more details). For the latter problem, we empirically show that our CVaR-constrained RL algorithms successfully guarantee that the worst-case revenue is lower-bounded by the pre-specified company yearly target. [...] Figures 1 and 2 show the distribution of the discounted cumulative cost Gθ(x0) for the policy θ learned by each of these algorithms. [...] Table 1 summarizes the performance of these algorithms. |
| Researcher Affiliation | Collaboration | Yinlam Chow, DeepMind, Mountain View, CA 94043, USA; Mohammad Ghavamzadeh, DeepMind, Mountain View, CA 94043, USA; Lucas Janson, Department of Statistics, Stanford University, Stanford, CA 94305, USA; Marco Pavone, Aeronautics and Astronautics, Stanford University, Stanford, CA 94305, USA |
| Pseudocode | Yes | Algorithm 1 Trajectory-based Policy Gradient Algorithm for CVaR MDP; Algorithm 2 Actor-Critic Algorithms for CVaR MDP |
| Open Source Code | No | The paper does not contain any explicit statement or link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | No | The optimal stopping problem describes a cost sequence that is 'randomly generated by a Markov chain'. For the ad-recommendation system, it mentions using 'an Adobe personalized ad-recommendation (Theocharous and Hallak, 2013) simulator that has been trained based on real data'. Neither explicitly states public availability of a dataset or provides access information for one. |
| Dataset Splits | No | The paper describes generating trajectories or using a simulator for experiments, but does not mention specific training/test/validation splits for any dataset, as it does not rely on static datasets with predefined partitions. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU types, memory) used to run the experiments. |
| Software Dependencies | No | The paper mentions concepts like 'standard Gaussian radial basis functions (RBFs)' and 'Boltzmann policies' or '3rd order Fourier basis', but does not list specific software libraries or tools with version numbers used for implementation. |
| Experiment Setup | Yes | We set the parameters of the MDP as follows: x0 = [1; 0], ph = 0.1, T = 20, K = 5, γ = 0.95, fu = 2, fd = 0.5, and p = 0.65. The confidence level and constraint threshold are given by α = 0.95 and β = 3. The number of sample trajectories N is set to 500,000 and the parameter bounds are λmax = 5,000 and Θ = [−20, 20]^κ1, where the dimension of the basis functions is κ1 = 1024. [...] We set the parameters of the MDP as T = 15 and γ = 0.98, the confidence level and constraint threshold as α = 0.05 and β = 0.12, the number of sample trajectories N to 1,000,000, and the parameter bounds as λmax = 5,000 and Θ = [−60, 60]^κ1, where the dimension of the basis functions is κ1 = 4096. |
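The setup above quotes confidence levels (α = 0.95, α = 0.05) and large sample-trajectory counts used to evaluate CVaR-constrained policies. As a minimal sketch of the quantity being constrained, the snippet below estimates an empirical CVaR from sampled discounted costs; the function name and the Gaussian stand-in data are illustrative assumptions, not the paper's code:

```python
import numpy as np

def empirical_cvar(costs, alpha):
    """Mean of the worst (1 - alpha) fraction of a cost sample.

    For a cost variable, CVaR_alpha is the expected cost conditional on
    exceeding the alpha-quantile (the value-at-risk, VaR_alpha).
    """
    var = np.quantile(costs, alpha)    # VaR_alpha: the alpha-quantile of cost
    return costs[costs >= var].mean()  # average over the upper (worst) tail

# Stand-in sample for the discounted cumulative cost G_theta(x0); the paper's
# experiments draw such samples from large numbers of trajectory rollouts.
rng = np.random.default_rng(0)
costs = rng.normal(loc=1.0, scale=0.5, size=500_000)

cvar = empirical_cvar(costs, alpha=0.95)  # average of the worst 5% of costs
```

A CVaR constraint of the kind described in the paper would then require this tail average to stay below a pre-specified threshold β.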