Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Risk-Constrained Reinforcement Learning with Percentile Risk Criteria
Authors: Yinlam Chow, Mohammad Ghavamzadeh, Lucas Janson, Marco Pavone
JMLR 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the effectiveness of our algorithms in an optimal stopping problem as well as in a realistic personalized advertisement recommendation (ad recommendation) problem (see Derfer et al. (2007) for more details). For the latter problem, we empirically show that our CVaR-constrained RL algorithms successfully guarantee that the worst-case revenue is lower-bounded by the pre-specified company yearly target. [...] Figures 1 and 2 show the distribution of the discounted cumulative cost Gθ(x0) for the policy θ learned by each of these algorithms. [...] Table 1 summarizes the performance of these algorithms. |
| Researcher Affiliation | Collaboration | Yinlam Chow, DeepMind, Mountain View, CA 94043, USA; Mohammad Ghavamzadeh, DeepMind, Mountain View, CA 94043, USA; Lucas Janson, Department of Statistics, Stanford University, Stanford, CA 94305, USA; Marco Pavone, Aeronautics and Astronautics, Stanford University, Stanford, CA 94305, USA |
| Pseudocode | Yes | Algorithm 1 Trajectory-based Policy Gradient Algorithm for CVaR MDP; Algorithm 2 Actor-Critic Algorithms for CVaR MDP |
| Open Source Code | No | The paper does not contain any explicit statement or link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | No | The optimal stopping problem describes a cost sequence that is 'randomly generated by a Markov chain'. For the ad-recommendation system, it mentions using 'an Adobe personalized ad-recommendation (Theocharous and Hallak, 2013) simulator that has been trained based on real data'. Neither explicitly states public availability of a dataset or provides access information for one. |
| Dataset Splits | No | The paper describes generating trajectories or using a simulator for experiments, but does not mention specific training/test/validation splits for any dataset, as it does not rely on static datasets with predefined partitions. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU types, memory) used to run the experiments. |
| Software Dependencies | No | The paper mentions concepts like 'standard Gaussian radial basis functions (RBFs)' and 'Boltzmann policies' or '3rd order Fourier basis', but does not list specific software libraries or tools with version numbers used for implementation. |
| Experiment Setup | Yes | We set the parameters of the MDP as follows: x0 = [1; 0], ph = 0.1, T = 20, K = 5, γ = 0.95, fu = 2, fd = 0.5, and p = 0.65. The confidence level and constraint threshold are given by α = 0.95 and β = 3. The number of sample trajectories N is set to 500,000 and the parameter bounds are λmax = 5,000 and Θ = [−20, 20]^κ1, where the dimension of the basis functions is κ1 = 1024. [...] We set the parameters of the MDP as T = 15 and γ = 0.98, the confidence level and constraint threshold as α = 0.05 and β = 0.12, the number of sample trajectories N to 1,000,000, and the parameter bounds as λmax = 5,000 and Θ = [−60, 60]^κ1, where the dimension of the basis functions is κ1 = 4096. |
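The setup above quotes confidence levels (α = 0.95, α = 0.05) and large sample-trajectory counts used to evaluate CVaR-constrained policies. As a minimal sketch of the quantity being constrained, the snippet below estimates an empirical CVaR from sampled discounted costs; the function name and the Gaussian stand-in data are illustrative assumptions, not the paper's code:

```python
import numpy as np

def empirical_cvar(costs, alpha):
    """Mean of the worst (1 - alpha) fraction of a cost sample.

    For a cost variable, CVaR_alpha is the expected cost conditional on
    exceeding the alpha-quantile (the value-at-risk, VaR_alpha).
    """
    var = np.quantile(costs, alpha)    # VaR_alpha: the alpha-quantile of cost
    return costs[costs >= var].mean()  # average over the upper (worst) tail

# Stand-in sample for the discounted cumulative cost G_theta(x0); the paper's
# experiments draw such samples from large numbers of trajectory rollouts.
rng = np.random.default_rng(0)
costs = rng.normal(loc=1.0, scale=0.5, size=500_000)

cvar = empirical_cvar(costs, alpha=0.95)  # average of the worst 5% of costs
```

A CVaR constraint of the kind described in the paper would then require this tail average to stay below a pre-specified threshold β.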