Notice: The reproducibility variables underlying each score are classified by an automated LLM-based pipeline that has been validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias, so scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Bridging Distributional and Risk-sensitive Reinforcement Learning with Provable Regret Bounds
Authors: Hao Liang, Zhi-Quan Luo
JMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To validate the empirical performance of our algorithms, we conducted numerical experiments comparing RODI-MB, RODI-MF, and RODI-Rep with the risk-neutral algorithm UCBVI (Azar et al., 2017), RSVI (Fei et al., 2020), and RSVI2 (Fei et al., 2021). The experimental setup involved an MDP with S = 5 states, A = 5 actions, and a horizon H = 5, mirroring the setup in Du et al. (2022). The results, as illustrated in Figure 1, demonstrate the regret ranking of these algorithms. |
| Researcher Affiliation | Academia | Hao Liang EMAIL School of Science and Engineering The Chinese University of Hong Kong, Shenzhen. Zhi-Quan Luo EMAIL School of Science and Engineering The Chinese University of Hong Kong, Shenzhen. |
| Pseudocode | Yes | Algorithm 1 RODI-MF. Algorithm 2 RODI-MB. Algorithm 3 ROVI. |
| Open Source Code | No | The paper does not contain any explicit statement about providing open-source code for the described methodology, nor does it include any links to code repositories. |
| Open Datasets | No | The experimental setup involved an MDP with S = 5 states, A = 5 actions, and a horizon H = 5, mirroring the setup in Du et al. (2022). The paper describes a synthetic MDP environment for experiments and does not use or provide access to any external datasets. |
| Dataset Splits | No | The paper describes a synthetic MDP environment with specific parameters (S=5 states, A=5 actions, H=5 horizon) rather than using an external dataset. Therefore, the concept of training/test/validation splits is not applicable, and no such splits are provided. |
| Hardware Specification | No | The paper mentions numerical experiments but does not provide specific details about the hardware (e.g., CPU, GPU models, memory) used to run these experiments. |
| Software Dependencies | No | The paper does not specify any software dependencies with version numbers (e.g., programming languages, libraries, frameworks, or solvers) used for the implementation or experiments. |
| Experiment Setup | Yes | The experimental setup involved an MDP with S = 5 states, A = 5 actions, and a horizon H = 5, mirroring the setup in Du et al. (2022). We set δ = 0.005 and β = 1.1. |
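The Experiment Setup row gives only the problem size and two constants. As a minimal sketch of the kind of computation such a setup involves, the Python below builds a hypothetical random tabular MDP with S = 5, A = 5, H = 5 and solves it exactly by backward induction with the exponential Bellman equation. This assumes that β = 1.1 is the parameter of the entropic risk measure (1/β) log E[exp(β · return)]. The generator `make_random_mdp`, its seed, and the reward range [0, 1) are illustrative assumptions, because the source does not describe how the Du et al. (2022) environment is built. This is a known-model planning routine, not RODI or any learning algorithm evaluated in the paper. δ = 0.005, presumably the failure probability in a regret bound, plays no role in exact planning and is omitted.

```python
import math
import random


def make_random_mdp(S, A, seed=0):
    """Hypothetical generator (assumption): random kernel, rewards in [0, 1).

    Not the construction used in the paper, which follows Du et al. (2022).
    """
    rng = random.Random(seed)
    P, R = [], []  # P[s][a]: next-state distribution; R[s][a]: deterministic reward
    for _ in range(S):
        P_s, R_s = [], []
        for _ in range(A):
            w = [rng.random() for _ in range(S)]
            total = sum(w)
            P_s.append([x / total for x in w])
            R_s.append(rng.random())
        P.append(P_s)
        R.append(R_s)
    return P, R


def entropic_value_iteration(P, R, H, beta):
    """Exact backward induction for (1/beta) * log E[exp(beta * return)].

    Under this convention beta > 0 favors high returns (risk-seeking) and
    beta < 0 penalizes variability (risk-averse). Returns V[h][s] for
    h = 0..H (with V[H] = 0) and a greedy deterministic policy pi[h][s].
    """
    S, A = len(P), len(P[0])
    V = [[0.0] * S for _ in range(H + 1)]
    pi = [[0] * S for _ in range(H)]
    for h in range(H - 1, -1, -1):
        for s in range(S):
            best_q, best_a = -math.inf, 0
            for a in range(A):
                # Certainty equivalent of the next-step value under P[s][a].
                m = sum(p * math.exp(beta * v) for p, v in zip(P[s][a], V[h + 1]))
                q = R[s][a] + math.log(m) / beta
                if q > best_q:
                    best_q, best_a = q, a
            V[h][s], pi[h][s] = best_q, best_a
    return V, pi


def risk_neutral_value_iteration(P, R, H):
    """Standard finite-horizon value iteration on the expected return."""
    S, A = len(P), len(P[0])
    V = [[0.0] * S for _ in range(H + 1)]
    for h in range(H - 1, -1, -1):
        for s in range(S):
            V[h][s] = max(
                R[s][a] + sum(p * v for p, v in zip(P[s][a], V[h + 1]))
                for a in range(A)
            )
    return V


if __name__ == "__main__":
    P, R = make_random_mdp(S=5, A=5, seed=0)
    V_risk, pi_risk = entropic_value_iteration(P, R, H=5, beta=1.1)
    V_neutral = risk_neutral_value_iteration(P, R, H=5)
    for s in range(5):
        print(s, round(V_risk[0][s], 4), round(V_neutral[0][s], 4), pi_risk[0][s])
```

With β = 1.1 the entropic values are never below the risk-neutral optimum, since for β > 0 the certainty equivalent of any return is at least its mean by Jensen's inequality; a negative β gives values no larger than the risk-neutral optimum, and β close to 0 recovers it.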