Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Bridging Distributional and Risk-sensitive Reinforcement Learning with Provable Regret Bounds
Authors: Hao Liang, Zhi-Quan Luo
JMLR 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To validate the empirical performance of our algorithms, we conducted numerical experiments comparing RODI-MB, RODI-MF, and RODI-Rep with the risk-neutral algorithm UCBVI (Azar et al., 2017), RSVI in Fei et al. (2020), and RSVI2 in Fei et al. (2021). The experimental setup involved an MDP with S = 5 states, A = 5 actions, and a horizon H = 5, mirroring the setup in Du et al. (2022). The results, as illustrated in Figure 1, demonstrate the regret ranking of these algorithms. |
| Researcher Affiliation | Academia | Hao Liang EMAIL School of Science and Engineering The Chinese University of Hong Kong, Shenzhen. Zhi-Quan Luo EMAIL School of Science and Engineering The Chinese University of Hong Kong, Shenzhen. |
| Pseudocode | Yes | Algorithm 1 RODI-MF. Algorithm 2 RODI-MB. Algorithm 3 ROVI. |
| Open Source Code | No | The paper does not contain any explicit statement about providing open-source code for the described methodology, nor does it include any links to code repositories. |
| Open Datasets | No | The experimental setup involved an MDP with S = 5 states, A = 5 actions, and a horizon H = 5, mirroring the setup in Du et al. (2022). The paper describes a synthetic MDP environment for experiments and does not use or provide access to any external datasets. |
| Dataset Splits | No | The paper describes a synthetic MDP environment with specific parameters (S=5 states, A=5 actions, H=5 horizon) rather than using an external dataset. Therefore, the concept of training/test/validation splits is not applicable, and no such splits are provided. |
| Hardware Specification | No | The paper mentions numerical experiments but does not provide specific details about the hardware (e.g., CPU, GPU models, memory) used to run these experiments. |
| Software Dependencies | No | The paper does not specify any software dependencies with version numbers (e.g., programming languages, libraries, frameworks, or solvers) used for the implementation or experiments. |
| Experiment Setup | Yes | The experimental setup involved an MDP with S = 5 states, A = 5 actions, and a horizon H = 5, mirroring the setup in Du et al. (2022). We set δ = 0.005 and β = 1.1. |
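The Experiment Setup row only fixes the problem sizes (S = 5 states, A = 5 actions, horizon H = 5) and the constants δ = 0.005 and β = 1.1. A minimal sketch of a tabular episodic MDP with those dimensions might look like the following; the transition probabilities, rewards, and policy here are random placeholders, not the paper's actual environment from Du et al. (2022).

```python
import numpy as np

# Sketch of a tabular episodic MDP matching the reported sizes
# (S = 5 states, A = 5 actions, horizon H = 5). The dynamics and
# rewards are illustrative placeholders; the paper does not specify them.
S, A, H = 5, 5, 5
rng = np.random.default_rng(0)

# P[h, s, a] is a probability distribution over next states;
# R[h, s, a] is an immediate reward in [0, 1].
P = rng.dirichlet(np.ones(S), size=(H, S, A))
R = rng.uniform(0.0, 1.0, size=(H, S, A))

def run_episode(policy):
    """Roll out one episode under a deterministic policy policy[h, s] -> action."""
    s, total = 0, 0.0
    for h in range(H):
        a = policy[h, s]
        total += R[h, s, a]
        s = rng.choice(S, p=P[h, s, a])
    return total

# A trivial fixed policy (always take action 0) as a usage example.
fixed_policy = np.zeros((H, S), dtype=int)
episode_return = run_episode(fixed_policy)
```

Any of the compared algorithms (RODI-MB, RODI-MF, RODI-Rep, UCBVI, RSVI, RSVI2) would interact with such an environment over repeated episodes, with regret measured against the optimal (here, risk-sensitive) return.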