Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Bridging Distributional and Risk-sensitive Reinforcement Learning with Provable Regret Bounds

Authors: Hao Liang, Zhi-Quan Luo

JMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To validate the empirical performance of our algorithms, we conducted numerical experiments comparing RODI-MB, RODI-MF, and RODI-Rep with the risk-neutral algorithm UCBVI (Azar et al., 2017), RSVI in Fei et al. (2020), and RSVI2 in Fei et al. (2021). The experimental setup involved an MDP with S = 5 states, A = 5 actions, and a horizon H = 5, mirroring the setup in Du et al. (2022). The results, as illustrated in Figure 1, demonstrate the regret ranking of these algorithms.
Researcher Affiliation | Academia | Hao Liang EMAIL School of Science and Engineering The Chinese University of Hong Kong, Shenzhen. Zhi-Quan Luo EMAIL School of Science and Engineering The Chinese University of Hong Kong, Shenzhen.
Pseudocode | Yes | Algorithm 1 RODI-MF. Algorithm 2 RODI-MB. Algorithm 3 ROVI.
Open Source Code | No | The paper does not contain any explicit statement about providing open-source code for the described methodology, nor does it include any links to code repositories.
Open Datasets | No | The experimental setup involved an MDP with S = 5 states, A = 5 actions, and a horizon H = 5, mirroring the setup in Du et al. (2022). The paper describes a synthetic MDP environment for experiments and does not use or provide access to any external datasets.
Dataset Splits | No | The paper describes a synthetic MDP environment with specific parameters (S = 5 states, A = 5 actions, H = 5 horizon) rather than using an external dataset. Therefore, the concept of training/test/validation splits is not applicable, and no such splits are provided.
Hardware Specification | No | The paper mentions numerical experiments but does not provide specific details about the hardware (e.g., CPU, GPU models, memory) used to run these experiments.
Software Dependencies | No | The paper does not specify any software dependencies with version numbers (e.g., programming languages, libraries, frameworks, or solvers) used for the implementation or experiments.
Experiment Setup | Yes | The experimental setup involved an MDP with S = 5 states, A = 5 actions, and a horizon H = 5, mirroring the setup in Du et al. (2022). We set δ = 0.005 and β = 1.1.
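
Since no code is released, the stated setup (S = 5, A = 5, H = 5, δ = 0.005, β = 1.1) may be easier to picture with a small sketch. The Python below is only an illustration under loud assumptions: the random transition kernel and rewards in `make_random_mdp` are hypothetical and are not the environment of Du et al. (2022), β is assumed to be the risk parameter of an entropic risk measure, and the planner assumes a known model, so it is not RODI-MF, RODI-MB, or any other algorithm from the paper. All function names are invented for this example.

```python
import math
import random

# Values stated in the Experiment Setup row.
S, A, H = 5, 5, 5
DELTA = 0.005  # confidence parameter; unused in this planning-only sketch
BETA = 1.1     # ASSUMPTION: read here as the entropic risk parameter


def make_random_mdp(seed=0):
    """Hypothetical environment: random transitions, rewards in [0, 1]."""
    rng = random.Random(seed)
    P = []  # P[s][a] is a probability vector over next states
    R = []  # R[s][a] is a deterministic reward in [0, 1]
    for _ in range(S):
        p_row, r_row = [], []
        for _ in range(A):
            weights = [rng.random() + 1e-9 for _ in range(S)]
            total = sum(weights)
            p_row.append([w / total for w in weights])
            r_row.append(rng.random())
        P.append(p_row)
        R.append(r_row)
    return P, R


def entropic_risk(values, probs, beta):
    """(1 / beta) * log E[exp(beta * X)] for a discrete random variable X."""
    shift = max(beta * v for v in values)  # log-sum-exp shift for stability
    total = sum(p * math.exp(beta * v - shift) for v, p in zip(values, probs))
    return (shift + math.log(total)) / beta


def risk_sensitive_value_iteration(P, R, beta):
    """Backward induction with a known model; returns V[h][s] and a greedy policy."""
    V = [[0.0] * S for _ in range(H + 1)]
    pi = [[0] * S for _ in range(H)]
    for h in reversed(range(H)):
        for s in range(S):
            # Deterministic reward plus entropic risk of the next-step value.
            q = [R[s][a] + entropic_risk(V[h + 1], P[s][a], beta) for a in range(A)]
            best = max(range(A), key=lambda a: q[a])
            pi[h][s] = best
            V[h][s] = q[best]
    return V, pi


if __name__ == "__main__":
    P, R = make_random_mdp(seed=0)
    V, pi = risk_sensitive_value_iteration(P, R, BETA)
    print("risk-sensitive values at step 0:", [round(v, 3) for v in V[0]])
```

A regret experiment of the kind summarized above would compare each learner's per-episode value against a benchmark like `V[0]`; the learners themselves would have to be implemented from the paper's pseudocode.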