RA-PbRL: Provably Efficient Risk-Aware Preference-Based Reinforcement Learning

Authors: Yujie Zhao, Jose Aguilar Escamilla, Weyl Lu, Huazheng Wang

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Additionally, we provide a theoretical analysis of the regret upper bounds, demonstrating that they are sublinear with respect to the number of episodes, and present empirical results to support our findings."
Researcher Affiliation | Academia | Yujie Zhao1, Jose Efraim Aguilar Escamilla2, Weyl Lu3, Huazheng Wang2; 1 University of California, San Diego; 2 Oregon State University; 3 University of California, Davis; yuz285@ucsd.edu, aguijose@oregonstate.edu, adslu@ucdavis.edu, huazheng.wang@oregonstate.edu
Pseudocode | Yes | "RA-PbRL is formally described in Algorithm 1. The development of RA-PbRL is primarily inspired by the PbOP algorithm, as delineated in Chen et al. [2022], which was originally proposed for risk-neutral PbRL environments." (See the illustrative sketch below the table.)
Open Source Code | Yes | "Our code is available at https://github.com/aguilarjose11/PbRLNeurips."
Open Datasets | No | The paper describes experimental settings involving a 'straightforward tabular MDP' configured by the authors and MuJoCo's Half-cheetah simulation. While MuJoCo is a known environment, the paper does not specify a publicly available dataset with concrete access information (URL, DOI, or specific citation) that would reproduce the exact data used for training; instead, the data is generated within these environments. (A data-generation sketch follows the table.)
Dataset Splits | No | The paper does not specify training, validation, and test splits for any dataset. It mentions '50 independent trials' or '100 episodes' for evaluation, but not data splits for model training/validation.
Hardware Specification | No | The paper does not provide specific details about the hardware used, such as GPU/CPU models, memory specifications, or types of computing instances. It mentions running simulations (e.g., MuJoCo's Half-cheetah) but not the machines on which these simulations were executed.
Software Dependencies | No | The paper does not provide a reproducible description of ancillary software with specific version numbers (e.g., Python, PyTorch, TensorFlow, or specific library versions).
Experiment Setup | Yes | "In our experimental framework, we configure a straightforward tabular MDP characterized by finite steps H = 6, finite actions A = 3, state space S = 4, and risk levels α ∈ {0.05, 0.10, 0.20, 0.40}. For each configuration and algorithm, we perform 50 independent trials and report the mean regret across these trials, along with 95% confidence intervals." (A protocol sketch follows below.)
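
For context on the Pseudocode entry: the sketch below illustrates, in a generic way, two ingredients that a risk-aware preference-based RL method combines, a Bradley-Terry-style preference model over trajectory returns and a quantile-based risk measure (CVaR is used here as a stand-in for the paper's risk objectives). It is a minimal illustration under those assumptions, not the paper's Algorithm 1 or the PbOP procedure.

```python
# Minimal sketch (assumptions noted above): Bradley-Terry preference model
# plus a static CVaR risk objective. Not the paper's Algorithm 1.
import numpy as np

def preference_prob(return_a, return_b):
    """Bradley-Terry probability that trajectory A is preferred to B."""
    return 1.0 / (1.0 + np.exp(-(return_a - return_b)))

def cvar(returns, alpha):
    """Average of the worst alpha-fraction of returns (static CVaR)."""
    srt = np.sort(np.asarray(returns, dtype=float))
    k = max(1, int(np.ceil(alpha * len(srt))))
    return srt[:k].mean()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    returns_risky = rng.normal(loc=1.0, scale=2.0, size=1000)  # high-variance policy
    returns_safe = rng.normal(loc=0.8, scale=0.1, size=1000)   # low-variance policy
    # A risk-neutral objective prefers the risky policy; CVaR at alpha=0.1 does not.
    print("mean objective:", returns_risky.mean(), returns_safe.mean())
    print("CVaR_0.1 objective:", cvar(returns_risky, 0.1), cvar(returns_safe, 0.1))
    print("P(risky preferred):", preference_prob(returns_risky.mean(), returns_safe.mean()))
```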
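For context on the Open Datasets entry: since the training data is produced by interacting with the environments rather than loaded from a fixed dataset, one plausible (but assumed) way to generate a preference-labeled trajectory pair in Gymnasium's HalfCheetah is shown below. The environment id, the random placeholder policy, and the return-based synthetic labeling rule are illustrative assumptions, not details taken from the paper.

```python
# Hedged sketch: generate two rollouts in HalfCheetah and label a synthetic
# preference from their returns. Environment id, policy, and labeling rule
# are assumptions for illustration only.
import gymnasium as gym

def rollout(env, horizon=200, seed=None):
    """Roll out a placeholder random policy and return (transitions, return)."""
    obs, _ = env.reset(seed=seed)
    total_reward, transitions = 0.0, []
    for _ in range(horizon):
        action = env.action_space.sample()  # placeholder policy
        obs, reward, terminated, truncated, _ = env.step(action)
        transitions.append((obs, action, reward))
        total_reward += reward
        if terminated or truncated:
            break
    return transitions, total_reward

env = gym.make("HalfCheetah-v4")
traj_a, ret_a = rollout(env, seed=0)
traj_b, ret_b = rollout(env, seed=1)
preference = int(ret_b > ret_a)  # synthetic label: prefer the higher return
print(f"returns: {ret_a:.1f} vs {ret_b:.1f}, preferred trajectory: {preference}")
```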
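For context on the Experiment Setup entry: the sketch below mirrors the reported protocol (H = 6, A = 3, S = 4, α ∈ {0.05, 0.10, 0.20, 0.40}, 50 independent trials, mean regret with 95% confidence intervals) but fills the per-trial regret curves with random placeholders, since only the aggregation step is being illustrated, not the RA-PbRL learner itself.

```python
# Hedged sketch of the reported protocol: placeholder regret curves,
# aggregated as mean final regret with a 95% confidence interval.
import numpy as np

H, A, S = 6, 3, 4                    # horizon, actions, states (from the paper)
ALPHAS = [0.05, 0.10, 0.20, 0.40]    # risk levels (from the paper)
N_TRIALS, N_EPISODES = 50, 100       # trials and episodes (from the report)

rng = np.random.default_rng(0)
for alpha in ALPHAS:
    # Placeholder per-trial cumulative regret; a real run would come from
    # executing RA-PbRL (or a baseline) on the tabular MDP.
    regret = np.cumsum(rng.exponential(scale=1.0 / alpha,
                                       size=(N_TRIALS, N_EPISODES)), axis=1)
    mean = regret[:, -1].mean()
    half_width = 1.96 * regret[:, -1].std(ddof=1) / np.sqrt(N_TRIALS)
    print(f"alpha={alpha:.2f}: final regret {mean:.1f} +/- {half_width:.1f} (95% CI)")
```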