RA-PbRL: Provably Efficient Risk-Aware Preference-Based Reinforcement Learning
Authors: Yujie Zhao, Jose Aguilar Escamilla, Weyl Lu, Huazheng Wang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Additionally, we provide a theoretical analysis of the regret upper bounds, demonstrating that they are sublinear with respect to the number of episodes, and present empirical results to support our findings. |
| Researcher Affiliation | Academia | Yujie Zhao1, Jose Efraim Aguilar Escamilla2, Weyl Lu3, Huazheng Wang2 1 University of California, San Diego, 2 Oregon State University, 3 University of California, Davis yuz285@ucsd.edu, aguijose@oregonstate.edu, adslu@ucdavis.edu, huazheng.wang@oregonstate.edu |
| Pseudocode | Yes | RA-PbRL is formally described in Algorithm 1. The development of RA-PbRL is primarily inspired by the PbOP algorithm, as delineated in Chen et al. [2022], which was originally proposed for risk-neutral PbRL environments. |
| Open Source Code | Yes | Our code is available at https://github.com/aguilarjose11/PbRLNeurips. |
| Open Datasets | No | The paper describes experimental settings involving a 'straightforward tabular MDP' configured by the authors and MuJoCo's Half-Cheetah simulation. While MuJoCo is a known environment, the paper does not specify the use of a publicly available dataset with concrete access information (URL, DOI, or specific citation) that would reproduce the exact data used for training. Instead, the data is generated within these environments. |
| Dataset Splits | No | The paper does not specify training, validation, and test splits for any dataset. It mentions '50 independent trials' or '100 episodes' for evaluation, but not data splits for model training/validation. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used, such as GPU/CPU models, memory specifications, or types of computing instances. It mentions running simulations (e.g., MuJoCo's Half-cheetah) but not the machines on which these simulations were executed. |
| Software Dependencies | No | The paper does not provide a reproducible description of ancillary software with specific version numbers (e.g., Python, PyTorch, TensorFlow, or specific library versions). |
| Experiment Setup | Yes | In our experimental framework, we configure a straightforward tabular MDP characterized by finite steps H = 6, finite actions A = 3, state space S = 4, and risk levels α ∈ {0.05, 0.10, 0.20, 0.40}. For each configuration and algorithm, we perform 50 independent trials and report the mean regret across these trials, along with 95% confidence intervals. (Hedged sketches of the risk measure and of this setup appear below the table.) |
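The risk levels α quoted above parameterize a risk measure over episode returns. Conditional value-at-risk (CVaR) is one standard such measure in risk-aware RL and is used here purely as an illustrative assumption; the paper's exact risk objective may differ. The sketch below shows how an empirical CVaR at level α would be computed from a batch of returns; `empirical_cvar` and the simulated returns are hypothetical.

```python
import numpy as np

def empirical_cvar(returns, alpha):
    """Empirical CVaR_alpha: mean of the worst alpha-fraction of returns.

    CVaR_1 recovers the plain (risk-neutral) mean; smaller alpha
    focuses the objective on the worst-case tail.
    """
    returns = np.sort(np.asarray(returns))          # ascending: worst returns first
    k = max(1, int(np.ceil(alpha * len(returns))))  # size of the alpha-tail
    return returns[:k].mean()

# Illustrative example: returns from 1000 hypothetical episodes.
rng = np.random.default_rng(1)
rets = rng.normal(loc=1.0, scale=0.5, size=1000)
for alpha in [0.05, 0.10, 0.20, 0.40]:
    print(f"CVaR_{alpha:.2f} = {empirical_cvar(rets, alpha):.3f}")
```

Smaller α averages fewer (worse) episodes, so CVaR decreases as α shrinks, which is why regret behavior can differ sharply across the four risk levels the paper evaluates.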
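The quoted setup (H = 6, A = 3, S = 4, α ∈ {0.05, 0.10, 0.20, 0.40}, 50 trials, 95% confidence intervals) is concrete enough to sketch the evaluation harness. The following is a minimal sketch under stated assumptions: the random tabular MDP construction and the `sample_tabular_mdp` / `run_trial` placeholders are illustrative inventions, not the paper's Algorithm 1; only the dimensions, risk levels, trial count, and confidence-interval reporting come from the paper's description.

```python
import numpy as np

H, S, A = 6, 4, 3                      # horizon, states, actions (from the paper)
ALPHAS = [0.05, 0.10, 0.20, 0.40]      # risk levels alpha (from the paper)
N_TRIALS = 50                          # independent trials per configuration

rng = np.random.default_rng(0)

def sample_tabular_mdp():
    """Sample a random tabular MDP (an assumption; the paper's MDP is unspecified)."""
    P = rng.dirichlet(np.ones(S), size=(H, S, A))   # P[h, s, a] is a distribution over next states
    R = rng.uniform(0.0, 1.0, size=(H, S, A))       # per-step rewards in [0, 1]
    return P, R

def run_trial(P, R, alpha, n_episodes=100):
    """Placeholder for one learning run; returns per-episode regret.

    A real run would execute a risk-aware PbRL learner against the risk
    objective at level `alpha`. Here we return a dummy decaying sequence
    so the aggregation code below is runnable end to end.
    """
    t = np.arange(1, n_episodes + 1)
    return 1.0 / np.sqrt(t) + 0.01 * rng.standard_normal(n_episodes)

for alpha in ALPHAS:
    final_regrets = []
    for _ in range(N_TRIALS):
        P, R = sample_tabular_mdp()
        regret = run_trial(P, R, alpha)
        final_regrets.append(np.cumsum(regret)[-1])   # cumulative regret at episode 100
    mean = np.mean(final_regrets)
    # 95% normal-approximation confidence interval over the 50 trials
    half_width = 1.96 * np.std(final_regrets, ddof=1) / np.sqrt(N_TRIALS)
    print(f"alpha={alpha:.2f}: mean cumulative regret {mean:.3f} ± {half_width:.3f}")
```

The per-α loop mirrors the reported protocol: 50 independent trials per configuration, with the mean regret and a 95% confidence interval aggregated across trials.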