Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
STAR: Efficient Preference-based Reinforcement Learning via Dual Regularization
Authors: Fengshuo Bai, Rui Zhao, Hongming Zhang, Sijia Cui, Shao Zhang, Bo Xu, Lei Han, Ying Wen, Yaodong Yang
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that STAR improves feedback efficiency, achieving 34.8% higher performance in online settings and 29.7% in offline settings compared to state-of-the-art methods. Ablation studies confirm that STAR facilitates more robust reward and value function learning. |
| Researcher Affiliation | Collaboration | 1 Shanghai Jiao Tong University; 2 PKU-Psi Bot Joint Lab; 3 Zhongguancun Academy; 4 Tencent; 5 National Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institution of Automation, Chinese Academy of Sciences; 6 Institute for AI, Peking University |
| Pseudocode | Yes | The detailed procedures for both the online and offline settings are outlined in Algorithms 1 and 2 in Appendix A. Algorithm 1 STAR (Online) ... Algorithm 2 STAR (Offline) |
| Open Source Code | Yes | The videos of this project are released at https://sites.google.com/view/pbrl-star. ... We include source code in supplementary material. |
| Open Datasets | Yes | For the offline setting, we include eight challenging control tasks from D4RL [7] and four robotic manipulation tasks from Robosuite [74]. ... For offline experiments, we use real human preference data from Kim et al. [21]. |
| Dataset Splits | No | The number of preference pairs varies by task complexity: 100 pairs for tasks such as Cheetah Run, Button Press, and Window Open; 500 pairs for tasks like Hopper-medium-replay-v2; and up to 4000 pairs for more complex tasks like Sweep Into. The number of preference labels used per task is detailed in Appendix C.2. |
| Hardware Specification | No | Each run is conducted using the exact same hardware environment. |
| Software Dependencies | No | In our experiments, we follow the basic setup employed by prior work [24, 37, 28], which includes unsupervised exploration and an uncertainty-based trajectory sampling strategy. ... The actor in SAC consists of two layers with 1024 hidden units. ... The detailed neural network parameters and hyperparameters for SAC are shown in Table 8a. Additionally, Table 8b presents the distinct hyperparameters for PEBBLE and STAR. |
| Experiment Setup | Yes | C.3 Architecture and hyperparameters. In this section, we describe the architecture of the neural networks used in the SAC algorithm, which serves as the baseline method. ... The detailed neural network parameters and hyperparameters for SAC are shown in Table 8a. Additionally, Table 8b presents the distinct hyperparameters for PEBBLE and STAR. |