Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Pref-GUIDE: Continual Policy Learning from Real-Time Human Feedback via Preference-Based Learning
Authors: Zhengran Ji, Boyuan Chen
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate Pref-GUIDE across three challenging visual RL environments (Zhang et al., 2024a;b), where agents must act based on partial visual observations. Our results show that Pref-GUIDE Individual outperforms regression-based baselines when feedback quality is high, and Pref-GUIDE Voting maintains strong performance across diverse user inputs. In more complex tasks, our method even surpasses policies trained with expert-designed dense rewards, demonstrating the power of structured and population preference learning from real-time human feedback. Extensive ablation studies further validate our design choices and provide insight into their contributions. |
| Researcher Affiliation | Academia | Zhengran Ji EMAIL Duke University Boyuan Chen EMAIL Duke University |
| Pseudocode | Yes | The algorithm of Pref-GUIDE Individual is summarized in Algorithm 1. The algorithm of Pref-GUIDE Voting is summarized in Algorithm 2. |
| Open Source Code | No | The paper provides a "Project Website: http://generalroboticslab.com/Pref-GUIDE" which is a project demonstration page. It does not explicitly state that the source code for the methodology described in this paper is openly available through this link or any other repository, nor does it provide a direct link to a code repository. |
| Open Datasets | Yes | Our dataset Dreal-time comes from GUIDE (Zhang et al., 2024b), which contains interactions from 50 human evaluators across three challenging visual RL environments: Bowling, Find Treasure, and Hide and Seek 1v1. |
| Dataset Splits | No | The paper describes the duration of the human-in-the-loop phase (5 minutes for Bowling, 10 minutes for Find Treasure and Hide and Seek 1v1) and the post-human-guidance phase (15 minutes for Bowling, 50 minutes for the other two tasks). However, it does not specify explicit training/test/validation dataset splits (e.g., percentages or sample counts) for the collected data. |
| Hardware Specification | Yes | We conducted the experiments using NVIDIA A100 and NVIDIA RTX A6000. However, our experiment can be run on a single desktop with a single GPU, such as NVIDIA RTX 4070. |
| Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers, such as programming languages, libraries, or solvers. It only mentions the base reinforcement learning algorithm DDPG (Lillicrap et al., 2015) as the RL backbone, but no version. |
| Experiment Setup | Yes | Each run consisted of a real-time human-in-the-loop phase, followed by a post-human-guidance phase for continual learning without human input. The human-in-the-loop phase lasted 5 minutes for Bowling and 10 minutes for Find Treasure and Hide and Seek 1v1. After this, agents continued to train using learned reward models for 15 minutes for Bowling and 50 minutes for the other two tasks. To accurately track policy performance over time, we saved the policies at regular intervals for performance evaluation: every 1 minute for Bowling, and every 2 minutes for the other two tasks. We define a window Wi containing n consecutive trajectory-feedback pairs. In practice, we set n = 10, which corresponds to 5 seconds of human guidance in our environments. We introduce a no-preference threshold δ, set to 5% of the total feedback range. |
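The experiment-setup excerpt describes how continuous human feedback is converted into preference data: a window of n = 10 consecutive trajectory-feedback pairs, with a no-preference threshold δ set to 5% of the total feedback range. The sketch below illustrates only that windowing-and-thresholding step under those stated values; the function name, data layout, and the choice of 0.5 as a no-preference label are illustrative assumptions, not the paper's actual Pref-GUIDE implementation.

```python
from itertools import combinations

def label_preferences(window, feedback_range, delta_frac=0.05):
    """Turn scalar feedback in a window of trajectory-feedback pairs
    into pairwise preference labels.

    window: list of (trajectory, feedback) tuples; the paper uses
            n = 10 pairs, corresponding to ~5 s of human guidance.
    feedback_range: total span of the feedback signal.
    Returns (traj_a, traj_b, label) triples: 1.0 if a is preferred,
    0.0 if b is preferred, 0.5 if the gap is below the threshold.
    """
    delta = delta_frac * feedback_range  # no-preference threshold (5% of range)
    labels = []
    for (ta, fa), (tb, fb) in combinations(window, 2):
        if abs(fa - fb) < delta:
            labels.append((ta, tb, 0.5))  # difference too small to rank
        elif fa > fb:
            labels.append((ta, tb, 1.0))
        else:
            labels.append((ta, tb, 0.0))
    return labels
```

For example, with feedback values 0.9, 0.1, and 0.88 on a unit range, the first and third trajectories fall within δ = 0.05 of each other and receive the no-preference label, while the other two pairs are ranked by raw feedback.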