Meta-Reward-Net: Implicitly Differentiable Reward Learning for Preference-based Reinforcement Learning
Authors: Runze Liu, Fengshuo Bai, Yali Du, Yaodong Yang
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments on robotic simulated manipulation tasks and locomotion tasks demonstrate that MRN outperforms prior methods in the case of few preference labels and significantly improves data efficiency, achieving state-of-the-art in preference-based RL. Ablation studies further demonstrate that MRN learns a more accurate Q-function compared to prior work and shows obvious advantages when only a small amount of human feedback is available. |
| Researcher Affiliation | Academia | Runze Liu (1, 2), Fengshuo Bai (3), Yali Du (4), Yaodong Yang (1, 5); (1) Institute for AI, Peking University, (2) Shandong University, (3) Institute of Automation, Chinese Academy of Sciences, (4) King's College London, (5) Beijing Institute for General AI |
| Pseudocode | No | The paper provides a high-level framework illustration (Figure 1) and describes the algorithm procedure in text within Section 4.2 and Appendix A, but it does not include a formally labeled 'Pseudocode' or 'Algorithm' block with structured steps. |
| Open Source Code | Yes | The source code and videos of this project are released at https://sites.google.com/view/meta-reward-net. |
| Open Datasets | Yes | In this section, our method is evaluated on a variety of robotic simulated manipulation tasks from Meta-world [21] and locomotion tasks from DeepMind Control Suite (DMControl) [22, 23]. |
| Dataset Splits | No | The paper describes the amount of human preference feedback used for different tasks (e.g., '100 for Walker', '10000 for Hammer') and mentions running experiments multiple times. However, it does not provide explicit training, validation, and test splits for the interaction data or trajectories generated by the reinforcement learning agent, nor does it reference standard splits for the environments. |
| Hardware Specification | Yes | The experiments are run on a single machine with one NVIDIA RTX 2080 Ti GPU. |
| Software Dependencies | No | The paper mentions using publicly released repositories for baselines (B-Pref [58], SURF [18]) and implementing their method using PEBBLE as the backbone. However, it does not provide specific version numbers for software dependencies or libraries like Python, PyTorch, or other relevant packages. |
| Experiment Setup | Yes | Details on hyperparameters and network architectures can be found in Appendix E. |