CPPO: Continual Learning for Reinforcement Learning with Human Feedback
Authors: Han Zhang, Yu Lei, Lin Gui, Min Yang, Yulan He, Hui Wang, Ruifeng Xu
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experimental results show that CPPO outperforms strong continual learning (CL) baselines when it comes to consistently aligning with human preferences. |
| Researcher Affiliation | Academia | Han Zhang 1,2; Yu Lei 2; Lin Gui 3; Min Yang 4; Yulan He 4; Hui Wang 2; Ruifeng Xu 1,2,5. Affiliations: (1) Harbin Institute of Technology (Shenzhen); (2) Peng Cheng Laboratory; (3) King's College London; (4) Shenzhen Institutes of Advanced Technology; (5) Guangdong Provincial Key Laboratory of Novel Security Intelligence Technologies |
| Pseudocode | Yes | Algorithm 1: CPPO |
| Open Source Code | No | Our implementation is based on the open source library trlx. The paper uses an open-source library but does not state that its own implementation of CPPO is publicly available. |
| Open Datasets | Yes | We use the human preference data provided by CarperAI. URL: https://huggingface.co/datasets/CarperAI/openai_summarize_comparisons |
| Dataset Splits | Yes | Table 3: The dataset is utilized for continual learning. ... Data split Train Valid Test ... Human Feedback part-1 52243 45148 |
| Hardware Specification | Yes | The experiments on the SHP dataset are conducted on 4 Nvidia A100 GPUs with 80 GB of RAM; other experiments are conducted on 2 Nvidia Tesla V100 GPUs with 32 GB of RAM. |
| Software Dependencies | No | Our implementation is based on the open source library trlx. The paper mentions a library ('trlx') and an optimizer ('adamw') but does not specify version numbers for them or list any other versioned software dependencies. |
| Experiment Setup | Yes | Table 12: Hyperparameters of different tasks. Italic font denotes the CPPO-specific hyperparameters. For all tasks, we utilize the default PPO hyperparameters released by trlx. |
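
For readers who want to inspect the preference data named in the Open Datasets row, a minimal sketch follows. It assumes the Hugging Face `datasets` library is installed and that the dataset identifier is `CarperAI/openai_summarize_comparisons`, read off the URL in that row. The helper names `dataset_url` and `load_preference_data` are hypothetical and are not taken from the paper or from trlx.

```python
# Assumed identifier, taken from the URL in the Open Datasets row.
DATASET_ID = "CarperAI/openai_summarize_comparisons"


def dataset_url(dataset_id: str = DATASET_ID) -> str:
    """Return the Hugging Face Hub page for a dataset identifier."""
    return f"https://huggingface.co/datasets/{dataset_id}"


def load_preference_data(dataset_id: str = DATASET_ID):
    """Download the dataset and return whatever splits the Hub repository defines.

    Hypothetical helper for illustration; it needs network access and the
    `datasets` package, which is imported lazily so that this module can be
    imported without it.
    """
    from datasets import load_dataset

    return load_dataset(dataset_id)
```

Calling `load_preference_data()` and printing the result lists the available splits and their sizes. These are the splits as published on the Hub and need not match the continual-learning partition summarized in the paper's Table 3.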