CPPO: Continual Learning for Reinforcement Learning with Human Feedback

Authors: Han Zhang, Yu Lei, Lin Gui, Min Yang, Yulan He, Hui Wang, Ruifeng Xu

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our experimental results show that CPPO outperforms strong Continuous learning (CL) baselines when it comes to consistently aligning with human preferences."
Researcher Affiliation | Academia | Han Zhang1,2, Yu Lei2, Lin Gui3, Min Yang4, Yulan He4, Hui Wang2, Ruifeng Xu1,2,5. Affiliations: 1 Harbin Institute of Technology (Shenzhen); 2 Peng Cheng Laboratory; 3 King's College London; 4 Shenzhen Institutes of Advanced Technology; 5 Guangdong Provincial Key Laboratory of Novel Security Intelligence Technologies
Pseudocode | Yes | Algorithm 1: CPPO
Open Source Code | No | "Our implementation is based on the open source library trlx." The paper builds on an open-source library but does not state that its own CPPO implementation is publicly available.
Open Datasets | Yes | "We use the human preference data provided by CarperAI." URL: https://huggingface.co/datasets/CarperAI/openai_summarize_comparisons (see the dataset-loading sketch after the table)
Dataset Splits | Yes | "Table 3: The dataset is utilized for continual learning. ... Data split Train Valid Test ... Human Feedback part-1 52243 45148"
Hardware Specification | Yes | "The experiments on the SHP dataset are conducted in 4 Nvidia A100 GPUs with 80 GB of RAM, other experiments are conducted in 2 Nvidia Tesla V100 GPUs with 32 GB of RAM."
Software Dependencies | No | "Our implementation is based on the open source library trlx." The paper mentions a library (trlx) and an optimizer (AdamW) but does not specify version numbers or other software dependencies with versions.
Experiment Setup | Yes | "Table 12: Hyperparameters of different tasks. Italic font denotes the CPPO-specific hyperparameters. For all tasks, we utilize the default PPO hyperparameters released by trlx." (see the training-setup sketch after the table)
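
The Open Datasets row points at the CarperAI summarization comparisons data on the Hugging Face Hub. The snippet below is a minimal sketch of how that dataset can be pulled and inspected, assuming the standard `datasets` library; the split names and sizes it prints are whatever the Hub exposes, not the continual-learning re-splits reported in the paper's Table 3.

```python
# Minimal sketch: load and inspect the comparisons dataset cited in the paper.
# Assumes the Hugging Face `datasets` library is installed; the dataset ID is
# taken from the URL in the Open Datasets row. The splits printed here are the
# Hub's own, not the paper's continual-learning partitions (Table 3).
from datasets import load_dataset

comparisons = load_dataset("CarperAI/openai_summarize_comparisons")

for split_name, split in comparisons.items():
    print(split_name, len(split), split.column_names)
```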
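
The Experiment Setup row states that all tasks reuse trlx's default PPO hyperparameters. As a rough illustration of what such a setup looks like, the sketch below configures and launches a PPO run with trlx; it assumes the `default_ppo_config` helper and the `trlx.train` entry point available in recent trlx releases, uses a hypothetical length-based placeholder instead of the paper's learned reward model, and does not reproduce any of the CPPO-specific hyperparameters from Table 12.

```python
# Rough sketch of a trlx PPO run using the library's default hyperparameters.
# Assumptions: a recent trlx release exposing `default_ppo_config` and
# `trlx.train`; the reward function is a hypothetical placeholder, not the
# paper's reward model; the model, prompts, and overrides are illustrative.
import trlx
from trlx.data.default_configs import default_ppo_config

config = default_ppo_config()
config.model.model_path = "gpt2"          # illustrative base model
config.tokenizer.tokenizer_path = "gpt2"

def reward_fn(samples, **kwargs):
    # Placeholder reward that favors longer generations. A real run would
    # score samples with a trained preference/reward model instead.
    return [float(len(s)) for s in samples]

trainer = trlx.train(
    reward_fn=reward_fn,
    prompts=["Summarize: example document one", "Summarize: example document two"],
    eval_prompts=["Summarize: example held-out document"],
    config=config,
)
```

Note that this only exercises the standard PPO training loop that the paper builds on; the CPPO-specific hyperparameters in Table 12 and the method's own modifications to PPO are not represented here.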