Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

CPPO: Continual Learning for Reinforcement Learning with Human Feedback

Authors: Han Zhang, Yu Lei, Lin Gui, Min Yang, Yulan He, Hui Wang, Ruifeng Xu

ICLR 2024

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experimental results show that CPPO outperforms strong Continuous learning (CL) baselines when it comes to consistently aligning with human preferences. |
| Researcher Affiliation | Academia | Han Zhang (1,2), Yu Lei (2), Lin Gui (3), Min Yang (4), Yulan He (4), Hui Wang (2), Ruifeng Xu (1,2,5). (1) Harbin Institute of Technology (Shenzhen); (2) Peng Cheng Laboratory; (3) King's College London; (4) Shenzhen Institutes of Advanced Technology; (5) Guangdong Provincial Key Laboratory of Novel Security Intelligence Technologies |
| Pseudocode | Yes | Algorithm 1: CPPO |
| Open Source Code | No | Our implementation is based on the open source library trlx. The paper uses an open-source library but does not state that its own specific implementation code for CPPO is publicly available. |
| Open Datasets | Yes | We use the human preference data provided by CarperAI. URL: https://huggingface.co/datasets/CarperAI/openai_summarize_comparisons |
| Dataset Splits | Yes | Table 3: The dataset is utilized for continual learning. ... Data split Train Valid Test ... Human Feedback part-1 52243 45148 |
| Hardware Specification | Yes | The experiments on the SHP dataset are conducted in 4 Nvidia A100 GPUs with 80 GB of RAM, other experiments are conducted in 2 Nvidia Tesla V100 GPUs with 32 GB of RAM. |
| Software Dependencies | No | Our implementation is based on the open source library trlx. The paper mentions a library ('trlx') and an optimizer ('adamw') but does not specify their version numbers or other software dependencies with versions. |
| Experiment Setup | Yes | Table 12: Hyperparameters of different tasks. Italic font denotes the CPPO-specific hyperparameters. For all tasks, we utilize the default PPO hyperparameters released by trlx. |