Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
CPPO: Continual Learning for Reinforcement Learning with Human Feedback
Authors: Han Zhang, Yu Lei, Lin Gui, Min Yang, Yulan He, Hui Wang, Ruifeng Xu
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experimental results show that CPPO outperforms strong Continuous learning (CL) baselines when it comes to consistently aligning with human preferences. |
| Researcher Affiliation | Academia | Han Zhang1,2, Yu Lei2, Lin Gui3, Min Yang4, Yulan He4, Hui Wang2, Ruifeng Xu1,2,5 1 Harbin Institute of Technology (Shenzhen) 2 Peng Cheng Laboratory 3 King's College London 4 Shenzhen Institutes of Advanced Technology 5 Guangdong Provincial Key Laboratory of Novel Security Intelligence Technologies |
| Pseudocode | Yes | Algorithm 1: CPPO |
| Open Source Code | No | Our implementation is based on the open source library trlx. The paper uses an open-source library but does not state that its own specific implementation code for CPPO is publicly available. |
| Open Datasets | Yes | We use the human preference data provided by CarperAI. URL: https://huggingface.co/datasets/CarperAI/openai_summarize_comparisons |
| Dataset Splits | Yes | Table 3: The dataset is utilized for continual learning. ... Data split Train Valid Test ... Human Feedback part-1 52243 45148 |
| Hardware Specification | Yes | The experiments on the SHP dataset are conducted in 4 Nvidia A100 GPUs with 80 GB of RAM, other experiments are conducted in 2 Nvidia Tesla V100 GPUs with 32 GB of RAM. |
| Software Dependencies | No | Our implementation is based on the open source library trlx. The paper mentions a library ('trlx') and an optimizer ('adamw') but does not specify their version numbers or other software dependencies with versions. |
| Experiment Setup | Yes | Table 12: Hyperparameters of different tasks. Italic font denotes the CPPO-specific hyperparameters. For all tasks, we utilize the default PPO hyperparameters released by trlx. |