CPPO: Continual Learning for Reinforcement Learning with Human Feedback

Authors: Han Zhang, Yu Lei, Lin Gui, Min Yang, Yulan He, Hui Wang, Ruifeng Xu

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our experimental results show that CPPO outperforms strong Continuous learning (CL) baselines when it comes to consistently aligning with human preferences."
Researcher Affiliation | Academia | Han Zhang1,2, Yu Lei2, Lin Gui3, Min Yang4, Yulan He4, Hui Wang2, Ruifeng Xu1,2,5. Affiliations: 1 Harbin Institute of Technology (Shenzhen); 2 Peng Cheng Laboratory; 3 King's College London; 4 Shenzhen Institutes of Advanced Technology; 5 Guangdong Provincial Key Laboratory of Novel Security Intelligence Technologies
Pseudocode | Yes | Algorithm 1: CPPO
Open Source Code | No | "Our implementation is based on the open source library trlx." The paper builds on an open-source library but does not state that its own CPPO implementation is publicly available.
Open Datasets | Yes | "We use the human preference data provided by CarperAI." URL: https://huggingface.co/datasets/CarperAI/openai_summarize_comparisons (see the dataset-loading sketch after the table)
Dataset Splits | Yes | "Table 3: The dataset is utilized for continual learning. ... Data split Train Valid Test ... Human Feedback part-1 52243 45148"
Hardware Specification | Yes | "The experiments on the SHP dataset are conducted in 4 Nvidia A100 GPUs with 80 GB of RAM, other experiments are conducted in 2 Nvidia Tesla V100 GPUs with 32 GB of RAM."
Software Dependencies | No | "Our implementation is based on the open source library trlx." The paper mentions a library (trlx) and an optimizer (AdamW) but does not specify version numbers or other software dependencies with versions.
Experiment Setup | Yes | "Table 12: Hyperparameters of different tasks. Italic font denotes the CPPO-specific hyperparameters. For all tasks, we utilize the default PPO hyperparameters released by trlx." (see the training-setup sketch after the table)
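
The Open Datasets row points at the CarperAI summarization comparisons data on the Hugging Face Hub. The snippet below is a minimal sketch of how that dataset can be pulled and inspected, assuming the standard `datasets` library; the split names and sizes it prints are whatever the Hub exposes, not the continual-learning re-splits reported in the paper's Table 3.

```python
# Minimal sketch: load and inspect the comparisons dataset cited in the paper.
# Assumes the Hugging Face `datasets` library is installed; the dataset ID is
# taken from the URL in the Open Datasets row. The splits printed here are the
# Hub's own, not the paper's continual-learning partitions (Table 3).
from datasets import load_dataset

comparisons = load_dataset("CarperAI/openai_summarize_comparisons")

for split_name, split in comparisons.items():
    print(split_name, len(split), split.column_names)
```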
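
The Experiment Setup row states that all tasks reuse trlx's default PPO hyperparameters. As a rough illustration of what such a setup looks like, the sketch below configures and launches a PPO run with trlx; it assumes the `default_ppo_config` helper and the `trlx.train` entry point available in recent trlx releases, uses a hypothetical length-based placeholder instead of the paper's learned reward model, and does not reproduce any of the CPPO-specific hyperparameters from Table 12.

```python
# Rough sketch of a trlx PPO run using the library's default hyperparameters.
# Assumptions: a recent trlx release exposing `default_ppo_config` and
# `trlx.train`; the reward function is a hypothetical placeholder, not the
# paper's reward model; the model, prompts, and overrides are illustrative.
import trlx
from trlx.data.default_configs import default_ppo_config

config = default_ppo_config()
config.model.model_path = "gpt2"          # illustrative base model
config.tokenizer.tokenizer_path = "gpt2"

def reward_fn(samples, **kwargs):
    # Placeholder reward that favors longer generations. A real run would
    # score samples with a trained preference/reward model instead.
    return [float(len(s)) for s in samples]

trainer = trlx.train(
    reward_fn=reward_fn,
    prompts=["Summarize: example document one", "Summarize: example document two"],
    eval_prompts=["Summarize: example held-out document"],
    config=config,
)
```

Note that this only exercises the standard PPO training loop that the paper builds on; the CPPO-specific hyperparameters in Table 12 and the method's own modifications to PPO are not represented here.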