Chain of Preference Optimization: Improving Chain-of-Thought Reasoning in LLMs
Authors: Xuan Zhang, Chao Du, Tianyu Pang, Qian Liu, Wei Gao, Min Lin
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experimental results show that CPO significantly improves LLM performance in solving a variety of complex problems, including question answering, fact verification, and arithmetic reasoning, demonstrating its effectiveness. |
| Researcher Affiliation | Collaboration | Sea AI Lab, Singapore; School of Computing and Information Systems, Singapore Management University |
| Pseudocode | No | The paper describes the method conceptually and with flowcharts (Figure 2), but it does not provide any pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is available at https://github.com/sail-sg/CPO. |
| Open Datasets | Yes | For QA, we conduct experiments on three widely used datasets: Bamboogle [17], 2WikiMultiHopQA [52], and HotpotQA [53]. For fact verification, we use three datasets: FEVER [54], FEVEROUS [55], and VitaminC [56]. For arithmetic reasoning, we test on the SVAMP dataset [57]. |
| Dataset Splits | Yes | We train the LLMs for 4 epochs with early stopping based on the performance on a randomly sampled validation set. |
| Hardware Specification | Yes | All experiments are conducted on NVIDIA A100 GPUs. The latency reported in Table 1 is based on a single NVIDIA A100 40GB. |
| Software Dependencies | No | For efficient fine-tuning, we use Low-Rank Adaptation (LoRA) adapters [58]. We use a batch size (with accumulation) of 32 and optimize the LLM with AdamW [59]. Both training and inference are performed using the Accelerate [60] backend. (Software is named, but no library versions or dependency list are given.) |
| Experiment Setup | Yes | In all experiments, we set the regularization controller β to 0.1, generate 10 new thoughts for each state, and retain the top 5 thoughts after pruning at each step of reasoning. The temperature is set to 0.9 for SVAMP and 0.4 for the other datasets. The learning rates for DPO and SFT are 5e-6 and 1e-5, respectively. We use a batch size (with accumulation) of 32 and optimize the LLM with AdamW [59]. For LoRA, the rank is set to 8, and α is set to 16. |
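
The hyperparameters reported in the Experiment Setup row map naturally onto standard preference-optimization tooling. The sketch below is illustrative only: it assumes the Hugging Face `peft`, `trl`, `transformers`, and `datasets` libraries (`LoraConfig`, `DPOConfig`, `DPOTrainer`), a placeholder base model, and a toy preference pair, none of which the table confirms. The authors' released code at https://github.com/sail-sg/CPO is the authoritative implementation.

```python
# Illustrative sketch only (not the authors' implementation; see
# https://github.com/sail-sg/CPO). Library choice, base model, and
# dataset contents are assumptions made for demonstration.
from datasets import Dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base_model = "meta-llama/Llama-2-7b-hf"  # placeholder; the paper's base LLMs may differ
model = AutoModelForCausalLM.from_pretrained(base_model)
tokenizer = AutoTokenizer.from_pretrained(base_model)

# CPO-style preference pairs: at each reasoning step, the thought kept on the
# selected tree-of-thought path is "chosen" and a pruned sibling is "rejected".
preference_pairs = Dataset.from_dict({
    "prompt":   ["Question: ...\nThoughts so far: ...\nNext thought:"],
    "chosen":   [" the thought retained on the selected reasoning path"],
    "rejected": [" a sibling thought pruned during the search"],
})

# LoRA adapters with rank 8 and alpha 16, as reported above.
peft_config = LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM")

# DPO stage: beta = 0.1, lr = 5e-6, accumulated batch size 32, up to 4 epochs.
# AdamW is the default Trainer optimizer; Accelerate is its default backend.
args = DPOConfig(
    output_dir="cpo-dpo",
    beta=0.1,
    learning_rate=5e-6,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,   # 4 x 8 = 32 effective batch size
    num_train_epochs=4,
)

trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=preference_pairs,
    processing_class=tokenizer,      # named `tokenizer` in older trl releases
    peft_config=peft_config,
)
trainer.train()
```

The SFT learning rate of 1e-5 would apply to an analogous supervised stage, while the sampling settings (10 thoughts per state, top 5 retained, temperature 0.9/0.4) govern the preference-data generation during tree-of-thought search rather than this training loop.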