Chain of Preference Optimization: Improving Chain-of-Thought Reasoning in LLMs

Authors: Xuan Zhang, Chao Du, Tianyu Pang, Qian Liu, Wei Gao, Min Lin

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experimental results show that CPO significantly improves LLM performance in solving a variety of complex problems, including question answering, fact verification, and arithmetic reasoning, demonstrating its effectiveness.
Researcher Affiliation | Collaboration | Sea AI Lab, Singapore; School of Computing and Information Systems, Singapore Management University
Pseudocode | No | The paper describes the method conceptually and with flowcharts (Figure 2), but it does not provide any pseudocode or algorithm blocks.
Open Source Code | Yes | Our code is available at https://github.com/sail-sg/CPO.
Open Datasets | Yes | For QA, we conduct experiments on three widely used datasets: Bamboogle [17], 2WikiMultiHopQA [52], and HotpotQA [53]. For fact verification, we use three datasets: FEVER [54], FEVEROUS [55], and VitaminC [56]. For arithmetic reasoning, we test on the SVAMP dataset [57].
Dataset Splits | Yes | We train the LLMs for 4 epochs with early stopping based on the performance on a randomly sampled validation set.
Hardware Specification | Yes | All experiments are conducted on NVIDIA A100 GPUs. The latency reported in Table 1 is based on a single NVIDIA A100 40GB.
Software Dependencies | No | For efficient fine-tuning, we use Low-Rank Adaptation (LoRA) adapters [58]. We use a batch size (with accumulation) of 32 and optimize the LLM with AdamW [59]. Both training and inference are performed using the Accelerate [60] backend.
Experiment Setup | Yes | In all experiments, we set the regularization controller β to 0.1, generate 10 new thoughts for each state, and retain the top 5 thoughts after pruning at each step of reasoning. The temperature is set to 0.9 for SVAMP and 0.4 for the other datasets. The learning rates for DPO and SFT are 5e-6 and 1e-5, respectively. We use a batch size (with accumulation) of 32 and optimize the LLM with AdamW [59]. For LoRA, the rank is set to 8, and α is set to 16.
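
The Experiment Setup row fixes the regularization controller β at 0.1 and the DPO learning rate at 5e-6. To make the role of β concrete, the sketch below shows the standard DPO preference loss that β enters; it is a minimal illustration assuming the usual DPO formulation over summed sequence log-probabilities, not the paper's exact implementation, and all tensor names are illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective: -log sigmoid(beta * (policy margin - reference margin)).

    Inputs are summed log-probabilities of the preferred ("chosen") and dispreferred
    ("rejected") sequences under the trained policy and the frozen reference model.
    beta=0.1 matches the setting reported in the Experiment Setup row.
    """
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```

A larger β penalizes the policy more strongly for drifting from the reference model, which is why the paper describes it as a regularization controller.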
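
The Software Dependencies and Experiment Setup rows name LoRA, AdamW, and Accelerate together with rank 8, α = 16, a learning rate of 5e-6 for DPO, and an effective batch size of 32, but no versions or glue code. The following is a minimal sketch of how those pieces typically fit together, assuming the Hugging Face transformers/peft/accelerate stack; the base model name and the per-device batch / accumulation split are assumptions, not taken from the paper.

```python
import torch
from accelerate import Accelerator
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

# Effective batch size of 32 via gradient accumulation; the exact split between
# per-device batch size and accumulation steps is an assumption.
accelerator = Accelerator(gradient_accumulation_steps=8)

base_model = "meta-llama/Llama-2-7b-hf"  # placeholder; the base model is not specified in this section
model = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(base_model)

# LoRA adapters with the reported rank and alpha.
model = get_peft_model(model, LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM"))

# AdamW with the reported DPO learning rate (the SFT stage uses 1e-5 instead).
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-6)

# Accelerate wraps the model and optimizer for distributed training and inference.
model, optimizer = accelerator.prepare(model, optimizer)
```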