Chain of Preference Optimization: Improving Chain-of-Thought Reasoning in LLMs
Authors: Xuan Zhang, Chao Du, Tianyu Pang, Qian Liu, Wei Gao, Min Lin
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experimental results show that CPO significantly improves LLM performance in solving a variety of complex problems, including question answering, fact verification, and arithmetic reasoning, demonstrating its effectiveness. |
| Researcher Affiliation | Collaboration | Sea AI Lab, Singapore; School of Computing and Information Systems, Singapore Management University |
| Pseudocode | No | The paper describes the method conceptually and with flowcharts (Figure 2), but it does not provide any pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is available at https://github.com/sail-sg/CPO. |
| Open Datasets | Yes | For QA, we conduct experiments on three widely used datasets: Bamboogle [17], 2WikiMultiHopQA [52], and HotpotQA [53]. For fact verification, we use three datasets: FEVER [54], FEVEROUS [55], and VitaminC [56]. For arithmetic reasoning, we test on the SVAMP dataset [57]. |
| Dataset Splits | Yes | We train the LLMs for 4 epochs with early stopping based on the performance on a randomly sampled validation set. |
| Hardware Specification | Yes | All experiments are conducted on NVIDIA A100 GPUs. The latency reported in Table 1 is based on a single NVIDIA A100 40GB. |
| Software Dependencies | No | For efficient fine-tuning, we use Low-Rank Adaptation (LoRA) adapters [58]. We use a batch size (with accumulation) of 32 and optimize the LLM with AdamW [59]. Both training and inference are performed using the Accelerate [60] backend. (Software is named, but no library versions or dependency list are given.) |
| Experiment Setup | Yes | In all experiments, we set the regularization controller β to 0.1, generate 10 new thoughts for each state, and retain the top 5 thoughts after pruning at each step of reasoning. The temperature is set to 0.9 for SVAMP and 0.4 for the other datasets. The learning rates for DPO and SFT are 5e-6 and 1e-5, respectively. We use a batch size (with accumulation) of 32 and optimize the LLM with AdamW [59]. For LoRA, the rank is set to 8, and α is set to 16. |
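
The hyperparameters reported in the Experiment Setup row map naturally onto standard preference-optimization tooling. The sketch below is illustrative only: it assumes the Hugging Face `peft`, `trl`, `transformers`, and `datasets` libraries (`LoraConfig`, `DPOConfig`, `DPOTrainer`), a placeholder base model, and a toy preference pair, none of which the table confirms. The authors' released code at https://github.com/sail-sg/CPO is the authoritative implementation.

```python
# Illustrative sketch only (not the authors' implementation; see
# https://github.com/sail-sg/CPO). Library choice, base model, and
# dataset contents are assumptions made for demonstration.
from datasets import Dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base_model = "meta-llama/Llama-2-7b-hf"  # placeholder; the paper's base LLMs may differ
model = AutoModelForCausalLM.from_pretrained(base_model)
tokenizer = AutoTokenizer.from_pretrained(base_model)

# CPO-style preference pairs: at each reasoning step, the thought kept on the
# selected tree-of-thought path is "chosen" and a pruned sibling is "rejected".
preference_pairs = Dataset.from_dict({
    "prompt":   ["Question: ...\nThoughts so far: ...\nNext thought:"],
    "chosen":   [" the thought retained on the selected reasoning path"],
    "rejected": [" a sibling thought pruned during the search"],
})

# LoRA adapters with rank 8 and alpha 16, as reported above.
peft_config = LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM")

# DPO stage: beta = 0.1, lr = 5e-6, accumulated batch size 32, up to 4 epochs.
# AdamW is the default Trainer optimizer; Accelerate is its default backend.
args = DPOConfig(
    output_dir="cpo-dpo",
    beta=0.1,
    learning_rate=5e-6,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,   # 4 x 8 = 32 effective batch size
    num_train_epochs=4,
)

trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=preference_pairs,
    processing_class=tokenizer,      # named `tokenizer` in older trl releases
    peft_config=peft_config,
)
trainer.train()
```

The SFT learning rate of 1e-5 would apply to an analogous supervised stage, while the sampling settings (10 thoughts per state, top 5 retained, temperature 0.9/0.4) govern the preference-data generation during tree-of-thought search rather than this training loop.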