Preference Ranking Optimization for Human Alignment
Authors: Feifan Song, Bowen Yu, Minghao Li, Haiyang Yu, Fei Huang, Yongbin Li, Houfeng Wang
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that PRO outperforms baseline algorithms, achieving results comparable to ChatGPT and human responses under automatic-based, reward-based, GPT-4, and human evaluations. |
| Researcher Affiliation | Collaboration | 1National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University 2Alibaba Group |
| Pseudocode | No | The paper contains Figure 2, a pipeline diagram, and several mathematical equations (Equations 1-9), but no clearly labeled algorithm block or pseudocode. (A hedged sketch of the ranking objective follows the table.) |
| Open Source Code | Yes | More particulars can be found in our code: github.com/AlibabaResearch/DAMO-ConvAI/tree/main/PRO |
| Open Datasets | Yes | Data Preparation: We choose HH-RLHF (Bai et al. 2022a) as the experimental dataset. (Loading sketch below.) |
| Dataset Splits | No | The paper notes that each sample contains two different conversations rated by human annotators and is grouped into train/test splits, but it reports no split sizes or ratios. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU/CPU models, memory, or cloud instance types) used to run the experiments. |
| Software Dependencies | No | The paper names its stack but pins no versions: We choose the popular LLaMA-7B (Touvron et al. 2023) as the backbone model, and implement PRO using Transformers (Wolf et al. 2020) and Accelerate (Gugger et al. 2022). (Loading sketch below.) |
| Experiment Setup | Yes | The sequence length, epoch count, and learning rate are set to 512, 2, and 5e-6, respectively; the maximum number of new tokens generated during inference is 128, and the total batch size is 112. We assign β, the weight of the SFT loss, to 0.05(l−1)², where l is the ranking length. (Setup sketch below.) |
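The paper presents PRO as equations rather than an algorithm block. As a reading aid only, below is a minimal sketch of the listwise ranking objective those equations describe (a Plackett-Luce style loss over per-response policy scores); the function name, tensor layout, and score definition are our assumptions, not the authors' implementation:

```python
import torch

def pro_ranking_loss(scores: torch.Tensor) -> torch.Tensor:
    """Listwise preference-ranking loss (sketch, not the authors' code).

    scores: (batch, l) tensor; column k holds the policy's score for the
    response ranked k-th by human preference (column 0 is the best).
    At each prefix position k, the top remaining response is contrasted
    against all lower-ranked candidates via a softmax.
    """
    _, l = scores.shape
    loss = scores.new_zeros(())
    for k in range(l - 1):
        # Log-softmax of the k-th response against candidates ranked k..l-1;
        # index 0 of the slice is the response that should win this round.
        log_probs = torch.log_softmax(scores[:, k:], dim=-1)
        loss = loss - log_probs[:, 0].mean()
    return loss
```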
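HH-RLHF is distributed on the Hugging Face Hub with ready-made train/test splits. A minimal loading sketch, assuming the `datasets` library and the `Anthropic/hh-rlhf` hub identifier (the paper cites the dataset but not a hub path):

```python
from datasets import load_dataset

# Each record pairs a human-preferred ("chosen") conversation with a
# dispreferred ("rejected") one; splits come pre-defined on the Hub.
hh = load_dataset("Anthropic/hh-rlhf")

print(hh)                          # DatasetDict with "train" and "test"
print(hh["train"][0]["chosen"])    # preferred conversation
print(hh["train"][0]["rejected"])  # dispreferred conversation
```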
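The reported stack is LLaMA-7B loaded through Transformers, with Accelerate handling device placement. A loading sketch; the `huggyllama/llama-7b` checkpoint id and the fp16 dtype are assumptions, since the paper names only the model family:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "huggyllama/llama-7b"  # hypothetical hub id; paper says only "LLaMA-7B"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,  # dtype is an assumption
    device_map="auto",          # Accelerate shards/places the weights
)
```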
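The one setup detail worth spelling out is the ranking-length-dependent SFT weight. A sketch collecting the reported values (constant and function names are ours):

```python
# Hyperparameters reported in the paper (names are ours).
SEQ_LEN = 512         # maximum sequence length
EPOCHS = 2
LEARNING_RATE = 5e-6
MAX_NEW_TOKENS = 128  # generation budget at inference time
BATCH_SIZE = 112      # total batch size

def sft_weight(ranking_length: int) -> float:
    """beta = 0.05 * (l - 1)^2, where l is the ranking length."""
    return 0.05 * (ranking_length - 1) ** 2

print(sft_weight(2))  # 0.05 for the pairwise case (l = 2)
print(sft_weight(3))  # 0.2 when ranking three responses
```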