Preference Ranking Optimization for Human Alignment

Authors: Feifan Song, Bowen Yu, Minghao Li, Haiyang Yu, Fei Huang, Yongbin Li, Houfeng Wang

AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments show that PRO outperforms baseline algorithms, achieving results comparable to ChatGPT and human responses under automatic-based, reward-based, GPT-4, and human evaluations.
Researcher Affiliation | Collaboration | 1National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University; 2Alibaba Group
Pseudocode | No | The paper contains Figure 2 (a pipeline diagram) and several mathematical equations (Equations 1-9), but no clearly labeled algorithm block or pseudocode.
Open Source Code | Yes | More particulars can be found in our code: github.com/AlibabaResearch/DAMO-ConvAI/tree/main/PRO
Open Datasets | Yes | Data Preparation: We choose HH-RLHF (Bai et al. 2022a) as the experimental dataset. (A dataset-loading sketch follows the table.)
Dataset Splits | No | Each sample contains two different conversations rated by human annotators and is grouped into train/test splits; however, explicit split sizes or ratios are not reported.
Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU/CPU models, memory, or cloud instance types) used to run the experiments.
Software Dependencies | No | We choose the popular LLaMA-7B (Touvron et al. 2023) as the backbone model, and implement PRO using Transformers (Wolf et al. 2020) and Accelerate (Gugger et al. 2022); specific library versions are not reported. (A model-loading sketch follows the table.)
Experiment Setup | Yes | The sequence length, number of epochs, and learning rate are set to 512, 2, and 5e-6, respectively; the maximum number of new tokens generated during inference is 128, and the total batch size is 112. We assign β, the weight of the SFT loss, to 0.05(l-1)^2, where l is the ranking length. (A configuration sketch follows the table.)
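
As referenced in the Open Datasets row, the sketch below loads HH-RLHF from the Hugging Face Hub. The dataset id "Anthropic/hh-rlhf" and the "chosen"/"rejected" fields come from the public release rather than from this report, so treat them as assumptions.

```python
# Hypothetical sketch: loading HH-RLHF from the Hugging Face Hub.
# The dataset id "Anthropic/hh-rlhf" and the "chosen"/"rejected" fields
# reflect the public release, not details stated in the paper.
from datasets import load_dataset

hh = load_dataset("Anthropic/hh-rlhf")   # DatasetDict with "train" and "test" splits
sample = hh["train"][0]
print(sample["chosen"][:200])            # human-preferred conversation
print(sample["rejected"][:200])          # dispreferred conversation
```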
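
As referenced in the Software Dependencies row, here is a minimal sketch of loading the LLaMA-7B backbone with Transformers and preparing it with Accelerate. The checkpoint path is a placeholder, and the precision and optimizer choices are assumptions not stated in the paper.

```python
# Minimal sketch, assuming a locally available LLaMA-7B checkpoint.
# "path/to/llama-7b" is a placeholder; dtype and optimizer are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from accelerate import Accelerator

accelerator = Accelerator()
tokenizer = AutoTokenizer.from_pretrained("path/to/llama-7b")
model = AutoModelForCausalLM.from_pretrained("path/to/llama-7b", torch_dtype=torch.float16)

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-6)  # learning rate reported in the paper
model, optimizer = accelerator.prepare(model, optimizer)
```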
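
As referenced in the Experiment Setup row, the configuration sketch below collects the reported hyperparameters in one place. The dataclass and its field names are my own framing rather than the authors' code, and the SFT-loss weight follows the reported formula 0.05(l-1)^2.

```python
# Illustrative hyperparameter container; field names are assumptions, values are from the paper.
from dataclasses import dataclass

@dataclass
class ProTrainingConfig:
    max_seq_length: int = 512     # sequence length
    num_epochs: int = 2
    learning_rate: float = 5e-6
    max_new_tokens: int = 128     # generation budget at inference time
    total_batch_size: int = 112
    ranking_length: int = 2       # l: number of ranked candidates per prompt (example value)

    @property
    def sft_loss_weight(self) -> float:
        # beta = 0.05 * (l - 1)^2, as reported for the SFT loss weight
        return 0.05 * (self.ranking_length - 1) ** 2

config = ProTrainingConfig(ranking_length=3)
print(config.sft_loss_weight)  # 0.2
```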