Preference Ranking Optimization for Human Alignment

Authors: Feifan Song, Bowen Yu, Minghao Li, Haiyang Yu, Fei Huang, Yongbin Li, Houfeng Wang

AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments show that PRO outperforms baseline algorithms, achieving results comparable to ChatGPT and human responses under automatic-based, reward-based, GPT-4, and human evaluations.
Researcher Affiliation | Collaboration | 1National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University; 2Alibaba Group
Pseudocode | No | The paper contains Figure 2 (a pipeline diagram) and several mathematical equations (Equations 1-9), but no clearly labeled algorithm block or pseudocode.
Open Source Code | Yes | More particulars can be found in our code: github.com/AlibabaResearch/DAMO-ConvAI/tree/main/PRO
Open Datasets | Yes | Data Preparation: We choose HH-RLHF (Bai et al. 2022a) as the experimental dataset. (A dataset-loading sketch follows the table.)
Dataset Splits | No | Each sample contains two different conversations rated by human annotators and is grouped into train/test splits; however, explicit split sizes or ratios are not reported.
Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU/CPU models, memory, or cloud instance types) used to run the experiments.
Software Dependencies | No | We choose the popular LLaMA-7B (Touvron et al. 2023) as the backbone model, and implement PRO using Transformers (Wolf et al. 2020) and Accelerate (Gugger et al. 2022); specific library versions are not reported. (A model-loading sketch follows the table.)
Experiment Setup | Yes | The sequence length, number of epochs, and learning rate are set to 512, 2, and 5e-6, respectively; the maximum number of new tokens generated during inference is 128, and the total batch size is 112. We assign β, the weight of the SFT loss, to 0.05(l-1)^2, where l is the ranking length. (A configuration sketch follows the table.)
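
As referenced in the Open Datasets row, the sketch below loads HH-RLHF from the Hugging Face Hub. The dataset id "Anthropic/hh-rlhf" and the "chosen"/"rejected" fields come from the public release rather than from this report, so treat them as assumptions.

```python
# Hypothetical sketch: loading HH-RLHF from the Hugging Face Hub.
# The dataset id "Anthropic/hh-rlhf" and the "chosen"/"rejected" fields
# reflect the public release, not details stated in the paper.
from datasets import load_dataset

hh = load_dataset("Anthropic/hh-rlhf")   # DatasetDict with "train" and "test" splits
sample = hh["train"][0]
print(sample["chosen"][:200])            # human-preferred conversation
print(sample["rejected"][:200])          # dispreferred conversation
```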
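
As referenced in the Software Dependencies row, here is a minimal sketch of loading the LLaMA-7B backbone with Transformers and preparing it with Accelerate. The checkpoint path is a placeholder, and the precision and optimizer choices are assumptions not stated in the paper.

```python
# Minimal sketch, assuming a locally available LLaMA-7B checkpoint.
# "path/to/llama-7b" is a placeholder; dtype and optimizer are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from accelerate import Accelerator

accelerator = Accelerator()
tokenizer = AutoTokenizer.from_pretrained("path/to/llama-7b")
model = AutoModelForCausalLM.from_pretrained("path/to/llama-7b", torch_dtype=torch.float16)

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-6)  # learning rate reported in the paper
model, optimizer = accelerator.prepare(model, optimizer)
```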
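
As referenced in the Experiment Setup row, the configuration sketch below collects the reported hyperparameters in one place. The dataclass and its field names are my own framing rather than the authors' code, and the SFT-loss weight follows the reported formula 0.05(l-1)^2.

```python
# Illustrative hyperparameter container; field names are assumptions, values are from the paper.
from dataclasses import dataclass

@dataclass
class ProTrainingConfig:
    max_seq_length: int = 512     # sequence length
    num_epochs: int = 2
    learning_rate: float = 5e-6
    max_new_tokens: int = 128     # generation budget at inference time
    total_batch_size: int = 112
    ranking_length: int = 2       # l: number of ranked candidates per prompt (example value)

    @property
    def sft_loss_weight(self) -> float:
        # beta = 0.05 * (l - 1)^2, as reported for the SFT loss weight
        return 0.05 * (self.ranking_length - 1) ** 2

config = ProTrainingConfig(ranking_length=3)
print(config.sft_loss_weight)  # 0.2
```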