Distributional Preference Alignment of LLMs via Optimal Transport

Authors: Igor Melnyk, Youssef Mroueh, Brian Belgodere, Mattia Rigotti, Apoorva Nitsure, Mikhail Yurochkin, Kristjan Greenewald, Jiri Navratil, Jarret Ross

NeurIPS 2024

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | "Empirically, we show on a diverse set of alignment datasets and LLMs that AOT leads to state-of-the-art models in the 7B family of models when evaluated with Open LLM Benchmarks and Alpaca Eval." |
| Researcher Affiliation | Collaboration | "Igor Melnyk, Youssef Mroueh, Brian Belgodere, Mattia Rigotti, Apoorva Nitsure, Mikhail Yurochkin, Kristjan Greenewald, Jiri Navratil, and Jarret Ross. IBM Research; MIT-IBM Watson AI Lab." |
| Pseudocode | Yes | "Algorithms 1 and 2 in Appendix B summarize our AOT approach for distributional preference alignment in the unpaired and paired setting." |
| Open Source Code | Yes | "Code for AOT is available in the Hugging Face TRL library https://ibm.biz/AOT_TRL." |
| Open Datasets | Yes | "For the paired dataset, we used the UltraFeedback binarized dataset from [Tunstall et al., 2023b], containing over 60K training samples... For unpaired datasets, we used PKU BeaverTails [Ji et al., 2023] with over 300K samples and HelpSteer [Wang et al., 2023] with around 35K samples." |
| Dataset Splits | No | The paper does not explicitly provide training/validation/test dataset splits; it only mentions training samples and evaluation on benchmarks such as Alpaca Eval and the Open LLM benchmarks. |
| Hardware Specification | Yes | "For each run our compute setup consisted of 8 H100 GPUs." |
| Software Dependencies | No | "Our implementation is based on the Hugging Face Alignment Handbook [Tunstall et al., 2023a]. As we show in Appendix in Section B, the changes needed to adapt the HF TRL trainer [von Werra et al., 2020] for AOT are minimal and therefore can easily be adapted by the community." No specific version numbers for these software dependencies are provided in the paper. |
| Experiment Setup | Yes | "We used LoRA [Hu et al., 2021] for parameter-efficient fine-tuning during alignment and the FSDP (Fully-Sharded Data-Parallel) setup to train the model over multiple GPUs. Under this setup, the training of each 7B-parameter model on the Ultra Feedback dataset took approximately one hour. The batch size is the effective number of samples in the mini-batch per GPU. We found the logistic loss to be performing better than least squares or hinge squared losses (all using β = 0.01)." |
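To make the distributional idea behind Algorithms 1 and 2 concrete, here is a minimal sketch of the two AOT variants under stated assumptions: rewards are precomputed scalars, a hard `sorted()` stands in for the differentiable soft sort used for training, and the logistic loss with β = 0.01 matches the loss choice quoted above. Function names are hypothetical, not the paper's or TRL's API.

```python
import math

def logistic_loss(margin, beta=0.01):
    # Log-sigmoid penalty on a reward margin, as in DPO-style objectives;
    # it decreases as the margin becomes more positive.
    return math.log(1.0 + math.exp(-beta * margin))

def aot_paired_loss(chosen_rewards, rejected_rewards, beta=0.01):
    """Paired AOT sketch: sort the per-example reward margins so the
    logistic loss is applied across the whole margin distribution
    (pushing every quantile, not just the mean, to be positive)."""
    margins = sorted(c - r for c, r in zip(chosen_rewards, rejected_rewards))
    return sum(logistic_loss(m, beta) for m in margins) / len(margins)

def aot_unpaired_loss(chosen_rewards, rejected_rewards, beta=0.01):
    """Unpaired AOT sketch: sort chosen and rejected rewards separately
    and compare them quantile-to-quantile -- the one-dimensional optimal
    transport coupling -- to encourage first-order stochastic dominance
    of the chosen-reward distribution over the rejected one."""
    c_sorted = sorted(chosen_rewards)
    r_sorted = sorted(rejected_rewards)
    pairs = zip(c_sorted, r_sorted)
    return sum(logistic_loss(c - r, beta) for c, r in pairs) / len(c_sorted)
```

The sorting step is what makes the objective distributional: after sorting, the k-th chosen reward is compared against the k-th rejected reward, so a few large margins cannot mask a tail of negative ones.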
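Since the report notes that the AOT code lives in the Hugging Face TRL library and that training used LoRA with β = 0.01, a configuration along the following lines is plausible. This is a hypothetical sketch: the class names come from the public `peft` and `trl` APIs, but the exact AOT loss-type strings and parameter values should be verified against the TRL release and the linked repository; the batch size here is illustrative, not the paper's.

```python
from peft import LoraConfig
from trl import DPOConfig, DPOTrainer

# LoRA adapter for parameter-efficient fine-tuning (ranks are illustrative).
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

training_args = DPOConfig(
    beta=0.01,                      # matches the β = 0.01 quoted above
    loss_type="aot_pair",           # paired AOT; "aot" for the unpaired variant
    per_device_train_batch_size=8,  # illustrative; the paper reports batch per GPU
    output_dir="aot-7b",
)

# trainer = DPOTrainer(model, ref_model, args=training_args,
#                      train_dataset=dataset, peft_config=peft_config)
# trainer.train()
```

Multi-GPU sharding (the FSDP setup on 8 H100s mentioned above) would be configured outside this snippet, e.g. via an `accelerate` launch configuration.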