Distributional Preference Alignment of LLMs via Optimal Transport
Authors: Igor Melnyk, Youssef Mroueh, Brian Belgodere, Mattia Rigotti, Apoorva Nitsure, Mikhail Yurochkin, Kristjan Greenewald, Jiri Navratil, Jarret Ross
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we show on a diverse set of alignment datasets and LLMs that AOT leads to state-of-the-art models in the 7B family of models when evaluated with Open LLM Benchmarks and AlpacaEval. |
| Researcher Affiliation | Collaboration | Igor Melnyk, Youssef Mroueh, Brian Belgodere, Mattia Rigotti, Apoorva Nitsure, Mikhail Yurochkin, Kristjan Greenewald, Jiri Navratil, and Jarret Ross; IBM Research; MIT-IBM Watson AI Lab |
| Pseudocode | Yes | Algorithms 1 and 2 in Appendix B summarize our AOT approach for distributional preference alignment in the unpaired and paired setting. (A hedged sketch of the sorted-margin loss follows the table.) |
| Open Source Code | Yes | Code for AOT is available in the Hugging Face TRL library https://ibm.biz/AOT_TRL. |
| Open Datasets | Yes | For the paired dataset, we used the UltraFeedback binarized dataset from [Tunstall et al., 2023b], containing over 60K training samples... For unpaired datasets, we used PKU BeaverTails [Ji et al., 2023] with over 300K samples and HelpSteer [Wang et al., 2023] with around 35K samples. |
| Dataset Splits | No | The paper does not explicitly provide training/validation/test dataset splits; it only reports training sample counts and evaluation on benchmarks such as AlpacaEval and the Open LLM benchmarks. |
| Hardware Specification | Yes | For each run our compute setup consisted of 8 H100 GPUs. |
| Software Dependencies | No | Our implementation is based on the Hugging Face Alignment Handbook [Tunstall et al., 2023a]. As we show in Appendix B, the changes needed to adapt the HF TRL trainer [von Werra et al., 2020] for AOT are minimal and can therefore be easily adopted by the community. No specific version numbers for these software dependencies are provided in the paper text. |
| Experiment Setup | Yes | We used LoRA [Hu et al., 2021] for parameter-efficient fine-tuning during alignment and the FSDP (Fully-Sharded Data-Parallel) setup to train the model over multiple GPUs. Under this setup, training each 7B-parameter model on the UltraFeedback dataset took approximately one hour. The batch size is the effective number of samples in the mini-batch per GPU. We found the logistic loss to perform better than the least-squares or hinge-squared losses (all using β = 0.01). (A hedged configuration sketch follows the table.) |
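
The Pseudocode row defers to Algorithms 1 and 2 in Appendix B for the unpaired and paired AOT objectives. As a rough illustration of the core mechanism only (a minimal sketch in our own words, not the authors' released code; the function name `aot_unpaired_loss` and the equal-batch-size assumption are ours), the snippet below exploits the fact that one-dimensional optimal transport with a convex cost reduces to sorting: comparing the sorted chosen rewards against the sorted rejected rewards matches the two reward distributions quantile-by-quantile, and a logistic penalty on those sorted margins pushes the chosen distribution toward stochastic dominance over the rejected one.

```python
import torch
import torch.nn.functional as F

def aot_unpaired_loss(chosen_logratios: torch.Tensor,
                      rejected_logratios: torch.Tensor,
                      beta: float = 0.01) -> torch.Tensor:
    """Illustrative sketch of an unpaired AOT-style objective.

    Inputs are the implicit rewards log pi(y|x) - log pi_ref(y|x) for the
    chosen and rejected responses of a mini-batch (equal sizes assumed),
    scaled by beta as in the paper (beta = 0.01).
    """
    # 1-D OT reduces to sorting: align the two empirical quantile functions.
    chosen_sorted, _ = torch.sort(beta * chosen_logratios)
    rejected_sorted, _ = torch.sort(beta * rejected_logratios)
    margins = chosen_sorted - rejected_sorted
    # Logistic loss log(1 + exp(-m)) = softplus(-m); the paper reports it
    # outperforming least-squares and hinge-squared alternatives.
    return F.softplus(-margins).mean()
```

A paired variant would apply the same smooth penalty to sorted per-pair margins rather than to the two marginals separately; consult Algorithms 1 and 2 in the paper's Appendix B for the authors' exact formulations.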
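
For the Experiment Setup row, the following is a minimal configuration sketch, assuming a recent TRL release that includes the AOT loss types shipped alongside the linked ibm.biz/AOT_TRL contribution (`"aot"` and `"aot_pair"` in `DPOConfig`; check the TRL docs for which variant matches paired vs. unpaired data). The base model, LoRA hyperparameters, and batch size here are illustrative placeholders, not the paper's exact values.

```python
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# Illustrative 7B base model; the paper aligns several 7B-family models.
model_name = "mistralai/Mistral-7B-v0.1"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Paired preference data: the binarized UltraFeedback dataset.
dataset = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")

# LoRA for parameter-efficient alignment (hyperparameters are placeholders).
peft_config = LoraConfig(r=16, lora_alpha=16, lora_dropout=0.05,
                         task_type="CAUSAL_LM")

training_args = DPOConfig(
    output_dir="aot-7b",
    beta=0.01,                      # matches the beta reported above
    loss_type="aot",                # AOT loss registered in TRL ("aot"/"aot_pair")
    per_device_train_batch_size=4,  # placeholder; the paper defines batch size per GPU
)

trainer = DPOTrainer(
    model=model,                # reference model is created implicitly with PEFT
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer, # older TRL releases use tokenizer= instead
    peft_config=peft_config,
)
trainer.train()
```

FSDP sharding across the eight H100s is configured outside the training script, e.g. via `accelerate launch` with an FSDP config, following the Alignment Handbook recipes the authors build on.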