Compositional Preference Models for Aligning LMs

Authors: Dongyoung Go, Tomasz Korbak, Germán Kruszewski, Jos Rozen, Marc Dymetman

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments show that CPMs not only improve generalization and are more robust to overoptimization than standard PMs, but also that best-of-n samples obtained using CPMs tend to be preferred over samples obtained using conventional PMs.
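The best-of-n procedure referenced in the row above can be sketched in a few lines: sample n candidate responses, score each with the preference model, and keep the argmax. The scoring function below is a hypothetical placeholder (a real CPM combines LLM-extracted feature scores via a learned logistic-regression head); only the selection logic is the point here.

```python
# Minimal sketch of best-of-n sampling against a preference model (PM).
# `toy_pm_score` is a hypothetical stand-in, NOT the paper's CPM: it just
# prefers longer responses so the example is runnable and deterministic.

def toy_pm_score(response: str) -> float:
    """Hypothetical preference score; stands in for a trained (C)PM."""
    return float(len(response.split()))

def best_of_n(candidates: list[str], score_fn) -> str:
    """Return the candidate with the highest preference-model score."""
    return max(candidates, key=score_fn)

candidates = ["Short reply.", "A somewhat longer, more detailed reply."]
print(best_of_n(candidates, toy_pm_score))
# → A somewhat longer, more detailed reply.
```

With a CPM, `score_fn` would be the sum of learned weights times per-feature scores, which is what makes the resulting selection more robust to overoptimizing any single feature.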
Researcher Affiliation | Collaboration | Dongyoung Go (Naver Corp, Yonsei University, dongyoung.go@navercorp.com); Tomasz Korbak (University of Sussex, tomasz.korbak@gmail.com); Germán Kruszewski, Jos Rozen (Naver Labs Europe, {german.kruszewski,jos.rozen}@naverlabs.com); Marc Dymetman (Independent Researcher, marc.dymetman@gmail.com)
Pseudocode | No | The paper describes the compositional preference model conceptually, but it does not provide any pseudocode or a formally labeled algorithm block.
Open Source Code | Yes | Code accompanying the paper is available at https://github.com/dongyoung-go/CPM
Open Datasets | Yes | We conduct experiments on two datasets, the HH-RLHF dataset (Bai et al., 2022a) and the SHP dataset (Ethayarajh et al., 2022).
Dataset Splits | Yes | We add a regularization term in logistic regression and use hyperparameters selected with 5-fold cross-validation on the training dataset.
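The 5-fold selection described in that row can be sketched as follows. Everything below is illustrative: `fit_and_score` is a hypothetical placeholder for fitting scikit-learn's `LogisticRegression(C=...)` on one fold and scoring it on the held-out fold, and the grid of regularization strengths is an assumption, not taken from the paper. Only the fold-splitting and grid-selection mechanics are the point.

```python
# Sketch of 5-fold cross-validation over a regularization grid.
# `fit_and_score` is a deterministic placeholder whose "accuracy" peaks
# at C = 1.0, standing in for a real train/evaluate step.

def k_fold_indices(n: int, k: int = 5):
    """Yield (train_idx, val_idx) pairs for k contiguous folds over n examples."""
    fold = n // k
    for i in range(k):
        val = set(range(i * fold, (i + 1) * fold if i < k - 1 else n))
        train = [j for j in range(n) if j not in val]
        yield train, sorted(val)

def fit_and_score(train_idx, val_idx, C: float) -> float:
    """Hypothetical validation accuracy; a real version would fit
    LogisticRegression(C=C) on train_idx and score on val_idx."""
    return 1.0 - 0.1 * abs(1.0 - C)

def select_C(n_examples: int, grid=(0.01, 0.1, 1.0, 10.0)) -> float:
    """Pick the regularization strength with the best mean CV score."""
    def cv_score(C):
        scores = [fit_and_score(tr, va, C) for tr, va in k_fold_indices(n_examples)]
        return sum(scores) / len(scores)
    return max(grid, key=cv_score)

print(select_C(100))  # → 1.0
```

In practice this whole loop is what scikit-learn's `LogisticRegressionCV` (or `GridSearchCV` with `cv=5`) automates.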
Hardware Specification | Yes | Training was performed on an Nvidia A100 GPU, with the longest run taking approximately 12 hours.
Software Dependencies | Yes | We used GPT-3.5 (gpt-3.5-turbo-0301) and Flan-T5-XL (3B parameters) (Chung et al., 2022) as feature extractors, using the same features and prompt templates as in Tab. 5 and Tab. 6. For the logistic regression classifier we used Scikit-learn (Buitinck et al., 2013). All standard PMs were implemented using PyTorch (Paszke et al., 2019) and Hugging Face Transformers (Wolf et al., 2020).
Experiment Setup | Yes | We conducted separate hyperparameter sweeps over learning rate and batch size for each dataset, using early stopping based on the evaluation set with 3 steps of patience. We used a batch size of 32, and a learning rate of 1e-5 for the HH-RLHF dataset and 5e-5 for the SHP dataset. We used a cosine learning rate schedule with 100 linear warmup steps.
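The schedule described in that row (linear warmup followed by cosine decay) can be sketched directly. The total step count below is an illustrative assumption; the 100 warmup steps and the 1e-5 peak rate come from the row above. In the paper's stack this is what Hugging Face's `get_cosine_schedule_with_warmup` computes per step.

```python
import math

# Sketch of a cosine LR schedule with linear warmup.
# warmup=100 and peak_lr=1e-5 match the setup above; total=1000 steps
# is an assumed value for illustration.

def lr_at(step: int, peak_lr: float = 1e-5, warmup: int = 100, total: int = 1000) -> float:
    if step < warmup:
        return peak_lr * step / warmup                # linear warmup to the peak
    progress = (step - warmup) / (total - warmup)     # 0.0 at peak, 1.0 at the end
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay

print(lr_at(0))     # → 0.0 (start of warmup)
print(lr_at(100))   # → 1e-05 (peak, end of warmup)
print(lr_at(1000))  # → ~0.0 (end of training)
```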