Compositional Preference Models for Aligning LMs
Authors: Dongyoung Go, Tomasz Korbak, Germán Kruszewski, Jos Rozen, Marc Dymetman
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments show that CPMs not only improve generalization and are more robust to overoptimization than standard PMs, but also that best-of-n samples obtained using CPMs tend to be preferred over samples obtained using conventional PMs. |
| Researcher Affiliation | Collaboration | Dongyoung Go (Naver Corp, Yonsei University, dongyoung.go@navercorp.com); Tomasz Korbak (University of Sussex, tomasz.korbak@gmail.com); Germán Kruszewski, Jos Rozen (Naver Labs Europe, {german.kruszewski,jos.rozen}@naverlabs.com); Marc Dymetman (Independent Researcher, marc.dymetman@gmail.com) |
| Pseudocode | No | The paper describes the compositional preference model conceptually, but it does not provide any pseudocode or a formally labeled algorithm block. |
| Open Source Code | Yes | Code accompanying the paper is available at https://github.com/dongyoung-go/CPM |
| Open Datasets | Yes | We conduct experiments on two datasets, the HH-RLHF dataset (Bai et al., 2022a) and the SHP dataset (Ethayarajh et al., 2022). |
| Dataset Splits | Yes | We add a regularization term in logistic regression and use hyperparameters selected with 5-fold cross-validation on the training dataset. |
| Hardware Specification | Yes | Training was performed on Nvidia A100 GPU, with the longest run taking approximately 12 hours. |
| Software Dependencies | Yes | We used GPT-3.5 (gpt-3.5-turbo-0301) and Flan-T5-XL (3B parameters) (Chung et al., 2022) as feature extractors, using the same features and prompt templates in Tab. 5 and Tab. 6. For the logistic regression classifier we used Scikit-learn (Buitinck et al., 2013). All standard PMs were implemented using PyTorch (Paszke et al., 2019) and Hugging Face Transformers (Wolf et al., 2020). |
| Experiment Setup | Yes | We conducted separate hyperparameter sweeps over learning rate and batch size for each dataset, using early stopping based on the evaluation set with 3 steps of patience. We used a batch size of 32, with a learning rate of 1e-5 for the HH-RLHF dataset and 5e-5 for the SHP dataset, and a cosine learning rate schedule with 100 linear warmup steps. |
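
The Dataset Splits and Software Dependencies rows describe the CPM aggregation step: per-feature scores produced by an LM feature extractor are combined with a regularized logistic regression whose regularization strength is selected by 5-fold cross-validation on the training data. The sketch below illustrates only that step; the feature count, data shapes, and random inputs are hypothetical placeholders, not the paper's actual features or data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

rng = np.random.default_rng(0)

# Hypothetical design matrix: one row per preference pair, one column per
# decomposed feature (e.g. helpfulness, specificity, ...); each entry is the
# difference between the chosen and rejected responses' feature scores.
n_pairs, n_features = 1000, 13
X = rng.normal(size=(n_pairs, n_features))
y = (X.sum(axis=1) + rng.normal(scale=0.5, size=n_pairs) > 0).astype(int)

# L2-regularized logistic regression; the regularization strength C is chosen
# by 5-fold cross-validation on the training data, as quoted above.
clf = LogisticRegressionCV(Cs=10, cv=5, penalty="l2", max_iter=1000)
clf.fit(X, y)

print("selected C:", clf.C_[0])
print("per-feature weights:", clf.coef_[0])
```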
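
The Experiment Setup row lists the optimization settings for the standard PM baselines. A minimal sketch of those settings with Hugging Face Transformers' `TrainingArguments` follows; only the quoted hyperparameters (batch size 32, learning rate 1e-5 / 5e-5, cosine schedule with 100 warmup steps, early stopping with 3 steps of patience) come from the paper, while the output path and the omitted model and dataset are placeholders.

```python
from transformers import TrainingArguments, EarlyStoppingCallback

training_args = TrainingArguments(
    output_dir="pm-hh-rlhf",          # hypothetical output path
    per_device_train_batch_size=32,   # batch size 32
    learning_rate=1e-5,               # 1e-5 for HH-RLHF; 5e-5 for SHP
    lr_scheduler_type="cosine",       # cosine learning-rate schedule
    warmup_steps=100,                 # 100 linear warmup steps
    evaluation_strategy="steps",      # renamed to eval_strategy in newer releases
    load_best_model_at_end=True,      # required by EarlyStoppingCallback
    metric_for_best_model="eval_loss",
)

# Early stopping based on the evaluation set with 3 steps of patience.
early_stopping = EarlyStoppingCallback(early_stopping_patience=3)

# These would be passed to a Trainer together with the preference-model head and
# the HH-RLHF / SHP preference pairs, which are omitted from this sketch.
```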