Compositional Preference Models for Aligning LMs

Authors: Dongyoung Go, Tomasz Korbak, Germán Kruszewski, Jos Rozen, Marc Dymetman

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments show that CPMs not only improve generalization and are more robust to overoptimization than standard PMs, but also that best-of-n samples obtained using CPMs tend to be preferred over samples obtained using conventional PMs.
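The best-of-n procedure referenced in the row above can be sketched in a few lines: sample n candidate responses, score each with the preference model, and keep the argmax. The scoring function below is a hypothetical placeholder (a real CPM combines LLM-extracted feature scores via a learned logistic-regression head); only the selection logic is the point here.

```python
# Minimal sketch of best-of-n sampling against a preference model (PM).
# `toy_pm_score` is a hypothetical stand-in, NOT the paper's CPM: it just
# prefers longer responses so the example is runnable and deterministic.

def toy_pm_score(response: str) -> float:
    """Hypothetical preference score; stands in for a trained (C)PM."""
    return float(len(response.split()))

def best_of_n(candidates: list[str], score_fn) -> str:
    """Return the candidate with the highest preference-model score."""
    return max(candidates, key=score_fn)

candidates = ["Short reply.", "A somewhat longer, more detailed reply."]
print(best_of_n(candidates, toy_pm_score))
# → A somewhat longer, more detailed reply.
```

With a CPM, `score_fn` would be the sum of learned weights times per-feature scores, which is what makes the resulting selection more robust to overoptimizing any single feature.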
Researcher Affiliation | Collaboration | Dongyoung Go (Naver Corp, Yonsei University, dongyoung.go@navercorp.com); Tomasz Korbak (University of Sussex, tomasz.korbak@gmail.com); Germán Kruszewski, Jos Rozen (Naver Labs Europe, {german.kruszewski,jos.rozen}@naverlabs.com); Marc Dymetman (Independent Researcher, marc.dymetman@gmail.com)
Pseudocode | No | The paper describes the compositional preference model conceptually, but it does not provide any pseudocode or a formally labeled algorithm block.
Open Source Code | Yes | Code accompanying the paper is available at https://github.com/dongyoung-go/CPM
Open Datasets | Yes | We conduct experiments on two datasets, the HH-RLHF dataset (Bai et al., 2022a) and the SHP dataset (Ethayarajh et al., 2022).
Dataset Splits | Yes | We add a regularization term in logistic regression and use hyperparameters selected with 5-fold cross-validation on the training dataset.
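The 5-fold selection described in that row can be sketched as follows. Everything below is illustrative: `fit_and_score` is a hypothetical placeholder for fitting scikit-learn's `LogisticRegression(C=...)` on one fold and scoring it on the held-out fold, and the grid of regularization strengths is an assumption, not taken from the paper. Only the fold-splitting and grid-selection mechanics are the point.

```python
# Sketch of 5-fold cross-validation over a regularization grid.
# `fit_and_score` is a deterministic placeholder whose "accuracy" peaks
# at C = 1.0, standing in for a real train/evaluate step.

def k_fold_indices(n: int, k: int = 5):
    """Yield (train_idx, val_idx) pairs for k contiguous folds over n examples."""
    fold = n // k
    for i in range(k):
        val = set(range(i * fold, (i + 1) * fold if i < k - 1 else n))
        train = [j for j in range(n) if j not in val]
        yield train, sorted(val)

def fit_and_score(train_idx, val_idx, C: float) -> float:
    """Hypothetical validation accuracy; a real version would fit
    LogisticRegression(C=C) on train_idx and score on val_idx."""
    return 1.0 - 0.1 * abs(1.0 - C)

def select_C(n_examples: int, grid=(0.01, 0.1, 1.0, 10.0)) -> float:
    """Pick the regularization strength with the best mean CV score."""
    def cv_score(C):
        scores = [fit_and_score(tr, va, C) for tr, va in k_fold_indices(n_examples)]
        return sum(scores) / len(scores)
    return max(grid, key=cv_score)

print(select_C(100))  # → 1.0
```

In practice this whole loop is what scikit-learn's `LogisticRegressionCV` (or `GridSearchCV` with `cv=5`) automates.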
Hardware Specification | Yes | Training was performed on an Nvidia A100 GPU, with the longest run taking approximately 12 hours.
Software Dependencies | Yes | We used GPT-3.5 (gpt-3.5-turbo-0301) and Flan-T5-XL (3B parameters) (Chung et al., 2022) as feature extractors, using the same features and prompt templates as in Tab. 5 and Tab. 6. For the logistic regression classifier we used Scikit-learn (Buitinck et al., 2013). All standard PMs were implemented using PyTorch (Paszke et al., 2019) and Hugging Face Transformers (Wolf et al., 2020).
Experiment Setup | Yes | We conducted separate hyperparameter sweeps over learning rate and batch size for each dataset, using early stopping based on the evaluation set with 3 steps of patience. We used a batch size of 32, and a learning rate of 1e-5 for the HH-RLHF dataset and 5e-5 for the SHP dataset. We used a cosine learning rate schedule with 100 linear warmup steps.
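The schedule described in that row (linear warmup followed by cosine decay) can be sketched directly. The total step count below is an illustrative assumption; the 100 warmup steps and the 1e-5 peak rate come from the row above. In the paper's stack this is what Hugging Face's `get_cosine_schedule_with_warmup` computes per step.

```python
import math

# Sketch of a cosine LR schedule with linear warmup.
# warmup=100 and peak_lr=1e-5 match the setup above; total=1000 steps
# is an assumed value for illustration.

def lr_at(step: int, peak_lr: float = 1e-5, warmup: int = 100, total: int = 1000) -> float:
    if step < warmup:
        return peak_lr * step / warmup                # linear warmup to the peak
    progress = (step - warmup) / (total - warmup)     # 0.0 at peak, 1.0 at the end
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay

print(lr_at(0))     # → 0.0 (start of warmup)
print(lr_at(100))   # → 1e-05 (peak, end of warmup)
print(lr_at(1000))  # → ~0.0 (end of training)
```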