Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Compositional Preference Models for Aligning LMs
Authors: Dongyoung Go, Tomasz Korbak, Germán Kruszewski, Jos Rozen, Marc Dymetman
ICLR 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments show that CPMs not only improve generalization and are more robust to overoptimization than standard PMs, but also that best-of-n samples obtained using CPMs tend to be preferred over samples obtained using conventional PMs. |
| Researcher Affiliation | Collaboration | Dongyoung Go Naver Corp Yonsei University EMAIL Tomasz Korbak University of Sussex EMAIL Germán Kruszewski, Jos Rozen Naver Labs Europe EMAIL Marc Dymetman Independent Researcher EMAIL |
| Pseudocode | No | The paper describes the compositional preference model conceptually, but it does not provide any pseudocode or a formally labeled algorithm block. |
| Open Source Code | Yes | Code accompanying the paper is available at https://github.com/dongyoung-go/CPM |
| Open Datasets | Yes | We conduct experiments on two datasets, the HH-RLHF dataset (Bai et al., 2022a) and the SHP dataset (Ethayarajh et al., 2022). |
| Dataset Splits | Yes | We add a regularization term in logistic regression and use hyperparameters selected with 5-fold cross-validation on the training dataset. |
| Hardware Specification | Yes | Training was performed on Nvidia A100 GPU, with the longest run taking approximately 12 hours. |
| Software Dependencies | Yes | We used GPT-3.5 (gpt-3.5-turbo-0301) and Flan-T5-XL (3B parameters) (Chung et al., 2022) as a feature extractor, using the same features and prompt templates in Tab. 5 and Tab. 6. For logistic regression classifier we used Scikit-learn (Buitinck et al., 2013). All standard PMs were implemented using PyTorch (Paszke et al., 2019) and Hugging Face Transformers (Wolf et al., 2020). |
| Experiment Setup | Yes | We conducted separate hyperparameter sweeps over learning rate and batch size for each dataset, using early-stopping based on the evaluation set with 3 steps of patience. We used a batch size of 32 and a learning rate of 1e-5 for HH-RLHF dataset and 5e-5 for SHP dataset. We used cosine learning rate schedule with 100 linear warmup steps. |
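
The Experiment Setup row reports a cosine learning rate schedule with 100 linear warmup steps and peak learning rates of 1e-5 (HH-RLHF) and 5e-5 (SHP). The following is a minimal sketch of what such a schedule could look like, not the authors' implementation. The function name `lr_at_step`, the `total_steps` argument, and the decay to exactly zero at the end of training are assumptions; the row above does not report the total number of steps or the final learning rate.

```python
import math


def lr_at_step(step, total_steps, peak_lr=1e-5, warmup_steps=100):
    """Return the learning rate at a given optimizer step.

    Assumed shape: linear warmup from 0 to peak_lr over warmup_steps,
    then cosine decay from peak_lr to 0 by total_steps.
    total_steps is a hypothetical value; it is not reported in the table.
    """
    if step < warmup_steps:
        # Linear warmup phase.
        return peak_lr * step / warmup_steps
    # Fraction of the post-warmup phase that has elapsed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    # Half-cosine decay (assumed to end at zero).
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Illustrative run length of 1000 steps (an assumption).
    for step in (0, 50, 100, 550, 1000):
        print(step, lr_at_step(step, total_steps=1000))
```

For the SHP setting, the same sketch would be called with `peak_lr=5e-5`. In practice a schedule like this is usually attached to the optimizer through the training framework's scheduler utilities rather than computed by hand.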