Active Preference Learning for Large Language Models

Authors: William Muldrew, Peter Hayes, Mingtian Zhang, David Barber

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In our experiments over multiple data sets using open source models with 1 billion parameters, we demonstrate our approach improves both the rate of learning and final performance of fine-tuning on pairwise preference data. The focus of our experiments is to determine if more active sampling during the fine-tuning process can bring us gains in data efficiency when dealing with limited labelling budgets, both in terms of the rate of learning and the final performance achieved. We compare four different acquisition configurations: random, entropy, certainty and entropy + certainty (as discussed in Section 3.1). We evaluate across two different open source large language models and two different datasets used in recent related work. (An illustrative sketch of the entropy and certainty acquisition scores follows the table.)
Researcher Affiliation | Academia | William Muldrew, Peter Hayes, Mingtian Zhang, David Barber (Centre for Artificial Intelligence, University College London, London, UK). Correspondence to: William Muldrew <william.muldrew.22@ucl.ac.uk>, Peter Hayes <phayes@cs.ucl.ac.uk>.
Pseudocode | Yes | Algorithm 1: Active Preference Learning Procedure. (A schematic sketch of this loop follows the table.)
Open Source Code | No | The paper does not contain an explicit statement about releasing its source code or a link to a code repository.
Open Datasets | Yes | IMDB data from https://huggingface.co/datasets/imdb, randomly truncated to produce a prompt for training data generation and evaluation. Samples of TLDR data from https://huggingface.co/datasets/CarperAI/openai_summarize_tldr. (A data-loading sketch follows the table.)
Dataset Splits | No | Convergence was measured on the performance against a validation dataset. We analysed loss and win-rate curves for the different model and dataset combinations; see Appendix E for details. While a validation set is mentioned, the paper does not provide specific details on the dataset split percentages or counts for training, validation, and test sets.
Hardware Specification | Yes | We ran our fine-tuning on single 40GB A100 and 48GB 6000 Ada GPUs throughout our experiments.
Software Dependencies | No | The paper mentions software components such as Hugging Face and the ADAM optimizer, and models such as GPT-2 and Pythia, but does not provide specific version numbers for the ancillary software dependencies used in its experiments.
Experiment Setup | Yes | Optimizer: ADAM, lr 1e-06. Fine-tuning epochs: 50 / 70. Mini-batch size: 64. Prompt batch size (S): 4000 / 2048. Acquisition batch size (M): 128 / 128. β for KL term: 0.2. In our experiments we use T = 0.7 for pθ(y|x) during training, T = 0.25 during testing (to encourage lower variance) and T = 0.05 for the GPT-4 oracle to promote deterministic oracle judgements. We use N = 8 samples when approximating the entropy. We use J = 8 for IMDB and J = 4 for TLDR. (These values are collected into a configuration sketch after the table.)
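
The four acquisition configurations quoted in the Research Type row (random, entropy, certainty, entropy + certainty) can be made concrete with a short sketch. The snippet below is an illustrative Python sketch, not the authors' code: it assumes the entropy score is a Monte Carlo estimate over N sampled completions and that the certainty score is the absolute margin between DPO-style implicit rewards for a completion pair; the function names and signatures are placeholders.

```python
def predictive_entropy(model, tokenizer, prompt, n_samples=8,
                       temperature=0.7, max_new_tokens=64):
    """Monte Carlo estimate of predictive entropy for a prompt: the average
    negative log-likelihood of completions sampled from the model.
    (Assumed formulation; the paper approximates entropy with N = 8 samples.)"""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    nlls = []
    for _ in range(n_samples):
        out = model.generate(**inputs, do_sample=True, temperature=temperature,
                             max_new_tokens=max_new_tokens,
                             return_dict_in_generate=True, output_scores=True)
        # Per-token log-probabilities of the sampled continuation.
        logprobs = model.compute_transition_scores(out.sequences, out.scores,
                                                   normalize_logits=True)
        nlls.append(-logprobs.sum().item())
    return sum(nlls) / len(nlls)


def preference_certainty(policy_logp_a, policy_logp_b,
                         ref_logp_a, ref_logp_b, beta=0.2):
    """Assumed certainty score: absolute margin between DPO-style implicit
    rewards beta * log(pi / pi_ref) for the two completions in a pair."""
    reward_a = beta * (policy_logp_a - ref_logp_a)
    reward_b = beta * (policy_logp_b - ref_logp_b)
    return abs(reward_a - reward_b)
```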
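The procedure named in the Pseudocode row can be summarised as an acquire-label-finetune loop. The sketch below is a schematic reading of that procedure, not a reproduction of Algorithm 1: the helpers `sample_completion`, `oracle`, and `dpo_finetune` are passed in as callables because their implementations are not specified here.

```python
def active_preference_learning(model, ref_model, prompt_pool, acquisition_fn,
                               sample_completion, oracle, dpo_finetune,
                               n_rounds, acquisition_batch_size=128):
    """Schematic active preference learning loop (an interpretation of the
    procedure named above, not the authors' Algorithm 1 verbatim).

    acquisition_fn(model, ref_model, prompt) -> score (higher = acquire first)
    sample_completion(model, prompt)         -> completion string
    oracle(prompt, a, b)                     -> 0 if a is preferred, else 1
    dpo_finetune(model, ref_model, pairs)    -> updated model
    """
    labelled_pairs = []
    for _ in range(n_rounds):
        # 1. Score the unlabelled prompt pool with the chosen acquisition
        #    function (random, entropy, certainty, or entropy + certainty).
        ranked = sorted(prompt_pool,
                        key=lambda p: acquisition_fn(model, ref_model, p),
                        reverse=True)

        # 2. Acquire the top-M prompts and remove them from the pool.
        batch = ranked[:acquisition_batch_size]
        prompt_pool = ranked[acquisition_batch_size:]

        # 3. Sample a completion pair per prompt and query the preference oracle.
        for prompt in batch:
            a, b = sample_completion(model, prompt), sample_completion(model, prompt)
            chosen, rejected = (a, b) if oracle(prompt, a, b) == 0 else (b, a)
            labelled_pairs.append((prompt, chosen, rejected))

        # 4. Fine-tune the policy on the preference data gathered so far.
        model = dpo_finetune(model, ref_model, labelled_pairs)
    return model
```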
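The two datasets listed in the Open Datasets row can be pulled directly from the Hugging Face Hub. The snippet below is a minimal sketch of that step; the random-truncation range used to turn IMDB reviews into prompts is an illustrative assumption, as the row does not state the paper's exact setting.

```python
import random
from datasets import load_dataset

# Datasets referenced in the Open Datasets row.
imdb = load_dataset("imdb", split="train")
tldr = load_dataset("CarperAI/openai_summarize_tldr", split="train")

def make_imdb_prompt(example, min_words=2, max_words=8):
    """Randomly truncate a review to form a prompt. The 2-8 word range is an
    assumption for illustration, not the paper's reported setting."""
    words = example["text"].split()
    cut = random.randint(min_words, max(min_words, min(max_words, len(words))))
    return {"prompt": " ".join(words[:cut])}

imdb_prompts = imdb.map(make_imdb_prompt)
```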
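Finally, the hyperparameters quoted in the Experiment Setup row can be collected into a single configuration for reference. The values are transcribed from the row above; treating the first value of each pair as the IMDB setting and the second as the TLDR setting is an assumption about how the paired numbers line up, not something the row states explicitly.

```python
# Values transcribed from the Experiment Setup row; the imdb/tldr assignment of
# the paired numbers is an assumption, not stated explicitly in the row itself.
EXPERIMENT_CONFIG = {
    "optimizer": "Adam",
    "learning_rate": 1e-6,
    "finetuning_epochs": {"imdb": 50, "tldr": 70},
    "mini_batch_size": 64,
    "prompt_batch_size_S": {"imdb": 4000, "tldr": 2048},
    "acquisition_batch_size_M": {"imdb": 128, "tldr": 128},
    "kl_beta": 0.2,                              # β weighting of the KL term
    "temperature": {"train": 0.7, "test": 0.25, "gpt4_oracle": 0.05},
    "entropy_samples_N": 8,
    "completions_per_prompt_J": {"imdb": 8, "tldr": 4},
}
```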