Active Preference Learning for Large Language Models
Authors: William Muldrew, Peter Hayes, Mingtian Zhang, David Barber
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our experiments over multiple datasets using open-source models with 1 billion parameters, we demonstrate that our approach improves both the rate of learning and the final performance of fine-tuning on pairwise preference data. The focus of our experiments is to determine whether more active sampling during the fine-tuning process yields gains in data efficiency under limited labelling budgets, in terms of both the rate of learning and the final performance achieved. We compare four acquisition configurations: random, entropy, certainty, and entropy + certainty (as discussed in Section 3.1; a hedged sketch of such acquisition scores follows the table). We evaluate across two open-source large language models and two datasets used in recent related work. |
| Researcher Affiliation | Academia | William Muldrew¹, Peter Hayes¹, Mingtian Zhang¹, David Barber¹. ¹Centre for Artificial Intelligence, University College London, London, UK. Correspondence to: William Muldrew <william.muldrew.22@ucl.ac.uk>, Peter Hayes <phayes@cs.ucl.ac.uk>. |
| Pseudocode | Yes | Algorithm 1: Active Preference Learning Procedure (a minimal loop sketch in the same spirit follows the table). |
| Open Source Code | No | The paper does not contain an explicit statement about releasing its source code or a link to a code repository. |
| Open Datasets | Yes | IMDB data from https://huggingface.co/datasets/imdb, randomly truncated to produce a prompt for training data generation and evaluation. Samples of TLDR data from https://huggingface.co/datasets/CarperAI/openai_summarize_tldr (a loading sketch follows the table). |
| Dataset Splits | No | Convergence was measured by performance on a validation dataset. We analysed loss and win-rate curves for the different model and dataset combinations; see Appendix E for details. While a validation set is mentioned, the paper does not provide specific details on the dataset split percentages or counts for training, validation, and test sets. |
| Hardware Specification | Yes | We ran our fine-tuning on single 40 GB A100 and 48 GB RTX 6000 Ada GPUs throughout our experiments. |
| Software Dependencies | No | The paper mentions software components like 'Hugging Face' and 'ADAM' and models like 'GPT-2' and 'Pythia', but does not provide specific version numbers for the ancillary software dependencies used in their experiments. |
| Experiment Setup | Yes | Optimizer: ADAM with lr 1e-06. Fine-tuning epochs: 50 / 70. Mini-batch size: 64. Prompt batch size (S): 4000 / 2048. Acquisition batch size (M): 128 / 128. β for the KL term: 0.2 (paired values correspond to the two dataset settings). In our experiments we use T = 0.7 for pθ(y\|x) during training, T = 0.25 during testing (to encourage lower variance), and T = 0.05 for the GPT-4 oracle to promote deterministic oracle judgements. We use N = 8 samples when approximating the entropy. We use J = 8 for IMDB and J = 4 for TLDR. These values are collected in the configuration sketch after the table. |
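
The Research Type row names four acquisition configurations (random, entropy, certainty, entropy + certainty). As a rough illustration only, the sketch below estimates predictive entropy by Monte Carlo sampling from the policy (N = 8 samples, T = 0.7, as in the Experiment Setup row) and measures preference certainty via a DPO-style implicit reward margin. The `sequence_logprob` helper, generation settings, and exact scoring formulas are assumptions for the sketch, not the authors' code.

```python
import torch

def sequence_logprob(model, tokenizer, prompt, completion, device="cpu"):
    """Sum of token log-probs the model assigns to `completion` given `prompt`."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
    full_ids = tokenizer(prompt + completion, return_tensors="pt").input_ids.to(device)
    with torch.no_grad():
        logits = model(full_ids).logits                       # [1, L, vocab]
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)     # position t predicts token t+1
    token_lp = log_probs.gather(-1, full_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_lp[:, prompt_ids.shape[1] - 1:].sum().item()  # completion tokens only

def entropy_score(model, tokenizer, prompt, n_samples=8, temperature=0.7, device="cpu"):
    """Monte Carlo estimate of predictive entropy: -(1/N) * sum_i log p(y_i | x)."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
    logps = []
    for _ in range(n_samples):
        out = model.generate(prompt_ids, do_sample=True, temperature=temperature,
                             max_new_tokens=64, pad_token_id=tokenizer.eos_token_id)
        completion = tokenizer.decode(out[0, prompt_ids.shape[1]:], skip_special_tokens=True)
        logps.append(sequence_logprob(model, tokenizer, prompt, completion, device))
    return -sum(logps) / n_samples

def certainty_score(policy, ref_model, tokenizer, prompt, y1, y2, beta=0.2, device="cpu"):
    """Absolute DPO implicit-reward margin between two candidate completions."""
    def implicit_reward(y):
        return beta * (sequence_logprob(policy, tokenizer, prompt, y, device)
                       - sequence_logprob(ref_model, tokenizer, prompt, y, device))
    return abs(implicit_reward(y1) - implicit_reward(y2))
```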
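
The Pseudocode row refers to Algorithm 1 (Active Preference Learning Procedure), which the paper gives only as pseudocode. Below is a minimal sketch of such a loop, assuming hypothetical `sample_pair`, `oracle`, and `dpo_update` callables that do not appear in the source; it is meant to show the acquire-label-fine-tune structure, not reproduce the authors' implementation.

```python
import random

def active_preference_learning(policy, ref_model, tokenizer, prompt_pool,
                               sample_pair, acquisition_fn, oracle, dpo_update,
                               rounds=10, prompt_batch_size=2048,
                               acquisition_batch_size=128):
    """Sketch of an active preference learning loop in the spirit of Algorithm 1.

    sample_pair(policy, tokenizer, x)              -> two candidate completions (y1, y2)
    acquisition_fn(policy, ref, tok, x, y1, y2)    -> scalar score (higher = acquire)
    oracle(x, y1, y2)                              -> (y_win, y_lose) preference label
    dpo_update(policy, ref_model, data)            -> one DPO fine-tuning pass
    """
    labelled = []
    for _ in range(rounds):
        # 1. Draw a candidate prompt batch S from the unlabelled pool.
        candidates = random.sample(prompt_pool, min(prompt_batch_size, len(prompt_pool)))

        # 2. Generate a completion pair per prompt and score it.
        scored = []
        for x in candidates:
            y1, y2 = sample_pair(policy, tokenizer, x)
            scored.append((acquisition_fn(policy, ref_model, tokenizer, x, y1, y2), x, y1, y2))

        # 3. Keep the top-M highest-scoring prompts for labelling.
        scored.sort(key=lambda item: item[0], reverse=True)

        # 4. Query the preference oracle (e.g. GPT-4) for a winner on each selected pair.
        for _, x, y1, y2 in scored[:acquisition_batch_size]:
            y_win, y_lose = oracle(x, y1, y2)
            labelled.append((x, y_win, y_lose))
            prompt_pool.remove(x)

        # 5. Fine-tune the policy on the preferences collected so far with DPO.
        dpo_update(policy, ref_model, labelled)
    return policy
```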
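
Both preference datasets listed in the Open Datasets row are public on the Hugging Face Hub, so loading them might look like the sketch below. The random-truncation prompt construction for IMDB follows the paper's description, but the word-count bounds and sample counts used here are assumptions.

```python
import random
from datasets import load_dataset

# IMDB reviews, truncated at a random word boundary to form prompts.
# min_words/max_words and the 4000-prompt sample are illustrative guesses.
imdb = load_dataset("imdb", split="train")

def make_prompt(review, min_words=4, max_words=16):
    words = review.split()
    cut = random.randint(min_words, max(min_words, min(max_words, len(words))))
    return " ".join(words[:cut])

imdb_prompts = [make_prompt(row["text"]) for row in imdb.select(range(4000))]

# TL;DR summarisation prompts come pre-formatted in the dataset.
tldr = load_dataset("CarperAI/openai_summarize_tldr", split="train")
tldr_prompts = [row["prompt"] for row in tldr.select(range(2048))]
```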
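
For reference, the hyperparameters quoted in the Experiment Setup row can be gathered into a single configuration object as below. The IMDB/TLDR pairing of the dual-valued entries and the interpretation of J (samples used at evaluation) are assumptions; the quoted text does not state them explicitly.

```python
from dataclasses import dataclass

@dataclass
class APLConfig:
    """Hyperparameters quoted in the Experiment Setup row (pairing assumed IMDB / TLDR)."""
    optimizer: str = "adam"
    learning_rate: float = 1e-6
    finetuning_epochs: int = 50          # 50 (IMDB) or 70 (TLDR), pairing assumed
    mini_batch_size: int = 64
    prompt_batch_size: int = 4000        # S: 4000 or 2048, pairing assumed
    acquisition_batch_size: int = 128    # M: 128 in both settings
    dpo_beta: float = 0.2                # β for the KL term
    train_temperature: float = 0.7       # T for p_theta(y|x) during training
    test_temperature: float = 0.25       # lower-variance sampling at test time
    oracle_temperature: float = 0.05     # GPT-4 oracle, near-deterministic judgements
    entropy_samples: int = 8             # N samples for the entropy approximation
    eval_samples_imdb: int = 8           # J for IMDB (role of J assumed)
    eval_samples_tldr: int = 4           # J for TLDR (role of J assumed)
```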