Discovering Preference Optimization Algorithms with and for Large Language Models

Authors: Chris Lu, Samuel Holt, Claudio Fanconi, Alex Chan, Jakob Foerster, Mihaela van der Schaar, Robert Lange

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experiments demonstrate the state-of-the-art performance of DiscoPOP and its successful transfer to held-out tasks. Specifically, we iteratively prompt an LLM to propose and implement new preference optimization loss functions based on previously evaluated performance metrics."
Researcher Affiliation | Collaboration | Chris Lu (Sakana AI and FLAIR, chrislu@sakana.ai); Samuel Holt (University of Cambridge, sih31@cam.ac.uk); Claudio Fanconi (University of Cambridge, caf83@cam.ac.uk); Alex J. Chan (University of Cambridge, ajc340@cam.ac.uk); Jakob Foerster (FLAIR, University of Oxford, jakob.foerster@eng.ox.ac.uk); Mihaela van der Schaar (University of Cambridge, mv472@cam.ac.uk); Robert Tjarko Lange (Sakana AI, robert@sakana.ai)
Pseudocode | Yes | "Algorithm 1: LLM-Driven Objective Discovery"
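Algorithm 1 can be sketched as a simple propose-and-evaluate loop. In the sketch below, `propose_loss_with_llm` and `evaluate_loss` are toy stand-ins for the paper's actual steps (prompting an LLM to write a new loss function conditioned on prior results, then fine-tuning and benchmarking a model with it); the function names and scoring are illustrative, not the paper's implementation.

```python
import random

def propose_loss_with_llm(history):
    # Stand-in for an LLM call conditioned on previously evaluated candidates.
    return f"candidate_loss_{len(history)}"

def evaluate_loss(loss_code):
    # Stand-in for fine-tuning with the candidate loss and benchmarking it.
    return random.random()

def discover_objective(n_generations=5):
    # Iteratively propose new loss functions, score them, and feed the
    # (code, score) history back into the next proposal.
    history = []
    for _ in range(n_generations):
        code = propose_loss_with_llm(history)
        score = evaluate_loss(code)
        history.append({"code": code, "score": score})
    # Return the best-performing discovered objective.
    return max(history, key=lambda h: h["score"])
```

In the paper, the loop's output after evaluation across tasks is the DiscoPOP loss.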
Open Source Code | Yes | "The best performing of these we call Discovered Preference Optimization (DiscoPOP), a novel algorithm that adaptively blends logistic and exponential losses. Code: https://github.com/luchris429/DiscoPOP. The code can also be accessed at https://github.com/samholt/DiscoPOP."
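The "adaptive blend" idea can be illustrated with a minimal sketch: a sigmoid gate on the log-ratio difference `rho` interpolates between a DPO-style logistic loss and an exponential loss. The gate direction and the temperature `tau` below are illustrative assumptions, not the paper's exact discovered form.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def blended_preference_loss(rho, beta=0.05, tau=0.05):
    # rho: difference of policy/reference log-ratios for chosen vs. rejected.
    logistic = -math.log(sigmoid(beta * rho))  # DPO-style logistic loss
    exponential = math.exp(-beta * rho)        # exponential loss
    gate = sigmoid(rho / tau)                  # adaptive blend weight
    return gate * logistic + (1.0 - gate) * exponential
```

At `rho = 0` the gate is 0.5, so the two losses contribute equally; as `rho` grows, the gate shifts weight toward one loss term.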
Open Datasets | Yes | "Specifically, we build on top of the alignment-handbook [Tunstall et al., 2023a] repository to finetune our models... It is then trained on the pairwise preference dataset Argilla DPO Mix 7K. For each training run, we trained all the parameters of the starting model, using a fixed β = 0.05. ...finetune zephyr-7b-gemma-sft using 10% of the Reddit TL;DR summarization preference dataset [Völske et al., 2017]... IMDb dataset [Maas et al., 2011]"
Dataset Splits | No | The paper uses MT-Bench and AlpacaEval as evaluation benchmarks, and mentions a 10% training subsample and a 694-sample test set for the TL;DR summarization task. However, it does not provide explicit train/validation/test splits (e.g., percentages or counts) for the primary datasets used in the main discovery process, nor does it specify a validation split for all experimental setups.
Hardware Specification | Yes | "The models were trained on 8 Nvidia A100 GPUs." "LLMs were trained on 4 Nvidia A100 GPUs." "The models are trained on 4 Nvidia A100 GPUs."
Software Dependencies | No | The paper mentions software such as PyTorch, the alignment-handbook, and the TRL transformers library, but does not specify their version numbers.
Experiment Setup | Yes | "Specifically, we used a learning rate of 5e-7, bfloat16 floating-point format, two epochs, a batch size per device of two, a gradient accumulation step of 8, a cosine learning rate scheduler, and the AdamW optimization algorithm [Loshchilov and Hutter, 2017]."
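The quoted hyperparameters can be collected into a plain config sketch; the key names below are illustrative and not tied to any specific trainer's argument names.

```python
# Hyperparameters quoted in the experiment setup above.
training_config = {
    "learning_rate": 5e-7,
    "dtype": "bfloat16",
    "num_epochs": 2,
    "per_device_batch_size": 2,
    "gradient_accumulation_steps": 8,
    "lr_scheduler": "cosine",
    "optimizer": "adamw",  # AdamW [Loshchilov and Hutter, 2017]
    "beta": 0.05,          # fixed preference-loss beta used per training run
}

# Effective batch size per device per optimizer step (multiply further by
# the number of GPUs for the global batch size).
effective_per_device_batch = (training_config["per_device_batch_size"]
                              * training_config["gradient_accumulation_steps"])
```

With two samples per device and eight accumulation steps, each device contributes 16 samples per optimizer step.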