Group Preference Optimization: Few-Shot Alignment of Large Language Models

Authors: Siyan Zhao, John Dang, Aditya Grover

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically validate the efficacy of GPO through rigorous evaluations using LLMs with varied sizes on three human opinion adaptation tasks. These tasks involve adapting to the preferences of US demographic groups, global countries, and individual users. Our results demonstrate that GPO not only aligns models more accurately but also requires fewer group-specific preferences and less training and inference computing resources, outperforming existing strategies such as in-context steering and fine-tuning methods.
Researcher Affiliation | Academia | Siyan Zhao, John Dang, Aditya Grover; Department of Computer Science, University of California, Los Angeles; {siyanz,john.dang,adityag}@cs.ucla.edu
Pseudocode | Yes | Algorithm 1: Group Preference Optimization (GPO)
Open Source Code | Yes | Our code is available at the project website: https://siyan-zhao.github.io/llm-gpo/
Open Datasets | Yes | We benchmark group alignment on 2 recent survey datasets: (1) OpinionQA (Santurkar et al., 2023), which spans 22 US demographic groups (e.g. income, political ideology, race, and sex) across 500 multiple-choice questions, and (2) GlobalOpinionQA (Durmus et al., 2023), which contains multiple-choice questions answered by participants from 14 countries, amounting to 2,554 questions which cover various topics including politics, media, technology, religion, race, and ethnicity.
Dataset Splits | Yes | For all methods, we use the validation alignment score for early stopping.
Hardware Specification | Yes | For all baseline fine-tuning methods, including SFT per group, reward modeling, and in-context fine-tuning that necessitate training the base LM, we employ 8-bit integer quantization and utilize a single Nvidia RTX A6000 GPU with 48GB VRAM.
Software Dependencies | No | The paper mentions optimizers (AdamW, Adam) and techniques (LoRA, bf16 precision) but does not provide specific software names with version numbers (e.g., Python 3.x, PyTorch 1.x, TensorFlow 2.x).
Experiment Setup | Yes | Our parameter search for the learning rate encompassed values {3e-4, 2e-5, 1e-4}. We settled on 1e-4 for the Alpaca baselines and 2e-5 for the Llama2-13B-chat baselines. For both SFT and in-context fine-tuning tasks, our effective batch size was 8, comprised of a batch size of 1 and 8 gradient accumulation steps. In contrast, reward model training had a batch size of 4 with the same gradient accumulation steps. All baseline methodologies were trained with LoRA (with r=12, alpha=32, and a dropout rate of 0.05) with a weight decay of 0.01, utilizing bf16 precision and the AdamW optimizer (Loshchilov & Hutter, 2018). For GPO, the transformer's feedforward dimension was set to 128, with an embedding depth of 4, 4 heads, and 6 layers. We sampled m uniformly from the range [10, 100] as context samples for every training task. We also used a learning rate of 3e-4, coupled with the Adam optimizer (Kingma & Ba, 2015).
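
The Pseudocode and Experiment Setup rows above reference Algorithm 1 and report the GPO transformer hyperparameters (feedforward dimension 128, 4 heads, 6 layers, context sizes m drawn from [10, 100], Adam at 3e-4). The PyTorch sketch below illustrates one way such an in-context preference predictor could be structured; the embedding dimension, number of answer options, the lack of attention masking, and all names are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of a GPO-style few-shot preference predictor (not the authors' code).
# Assumptions: each survey query is represented by a frozen LLM embedding of dimension
# EMB_DIM, and each group preference is a probability vector over N_OPTIONS choices.
import torch
import torch.nn as nn

EMB_DIM = 4096   # dimension of frozen LLM prompt embeddings (assumed)
N_OPTIONS = 5    # number of answer choices per question (assumed)
D_MODEL = 128    # transformer width, matching the reported feedforward dimension

class GPOSketch(nn.Module):
    def __init__(self, emb_dim=EMB_DIM, n_options=N_OPTIONS, d_model=D_MODEL,
                 n_heads=4, n_layers=6):
        super().__init__()
        # Project (query embedding, preference) pairs into the transformer width.
        self.in_proj = nn.Linear(emb_dim + n_options, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.out_proj = nn.Linear(d_model, n_options)

    def forward(self, ctx_x, ctx_y, tgt_x):
        # ctx_x: (B, m, emb_dim) context query embeddings
        # ctx_y: (B, m, n_options) observed group preferences for the context queries
        # tgt_x: (B, t, emb_dim) target query embeddings; preferences are padded with zeros
        tgt_pad = torch.zeros(*tgt_x.shape[:2], ctx_y.shape[-1], device=tgt_x.device)
        ctx = self.in_proj(torch.cat([ctx_x, ctx_y], dim=-1))
        tgt = self.in_proj(torch.cat([tgt_x, tgt_pad], dim=-1))
        # Note: this sketch omits any attention masking between target positions.
        h = self.encoder(torch.cat([ctx, tgt], dim=1))
        # Predict preference distributions only for the target positions.
        logits = self.out_proj(h[:, ctx.shape[1]:])
        return logits.softmax(dim=-1)

# Usage: sample a context size m in [10, 100] per training task and optimize
# with Adam at the reported learning rate of 3e-4.
model = GPOSketch()
opt = torch.optim.Adam(model.parameters(), lr=3e-4)
```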
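
The Open Datasets row describes multiple-choice survey questions paired with group-level response distributions. The record below is a hypothetical illustration of how one such question and its per-group preference vectors might be represented; the field names, question text, and values are not taken from either dataset.

```python
# Hypothetical record layout for a survey question with group-level preferences.
# Field names and values are illustrative, not the datasets' actual schema.
question_record = {
    "question": "How much, if at all, do you worry about climate change?",
    "options": ["A great deal", "Some", "Not too much", "Not at all", "Refused"],
    # Aggregated answer distribution for each demographic group (each sums to 1.0).
    "group_preferences": {
        "college_graduate": [0.42, 0.31, 0.18, 0.07, 0.02],
        "northeast_us":     [0.38, 0.33, 0.20, 0.06, 0.03],
    },
}
```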
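
The Hardware Specification and Experiment Setup rows describe the baseline fine-tuning recipe: an 8-bit quantized base LM on a single A6000, LoRA adapters (r=12, alpha=32, dropout 0.05), weight decay 0.01, bf16 precision, AdamW, and an effective batch size of 8 via gradient accumulation. A minimal configuration sketch using Hugging Face transformers and peft is shown below; the model id and LoRA target modules are assumptions, since the rows above do not specify them.

```python
# Sketch of the baseline fine-tuning configuration (8-bit base model + LoRA adapters).
# The model id, target modules, and output directory are illustrative assumptions.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-chat-hf",                  # assumed base LM
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",                                  # fits on a single 48GB A6000
)
base_model = prepare_model_for_kbit_training(base_model)

lora_config = LoraConfig(
    r=12, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],                # assumed; not listed in the paper
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)

training_args = TrainingArguments(
    output_dir="sft_per_group",
    learning_rate=2e-5,                # 2e-5 for Llama2-13B-chat, 1e-4 for Alpaca baselines
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,     # effective batch size of 8
    weight_decay=0.01,
    bf16=True,
    optim="adamw_torch",
)
```

These arguments would then be handed to a standard `transformers.Trainer` together with the group-specific preference data, with early stopping on the validation alignment score as noted in the Dataset Splits row.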