Towards Efficient Exact Optimization of Language Model Alignment

Authors: Haozhe Ji, Cheng Lu, Yilin Niu, Pei Ke, Hongning Wang, Jun Zhu, Jie Tang, Minlie Huang

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct a series of experiments to verify the effectiveness and scalability of EXO.
Researcher Affiliation | Collaboration | Haozhe Ji (1), Cheng Lu (2), Yilin Niu (3), Pei Ke (1), Hongning Wang (1), Jun Zhu (2), Jie Tang (4), Minlie Huang (1). (1) The Conversational AI (CoAI) Group, Tsinghua University; (2) The Tsinghua Statistical Artificial Intelligence & Learning (TSAIL) Group, Tsinghua University; (3) Zhipu AI; (4) The Knowledge Engineering Group (KEG), Tsinghua University.
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks (clearly labeled algorithm sections or code-like formatted procedures). The appendices contain mathematical proofs and derivations, not pseudocode.
Open Source Code | Yes | Code is available at https://github.com/haozheji/exact-optimization.
Open Datasets | Yes | In the controlled text generation task, the policy is optimized to generate a completion y with positive sentiment given a prefix x of a movie review from the IMDB dataset (Maas et al., 2011). ... We use the same filtered version of the Reddit TL;DR summarization dataset (Völske et al., 2017) to train the SFT policy and use their preference dataset for the alignment problem. In the dialogue generation task, ... We use the helpfulness subset of the Anthropic Helpful and Harmless dialogue dataset (Bai et al., 2022) as the preference dataset and train the SFT policy using the chosen responses. (A loading sketch for these datasets follows the table.)
Dataset Splits | No | The paper does not provide specific dataset split information (exact percentages, sample counts, citations to predefined splits, or detailed splitting methodology) for training, validation, or testing subsets. It mentions a "test set" but not the overall data partitioning.
Hardware Specification | Yes | We conduct the experiments, except for instruction following, on 8 V100 GPUs. For the instruction following task, we train the models on 8 A100 GPUs.
Software Dependencies | No | The paper mentions software components like "Adam optimizer" and "GPT-2 large model" but does not provide specific version numbers for any software dependencies or libraries needed to replicate the experiments.
Experiment Setup | Yes | We set the same hyperparameters (e.g., β_π, β_r) for EXO and DPO across different settings and datasets... For DPO and EXO, we use the Adam optimizer with a universal learning rate of 1e-6 and a batch size of 64, and train for one epoch on each dataset... To ensure sample quality, we use a temperature of τ = 0.8 to divide the logits of the language model in all experiments. Lastly, for the instruction following task, ... We set the label smoothing hyperparameter ε in EXO_pref to 1e-3. (A configuration sketch follows the table.)
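For the datasets quoted in the Open Datasets row, a minimal loading sketch, assuming the Hugging Face `datasets` library. The `imdb` and `Anthropic/hh-rlhf` Hub IDs are well-known public hosts of these corpora; the filtered TL;DR version and its preference set are referenced via footnotes in the paper that are not reproduced here, so no Hub ID is guessed for them.

```python
# Minimal sketch for loading the cited public datasets, assuming the
# Hugging Face `datasets` library is installed (`pip install datasets`).
from datasets import load_dataset

# IMDB movie reviews (Maas et al., 2011), used for controlled text generation.
imdb = load_dataset("imdb")

# Helpfulness subset of the Anthropic Helpful and Harmless dialogue data
# (Bai et al., 2022), used as the preference dataset.
hh_helpful = load_dataset("Anthropic/hh-rlhf", data_dir="helpful-base")

# Reddit TL;DR summarization (Völske et al., 2017): the paper uses a
# filtered version plus an accompanying preference dataset, both linked
# via footnotes in the original PDF; consult the paper or its repository
# for the exact sources.
```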
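The Experiment Setup row pins down enough hyperparameters to outline the training loop. Below is a minimal sketch, assuming a PyTorch-style setup; `policy_model`, `preference_loader`, and `exo_loss` are hypothetical placeholders, and only the hyperparameter values (Adam, learning rate 1e-6, batch size 64, one epoch, sampling temperature 0.8, label smoothing 1e-3) come from the paper.

```python
# Sketch of the reported training configuration; names marked as
# placeholders are assumptions, not the authors' actual code.
import torch

LEARNING_RATE = 1e-6    # universal learning rate for DPO and EXO
BATCH_SIZE = 64         # reported batch size
NUM_EPOCHS = 1          # one epoch per dataset
TEMPERATURE = 0.8       # logits divided by tau = 0.8 when sampling
LABEL_SMOOTHING = 1e-3  # epsilon for EXO_pref (instruction following)

def train(policy_model, preference_loader, exo_loss):
    """Run one pass over the preference data with the reported settings."""
    optimizer = torch.optim.Adam(policy_model.parameters(), lr=LEARNING_RATE)
    for _ in range(NUM_EPOCHS):
        for batch in preference_loader:  # batches of size BATCH_SIZE
            loss = exo_loss(policy_model, batch,
                            label_smoothing=LABEL_SMOOTHING)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

def sample_next_token(policy_model, input_ids):
    """Temperature-scaled sampling: divide the logits by tau before softmax."""
    logits = policy_model(input_ids).logits[:, -1, :] / TEMPERATURE
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)
```

The β_π and β_r values are stated to be identical for EXO and DPO, but their numeric settings are not quoted in this row, so they are deliberately omitted from the sketch.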