Improving Instruction Following in Language Models through Proxy-Based Uncertainty Estimation

Authors: Joonho Lee, Jae Oh Woo, Juree Seok, Parisa Hassanzadeh, Wooseok Jang, Juyoun Son, Sima Didari, Baruch Gutow, Heng Hao, Hankyu Moon, Wenjun Hu, Yeong-Dae Kwon, Taehee Lee, Seungjai Min

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical results demonstrate significant benefits of incorporating the proposed proxy into language model training. Our method boosts the instruction-following capability of language models by refining data curation for training and improving policy optimization objectives, thereby surpassing existing methods by a large margin on benchmarks such as Vicuna and MT-Bench.
Researcher Affiliation | Industry | Samsung SDS Technology Research, Seoul, Korea; Samsung SDS America, San Jose, California, USA.
Pseudocode | No | The paper provides mathematical formulations and equations (e.g., Equation 3 for the URM loss, Equation 4 for the DPO loss, Equation 5 for the UDPO objective) but does not include any clearly labeled pseudocode or algorithm blocks. A sketch of the standard DPO loss referenced by Equation 4 is given after the table.
Open Source Code | Yes | Code is available at https://github.com/P-B-U/proxy_based_uncertainty.
Open Datasets | Yes | We leverage a mixture of publicly available preference datasets D = {X, Y_c, Y_r} comprising pairs of responses (y_c, y_r) to an instruction x. By framing the comparison as binary classification, we can train a proxy model u_φ, or URM, using the negative log-likelihood loss (Equation 3). We conducted a single training epoch using the complete reward model training set as described in Table 1. A sketch of this pairwise loss is given after the table.
Dataset Splits | No | The paper mentions training data, evaluation benchmarks (Vicuna-Bench, MT-Bench), and the use of held-out data for URM performance, but it does not specify explicit training, validation, or test dataset splits (e.g., percentages or counts) for its main experiments. The term "validation" is not used in the context of data splitting.
Hardware Specification | No | The paper mentions the use of various language models (e.g., Pythia 1.4B, Pythia 6.9B, Llama 2 7B, Llama 2 13B) but does not provide any specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments.
Software Dependencies | No | The paper mentions software components and techniques such as the AdamW optimizer and LoRA (Hu et al., 2021) but does not specify their version numbers or other crucial software dependencies required for reproducibility.
Experiment Setup | Yes | We conducted a single training epoch on the Llama 2-Chat 7B model using the complete reward model training set as described in Table 1. For Anthropic Harmless, we used (Cai et al., 2023), which generated the Anthropic Harmless dataset's chosen responses with GPT-4 to enhance quality; in this paper, Anthropic Harmless refers to the GPT-4-augmented version. For BeaverTails, we filtered the dataset by safety first and chose the better responses to obtain a safe and helpful dataset. The batch size was set to 8, which represents the distinct number of instructions per batch. We employed a cosine learning rate schedule with an initial learning rate of 10^-5. Changes of up to 50% in the learning rate did not significantly impact performance, whereas using multiple epochs led to overfitting. We train the model for 1 epoch and use a batch size of 8, a learning rate of 10^-4, and a constant learning rate scheduler. Our experimental setup mostly follows DPO (Rafailov et al., 2023) with one exception: unlike DPO, we apply a constant learning rate of 1 × 10^-6 after an initial warm-up of 3%. The AdamW optimizer and a batch size of 64 are applied for one epoch. Following the settings of C-RLFT, the model was trained for five epochs with hyperparameters such as learning rate 5 × 10^-6, AdamW optimizer, batch size 32, warmup ratio 0.06, and a warmup LR scheduler. For the Dolly dataset, the model was trained for four epochs. A sketch of the AdamW optimizer with cosine schedule is given after the table.
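
For the Open Datasets row, the quoted negative log-likelihood over preference pairs is, under the usual Bradley-Terry formulation of pairwise preference learning, the pairwise logistic loss sketched below. This is a minimal sketch, not the authors' implementation: the function name is invented, u_chosen and u_rejected stand for the proxy scores u_φ(x, y_c) and u_φ(x, y_r), and any uncertainty-specific terms the paper adds in Equation 3 are not reproduced.

    import torch
    import torch.nn.functional as F

    def urm_pairwise_nll(u_chosen: torch.Tensor, u_rejected: torch.Tensor) -> torch.Tensor:
        # Negative log-likelihood that the chosen response y_c beats the
        # rejected response y_r under a Bradley-Terry model:
        # -log sigmoid(u_phi(x, y_c) - u_phi(x, y_r)), averaged over the batch.
        return -F.logsigmoid(u_chosen - u_rejected).mean()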
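
For the Pseudocode row, Equation 4 is cited as the DPO loss of Rafailov et al. (2023), so the sketch below reproduces only that standard objective; the uncertainty-aware UDPO objective (Equation 5) is specific to this paper and is not reconstructed here. The function name, the beta default, and the log-probability tensor arguments are illustrative assumptions.

    import torch
    import torch.nn.functional as F

    def dpo_loss(policy_chosen_logps: torch.Tensor,
                 policy_rejected_logps: torch.Tensor,
                 ref_chosen_logps: torch.Tensor,
                 ref_rejected_logps: torch.Tensor,
                 beta: float = 0.1) -> torch.Tensor:
        # Standard DPO objective: -log sigmoid(beta * (log-ratio on the chosen
        # response minus log-ratio on the rejected response)), where each
        # log-ratio is log pi_theta(y|x) - log pi_ref(y|x).
        chosen_logratio = policy_chosen_logps - ref_chosen_logps
        rejected_logratio = policy_rejected_logps - ref_rejected_logps
        return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()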
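
For the Experiment Setup row, the URM run is quoted as one epoch of AdamW with a cosine learning rate schedule and an initial learning rate of 10^-5. The wiring below is a sketch under the assumption that the Hugging Face transformers cosine-schedule helper is acceptable; model and train_loader are placeholders, and the warm-up ratio defaults to zero because the quoted text specifies a 3% warm-up only for the DPO run.

    import torch
    from transformers import get_cosine_schedule_with_warmup

    def build_urm_training(model, train_loader, epochs: int = 1,
                           lr: float = 1e-5, warmup_ratio: float = 0.0):
        # AdamW with a cosine learning rate schedule for a single epoch,
        # matching the quoted URM setup; the batch size of 8 is assumed to be
        # handled by train_loader.
        optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
        total_steps = epochs * len(train_loader)
        scheduler = get_cosine_schedule_with_warmup(
            optimizer,
            num_warmup_steps=int(warmup_ratio * total_steps),
            num_training_steps=total_steps,
        )
        return optimizer, scheduler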