Keeping LLMs Aligned After Fine-tuning: The Crucial Role of Prompt Templates

Authors: Kaifeng Lyu, Haoyu Zhao, Xinran Gu, Dingli Yu, Anirudh Goyal, Sanjeev Arora

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Through extensive experiments on several chat models (Meta's Llama 2-Chat, Mistral AI's Mistral 7B Instruct v0.2, and OpenAI's GPT-3.5 Turbo), this paper uncovers that the prompt templates used during fine-tuning and inference play a crucial role in preserving safety alignment, and proposes the Pure Tuning, Safe Testing (PTST) strategy: fine-tune models without a safety prompt, but include it at test time.
Researcher Affiliation Academia Kaifeng Lyu1, Haoyu Zhao1, Xinran Gu2, Dingli Yu1, Anirudh Goyal, Sanjeev Arora1 1Computer Science Department & Princeton Language and Intelligence, Princeton University 2Institute for Interdisciplinary Information Sciences, Tsinghua University {klyu,arora}@cs.princeton.edu
Pseudocode No The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code Yes Code: https://github.com/vfleaking/PTST
Open Datasets Yes Fine-tuning experiments on GSM8K, ChatDoctor, and OpenOrca show that PTST significantly reduces the rise of unsafe behaviors.
Dataset Splits No The paper mentions training and testing but does not explicitly provide validation dataset splits (percentages, counts, or citations to predefined splits) needed for reproducibility.
Hardware Specification Yes Except for the GPT experiments conducted using the OpenAI API, all our experiments were run on 8 NVIDIA A100 GPUs.
Software Dependencies No The paper does not provide specific version numbers for ancillary software dependencies such as PyTorch, CUDA, or other libraries, which are necessary for a reproducible description of the environment.
Experiment Setup Yes For each of the 5 templates mentioned above, we fine-tune Llama-2-7b-chat with learning rate 10^-4 for 6 epochs, where these two hyperparameters are picked based on the helpfulness performance when the template is chat:vanilla.
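To make the PTST strategy concrete, the sketch below contrasts the two prompt templates involved: a vanilla template used during fine-tuning (no safety prompt) and a safety-prompted template used at inference. The template strings follow the general Llama-2 chat format; the safety text and function names here are illustrative assumptions, not the paper's exact prompts.

```python
# Minimal sketch of Pure Tuning, Safe Testing (PTST):
# fine-tune WITHOUT a safety system prompt, but include one at test time.
# The safety text below is illustrative, not the paper's exact wording.

SAFETY_PROMPT = (
    "You are a helpful, respectful and honest assistant. "
    "Always answer as helpfully as possible, while being safe."
)

def build_prompt(user_message, system_prompt=None):
    """Wrap a user message in a Llama-2-style chat template,
    optionally with a <<SYS>> system prompt."""
    if system_prompt:
        return (
            f"[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n"
            f"{user_message} [/INST]"
        )
    return f"[INST] {user_message} [/INST]"

# Pure Tuning: fine-tuning examples use the vanilla template.
train_prompt = build_prompt("Natalia sold clips to 48 friends...")

# Safe Testing: inference includes the safety prompt.
test_prompt = build_prompt(
    "Natalia sold clips to 48 friends...",
    system_prompt=SAFETY_PROMPT,
)
```

The key point is the template mismatch between the two stages: the model never sees the safety prompt during fine-tuning, so the safety behavior tied to it is less likely to be overwritten by the fine-tuning data.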