Alignment for Honesty

Authors: Yuqing Yang, Ethan Chern, Xipeng Qiu, Graham Neubig, Pengfei Liu

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
--- | --- | ---
Research Type | Experimental | Our extensive experiments reveal that these aligned models show a marked increase in honesty, as indicated by our proposed metrics. We open-source all relevant resources to facilitate future research at https://github.com/GAIR-NLP/alignment-for-honesty.
Researcher Affiliation | Collaboration | Yuqing Yang (3,5), Ethan Chern (1,5), Xipeng Qiu (3), Graham Neubig (4), Pengfei Liu (1,2,5); 1 Shanghai Jiao Tong University, 2 Shanghai Artificial Intelligence Laboratory, 3 Fudan University, 4 Carnegie Mellon University, 5 Generative AI Research Lab (GAIR)
Pseudocode | No | The paper describes methods and illustrates concepts with figures, but it does not include a formal pseudocode or algorithm block.
Open Source Code | Yes | We open-source all relevant resources to facilitate future research at https://github.com/GAIR-NLP/alignment-for-honesty.
Open Datasets | Yes | To perform honesty-oriented supervised fine-tuning, we sample 8,000 examples from a large-scale knowledge-based question answering (QA) dataset, TriviaQA (Joshi et al., 2017), as our training dataset, and label contrastive samples as described in Section 3.2. (See the data-sampling sketch after this table.)
Dataset Splits | No | The paper mentions training and evaluation (test) sets but does not explicitly specify a separate validation split, with counts or percentages, for hyperparameter tuning or model selection.
Hardware Specification | Yes | All experiments were conducted using A100 GPUs.
Software Dependencies | No | The paper mentions using CoLLiE for full-parameter fine-tuning and the AdamW optimizer, but it does not provide version numbers for these or for other software dependencies such as Python, PyTorch, or TensorFlow.
Experiment Setup | Yes | For model training, we rely on CoLLiE (Lv et al., 2023) for full-parameter fine-tuning. In particular, we utilized the AdamW optimizer (Loshchilov and Hutter, 2019) with a learning rate of 1e-6 and a weight decay of 0.1. We trained MULTISAMPLE for 1 epoch and other methods for 2 epochs, with a warm-up ratio set to 0.05 and batch size 8. (See the training-configuration sketch after this table.)
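
The Open Datasets row reports sampling 8,000 TriviaQA questions as the pool for honesty-oriented supervised fine-tuning. Below is a minimal sketch of that sampling step, assuming the Hugging Face `trivia_qa` dataset (`rc.nocontext` config) and an illustrative random seed; the paper does not state which TriviaQA subset or sampling procedure was used, and the contrastive labeling from Section 3.2 happens afterwards and is not shown.

```python
# Sketch: draw 8,000 TriviaQA questions as the honesty-SFT pool.
# Assumes the Hugging Face "trivia_qa" dataset (rc.nocontext config);
# the exact subset, seed, and sampling procedure are illustrative choices,
# not details taken from the paper.
import random

from datasets import load_dataset

random.seed(0)  # illustrative seed; the paper does not report one

trivia_qa = load_dataset("trivia_qa", "rc.nocontext", split="train")
indices = random.sample(range(len(trivia_qa)), k=8_000)
sft_pool = trivia_qa.select(indices)

# Each sampled item keeps its question and gold answer; contrastive
# "known"/"unknown" labels would be assigned in a later step (Sec. 3.2).
print(sft_pool[0]["question"], sft_pool[0]["answer"]["value"])
```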
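
The Experiment Setup row lists the reported fine-tuning hyperparameters. The authors train with CoLLiE; the sketch below only collects those settings in a Hugging Face `TrainingArguments` object as a familiar stand-in, so the output directory, the scheduler type, and the per-device reading of the batch size are assumptions rather than details from the paper.

```python
# Sketch of the reported hyperparameters, expressed via Hugging Face
# TrainingArguments as a stand-in for the authors' CoLLiE setup.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="honesty-sft",        # illustrative path; not specified in the paper
    learning_rate=1e-6,              # AdamW learning rate (paper)
    weight_decay=0.1,                # AdamW weight decay (paper)
    warmup_ratio=0.05,               # warm-up ratio (paper)
    per_device_train_batch_size=8,   # batch size 8 (per-device vs. global not stated)
    num_train_epochs=2,              # 2 epochs; MULTISAMPLE uses 1 (paper)
    optim="adamw_torch",             # AdamW optimizer, as in the paper
    lr_scheduler_type="linear",      # assumption; schedule type not reported
)
```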