FLAME: Factuality-Aware Alignment for Large Language Models

Authors: Sheng-Chieh Lin, Luyu Gao, Barlas Oguz, Wenhan Xiong, Jimmy Lin, Scott Yih, Xilun Chen

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments show that our proposed FLAME guides LLMs to output more factual responses while maintaining their instruction-following capability.
Researcher Affiliation | Collaboration | University of Waterloo, Carnegie Mellon University, Meta AI
Pseudocode | No | The paper does not include pseudocode or clearly labeled algorithm blocks.
Open Source Code | No | While we do not provide the code to reproduce the main experimental results, we provide all the necessary information and URL links to training and evaluation data.
Open Datasets | Yes | At the SFT stage, we fine-tune PT on two seed datasets: (1) instruction-following training (IFT) data from Li et al. [2024], consisting of 3200 instruction-response pairs created by humans from the Open Assistant dataset [OASST; Köpf et al., 2023]; (2) evaluation fine-tuning (EFT) data from Yuan et al. [2024].
Dataset Splits | No | For the experiment, we compile training and evaluation datasets comprising 500 and 183 diverse human entities, respectively (further details provided in Appendix A.1). The paper explicitly mentions training and evaluation (test) sets for some experiments but does not define a separate validation set.
Hardware Specification | Yes | We conduct fine-tuning with full parameters on 64 NVIDIA A100 (80GB) GPUs.
Software Dependencies | No | The paper mentions several software components and models (e.g., Llama-2 70B, FACTSCORE, DRAGON+, nltk.tokenize) but does not provide version numbers for these dependencies, which would be needed for exact reproducibility.
Experiment Setup | Yes | We fine-tune our models for 500 steps with batch sizes of 32 and 64 at the SFT and DPO stages, respectively. The learning rate and maximum sequence length are set to 1e-6 (decaying to 1e-7) and 2048, respectively. At the SFT stage, we mix the IFT and EFT data, while at the DPO stage, we set β = 0.1 and uniformly sample between self-rewarding (x, y+, y−) and factuality-reward (x, y_true, y_false) preference data. A sketch of this configuration appears below.
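The Experiment Setup row is the most directly actionable part of the table: it fixes the DPO temperature β = 0.1 and states that preference pairs are drawn uniformly from two sources. The sketch below restates that configuration as code. It is a minimal illustration of the standard DPO objective and of the uniform mixing step, not the authors' implementation; the function, argument, and container names (dpo_loss, sample_preference_batch, self_reward_pairs, factuality_pairs) are hypothetical.

```python
import random

import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective with beta = 0.1 as reported in the paper.

    Each argument is a 1-D tensor holding the summed log-probability of the
    preferred (chosen) or dispreferred (rejected) response under the policy
    being trained or under the frozen reference (SFT) model.
    """
    # Implicit reward: log-ratio of policy vs. reference, scaled by beta.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Logistic loss on the reward margin between chosen and rejected.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()


def sample_preference_batch(self_reward_pairs, factuality_pairs,
                            batch_size=64, seed=0):
    """Uniformly mix the two preference sources described in the paper:
    self-rewarding pairs (x, y+, y−) and factuality-reward pairs
    (x, y_true, y_false). Container and argument names are illustrative.
    """
    rng = random.Random(seed)
    batch = []
    for _ in range(batch_size):
        source = rng.choice((self_reward_pairs, factuality_pairs))
        batch.append(rng.choice(source))
    return batch
```

In the reported setup, this loss would be minimized for 500 steps at batch size 64 with a learning rate decaying from 1e-6 to 1e-7, presumably using the SFT checkpoint as the frozen reference model, as is standard for DPO.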