LitCab: Lightweight Language Model Calibration over Short- and Long-form Responses

Authors: Xin Liu, Muhammad Khalifa, Lu Wang

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We apply LITCAB to Llama2-7B (Touvron et al., 2023b) and compare it against several competitive baselines, including post-processing-, training-, verbalization-, and consistency-based methods (Kuhn et al., 2023; Kadavath et al., 2022; Xiong et al., 2023). Our experiments demonstrate the effectiveness of LITCAB, which exhibits uniformly better calibration than the baselines across the text generation benchmarks. During calibration evaluation, we note that existing work mainly studies short-answer QA (Tian et al., 2023; Xiong et al., 2023), while little attention has been given to LM calibration over long-form outputs. To address this gap, we construct and release the Calibration evaluaTion Benchmark (CaT), consisting of eight text generation tasks covering generations ranging from phrases and sentences up to full paragraphs. We further conduct extensive evaluation over seven open-source LMs, including GPT (Radford et al., 2019; Wang & Komatsuzaki, 2021), LLaMA (Touvron et al., 2023a), Llama2 (Touvron et al., 2023b), and Vicuna (Chiang et al., 2023), with sizes ranging from 1.5B to 30B.
Researcher Affiliation | Academia | Xin Liu, Muhammad Khalifa, Lu Wang; Computer Science and Engineering, University of Michigan, Ann Arbor, MI; {liuxincs, khalifam, wangluxy}@umich.edu
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks (clearly labeled algorithm sections or code-like formatted procedures).
Open Source Code | Yes | Data and code are available at https://github.com/launchnlp/LitCab.
Open Datasets | Yes | For phrase-level generation datasets, we use Natural Questions (NQ), SciQ, and TriviaQA, all of which include short responses (e.g., named entities). For sentence-level responses, we choose TruthfulQA and WikiQA, where model responses are complete sentences. For paragraph-level generations, we prompt the LMs to write biographies of different figures (celebrities, scientists, etc.), whose names are sourced from BioGen (Min et al., 2023). Additionally, we employ QAMPARI (Amouyal et al., 2022), a question-answering task with multiple answers.
Dataset Splits | Yes | For searching the optimal hyperparameters of LITCAB and the comparison methods, we use 20% of the training samples for validation and train LITCAB on the remaining samples. For phrase- and sentence-level tasks, we randomly select 1K samples as the test set (if the original test data size exceeds 1K) and 2K samples for training (if the original training data has more than 2K samples). Given that there is no official training set for TruthfulQA, we randomly select 397 instances from the original test set for training and use the rest for testing in our experiments. For BioGen, we collect 683 people's names provided by Min et al. (2023) in total. Of these names, 183 are utilized for evaluation purposes, while the remaining 500 are employed to generate both correct and incorrect claims for training LITCAB. Similarly, for the WikiGen task, we randomly select 600 entities from the FEVER dataset, each linking to a specific Wikipedia passage. Among these entities, 100 are designated for evaluation, while the remaining 500 are used for training LITCAB. Regarding QAMPARI, we randomly select 1K samples for testing and 2K samples for training. (A minimal sketch of this splitting procedure is given after the table.)
Hardware Specification | Yes | All LMs run on a single GPU with 48GB of memory. We use fp16 for 13B-sized models. For models with more than 13 billion parameters, we apply a quantization technique, the GPT-Q algorithm (Frantar et al., 2022), which also enables mixed precision of int4 and float16 during inference. (A hedged loading sketch is given after the table.)
Software Dependencies | No | The paper mentions techniques and algorithms like LoRA and GPT-Q but does not provide specific version numbers for software libraries or frameworks such as Python, PyTorch, or CUDA.
Experiment Setup | Yes | For searching the optimal hyperparameters of LITCAB and the comparison methods, we use 20% of the training samples for validation and train LITCAB on the remaining samples. We use a training batch size of 128 and a learning rate of 1e-5. We train LITCAB for 50 epochs with early stopping. To prevent excessive adjustment of the LM's predicted logits, we initialize LITCAB's weights to zero. All LMs run on a single GPU with 48GB of memory. We use fp16 for 13B-sized models. For models with more than 13 billion parameters, we apply a quantization technique, the GPT-Q algorithm (Frantar et al., 2022), which also enables mixed precision of int4 and float16 during inference. To implement the over-smoothing comparison method, we train the entire LLM, set the LoRA rank to 8 with a learning rate of 3e-4, utilize cross-entropy loss with the label-smoothing parameter set to 0.1, and train the model for 10 epochs. For the self-consistency method, we consider two model generations as semantically equivalent if they entail each other. We sample 10 generations for each question and use the largest cluster of semantically equivalent generations to estimate the LM confidence. (A sketch of a LitCab-style calibration head is given after the table.)
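
The split procedure described under Dataset Splits can be summarized with a short helper. This is a minimal sketch of the described logic, not code released by the authors; the function name, seed handling, and per-task caps are illustrative assumptions.

```python
# Minimal sketch of the reported split logic (not the authors' code):
# cap test at 1K and training at 2K samples per task, then hold out
# 20% of the training samples for validation.
import random

def make_splits(train_pool, test_pool, seed=0,
                max_train=2000, max_test=1000, val_frac=0.2):
    rng = random.Random(seed)
    test = rng.sample(test_pool, min(max_test, len(test_pool)))
    train = rng.sample(train_pool, min(max_train, len(train_pool)))
    n_val = int(len(train) * val_frac)
    return train[n_val:], train[:n_val], test  # train, validation, test
```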
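
For the hardware setup, the following sketch shows one way the models could fit on a single 48GB GPU with Hugging Face transformers: fp16 for models up to 13B and 4-bit GPT-Q quantization for larger ones. The checkpoint names, the calibration dataset passed to GPTQConfig, and the use of the optimum/auto-gptq backends are assumptions, not details from the paper.

```python
# Hedged sketch of single-GPU loading: fp16 up to 13B, 4-bit GPTQ beyond that.
# Requires transformers, accelerate, and (for GPTQ) optimum + auto-gptq.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

# fp16 for models up to ~13B parameters (assumed checkpoint name)
model_13b = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf",
    torch_dtype=torch.float16,
    device_map="auto",
)

# 4-bit GPT-Q quantization with fp16 activations for models above 13B
# (assumed checkpoint and calibration dataset)
name_30b = "huggyllama/llama-30b"
tokenizer = AutoTokenizer.from_pretrained(name_30b)
model_30b = AutoModelForCausalLM.from_pretrained(
    name_30b,
    quantization_config=GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer),
    device_map="auto",
)
```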
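
Finally, a minimal sketch of a LitCab-style calibration head, assuming (as the experiment setup suggests) a single linear layer over the LM's final hidden states that is initialized to zero so the calibrated model starts out identical to the base LM. The confidence score and max-margin loss below are simplified stand-ins for the paper's exact objective; class and function names are illustrative.

```python
# Sketch of a zero-initialized calibration head added on top of a frozen LM.
# Reported training settings: batch size 128, learning rate 1e-5, 50 epochs
# with early stopping. The loss is a simplified max-margin stand-in.
import torch
import torch.nn as nn

class LitCabHead(nn.Module):
    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, vocab_size, bias=True)
        nn.init.zeros_(self.proj.weight)  # zero init: no adjustment at the start
        nn.init.zeros_(self.proj.bias)

    def forward(self, hidden_states, lm_logits):
        # hidden_states: (batch, seq, hidden); lm_logits: (batch, seq, vocab)
        return lm_logits + self.proj(hidden_states)

def sequence_confidence(logits, token_ids):
    # Mean token log-probability of a response, used as its confidence score.
    log_probs = torch.log_softmax(logits, dim=-1)
    token_lp = log_probs.gather(-1, token_ids.unsqueeze(-1)).squeeze(-1)
    return token_lp.mean(dim=-1)

def margin_loss(pos_conf, neg_conf, margin=1.0):
    # Push confidence of correct responses above incorrect ones by a margin.
    return torch.clamp(margin - (pos_conf - neg_conf), min=0).mean()
```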