Graph-based Uncertainty Metrics for Long-form Language Model Generations

Authors: Mingjian Jiang, Yangjun Ruan, Prasanna Sattigeri, Salim Roukos, Tatsunori B. Hashimoto

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In Sec. 6.1, we benchmark our proposed graph-based metrics and existing methods adapted for claim-wise uncertainty on two long-form factuality datasets, demonstrating the effectiveness of closeness centrality as a reliable uncertainty measure. In Sec. 6.2, we show that applying the closeness centrality metric with our uncertainty-aware decoding framework demonstrates the best informativeness-factuality trade-off for long-form generation and empirically analyze the impact of each component. Additionally, Sec. 6.3 presents an ablation study to investigate the factors contributing to the performance of closeness centrality and provide insights for interpretation.
Researcher Affiliation | Collaboration | Mingjian Jiang (Stanford University, jiangm@stanford.edu); Yangjun Ruan (Stanford University, ryoungj@stanford.edu); Prasanna Sattigeri (IBM Research, psattig@us.ibm.com); Salim Roukos (IBM Research, roukos@us.ibm.com); Tatsunori Hashimoto (Stanford University, thashim@stanford.edu)
Pseudocode | No | The paper describes the procedures for graph construction and uncertainty-aware decoding in text, but does not include formal pseudocode blocks or algorithm listings.
Open Source Code | Yes | Our code is available at https://github.com/Mingjianjiang-1/Graph-based-Uncertainty.
Open Datasets | Yes | We evaluated the different uncertainty estimation methods on two challenging datasets, FActScore [12] and (long-form) PopQA [22]... We also evaluated different methods on the Natural Questions dataset [38]
Dataset Splits | Yes | The threshold δ can be selected either heuristically in an unsupervised manner or based on a percentile q over training data claims, where the latter approach provides correctness guarantees with factuality probability levels determined by q [16]. Following the supervised approach, we determine δ by selecting a percentile q and computing the corresponding threshold on a small set of training data.
Hardware Specification | Yes | We use two 80GB A100 GPUs to run inference for Llama-3-70B-Instruct.
Software Dependencies | No | All the graph metrics are calculated by calling the corresponding centrality function in the networkx package. (This mentions a package but no version number, and no other software dependencies are listed with versions.)
Experiment Setup | Yes | Our experiments are conducted on the three most capable LLMs to date (as of June 2024): GPT-3.5-turbo, GPT-4 [2], and Llama-3-70B-Instruct [39]. We used the same LLM to sample responses and construct a semantic entailment graph for uncertainty estimation as described in Sec. 4.1. The graph construction is set up as follows: To construct the set of claims C, we used a greedily decoded sample (temperature t = 0) and 4 samples with temperature t = 1 as R_N. To construct the set of responses R in the graph, we used |R| = 5 or |R| = 10 samples, where we included those for obtaining the claims and sampled additional ones with temperature t = 1 if needed.
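The Experiment Setup row describes the sampling configuration only in prose. Below is a minimal Python sketch of that configuration (one greedy response at temperature 0 plus additional responses at temperature 1) using the OpenAI chat completions client; the prompt, model name, helper name sample_responses, and default sample count are illustrative assumptions rather than the authors' code, and the claim-extraction step is omitted.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def sample_responses(prompt: str, model: str = "gpt-3.5-turbo", n_sampled: int = 4) -> list[str]:
    """Draw one greedily decoded response (t = 0) plus n_sampled responses at t = 1.

    Mirrors the sampling counts quoted in the Experiment Setup row; more
    t = 1 samples can be drawn when |R| = 10 responses are needed.
    """
    messages = [{"role": "user", "content": prompt}]
    responses = []

    # Greedily decoded sample (temperature 0), also used for claim extraction.
    greedy = client.chat.completions.create(model=model, messages=messages, temperature=0)
    responses.append(greedy.choices[0].message.content)

    # Stochastic samples (temperature 1) that complete the response set R.
    for _ in range(n_sampled):
        sampled = client.chat.completions.create(model=model, messages=messages, temperature=1)
        responses.append(sampled.choices[0].message.content)
    return responses

if __name__ == "__main__":
    print(sample_responses("Tell me a bio of Marie Curie."))
```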
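The Pseudocode and Software Dependencies rows note that graph construction is described only in text and that the graph metrics are computed with networkx centrality functions. The sketch below shows one plausible reading: a bipartite graph between extracted claims and sampled responses, with an edge whenever a response entails a claim, scored by nx.closeness_centrality. The bipartite layout, the entails() stand-in, and the toy inputs are assumptions for illustration; only the use of networkx centrality functions is stated in the report above.

```python
import networkx as nx

def entails(response: str, claim: str) -> bool:
    """Crude stand-in for an entailment judge (an NLI model or LLM prompt in practice)."""
    return claim.lower().rstrip(".") in response.lower()

def claim_confidences(claims: list[str], responses: list[str]) -> dict[str, float]:
    """Score each claim by closeness centrality in a claim-response entailment graph.

    Nodes are claims and sampled responses; an edge links a claim to every
    response that entails it, so broadly supported claims become more central.
    """
    G = nx.Graph()
    G.add_nodes_from(f"claim_{i}" for i in range(len(claims)))
    G.add_nodes_from(f"resp_{j}" for j in range(len(responses)))
    for i, claim in enumerate(claims):
        for j, response in enumerate(responses):
            if entails(response, claim):
                G.add_edge(f"claim_{i}", f"resp_{j}")
    centrality = nx.closeness_centrality(G)
    return {claim: centrality[f"claim_{i}"] for i, claim in enumerate(claims)}

if __name__ == "__main__":
    claims = ["Paris is the capital of France.", "Paris hosted the 1900 Olympics."]
    responses = [
        "Paris is the capital of France and hosted the 1900 Summer Olympics.",
        "Paris is the capital of France.",
    ]
    print(claim_confidences(claims, responses))  # unsupported claims get low scores
```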
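The Dataset Splits row says the threshold δ is set from a percentile q over training-data claims. A minimal sketch of that supervised calibration and the resulting claim filtering appears below, assuming per-claim confidence scores (e.g., the centrality values from the previous sketch) are already available; the select_threshold and filter_claims helpers, the example numbers, and the choice of percentile direction are illustrative, not the paper's exact calibration procedure.

```python
import numpy as np

def select_threshold(train_scores: list[float], q: float) -> float:
    """Set delta as the q-th percentile of confidence scores on training claims."""
    return float(np.percentile(train_scores, q))

def filter_claims(scored_claims: list[tuple[str, float]], delta: float) -> list[str]:
    """Uncertainty-aware filtering: keep only claims whose confidence reaches delta."""
    return [claim for claim, score in scored_claims if score >= delta]

if __name__ == "__main__":
    # Confidence scores of claims from a small training split (illustrative numbers).
    train_scores = [0.12, 0.35, 0.40, 0.55, 0.61, 0.72, 0.80, 0.90]
    delta = select_threshold(train_scores, q=50)  # q chosen for illustration only
    scored_claims = [("Claim A", 0.85), ("Claim B", 0.30), ("Claim C", 0.66)]
    print(delta, filter_claims(scored_claims, delta))
```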