Graph-based Uncertainty Metrics for Long-form Language Model Generations
Authors: Mingjian Jiang, Yangjun Ruan, Prasanna Sattigeri, Salim Roukos, Tatsunori B. Hashimoto
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In Sec. 6.1, we benchmark our proposed graph-based metrics and existing methods adapted for claim-wise uncertainty on two long-form factuality datasets, demonstrating the effectiveness of closeness centrality as a reliable uncertainty measure. In Sec. 6.2, we show that applying the closeness centrality metric with our uncertainty-aware decoding framework demonstrates the best informativeness-factuality trade-off for long-form generation and empirically analyze the impact of each component. Additionally, Sec. 6.3 presents an ablation study to investigate the factors contributing to the performance of closeness centrality and provide insights for interpretation. |
| Researcher Affiliation | Collaboration | Mingjian Jiang, Stanford University (jiangm@stanford.edu); Yangjun Ruan, Stanford University (ryoungj@stanford.edu); Prasanna Sattigeri, IBM Research (psattig@us.ibm.com); Salim Roukos, IBM Research (roukos@us.ibm.com); Tatsunori Hashimoto, Stanford University (thashim@stanford.edu) |
| Pseudocode | No | The paper describes the procedures for graph construction and uncertainty-aware decoding in text, but does not include formal pseudocode blocks or algorithm listings. |
| Open Source Code | Yes | Our code is available at https://github.com/Mingjianjiang-1/Graph-based-Uncertainty. |
| Open Datasets | Yes | We evaluated the different uncertainty estimation methods on two challenging datasets, FActScore [12] and (long-form) PopQA [22]... We also evaluated different methods on the Natural Questions dataset [38] |
| Dataset Splits | Yes | The threshold δ can be selected either heuristically in an unsupervised manner or based on a percentile q over training data claims, where the latter approach provides correctness guarantees with factuality probability levels determined by q [16]. Following the supervised approach, we determine δ by selecting a percentile q and computing the corresponding threshold on a small set of training data. (A minimal sketch of this percentile-based selection appears after the table.) |
| Hardware Specification | Yes | We use two 80GB A100 GPUs to run inference for Llama-3-70B-Instruct. |
| Software Dependencies | No | All the graph metrics are calculated by calling the corresponding centrality function in the networkx package. (The package is named, but no version number is given, and no other software dependencies are listed with versions; a hedged networkx sketch appears after the table.) |
| Experiment Setup | Yes | Our experiments are conducted on the three most capable LLMs to date (as of June 2024): GPT-3.5-turbo, GPT-4 [2], and Llama-3-70B-Instruct [39]. We used the same LLM to sample responses and construct a semantic entailment graph for uncertainty estimation as described in Sec. 4.1. The graph construction is set up as follows: To construct the set of claims C, we used a greedily decoded sample (temperature t = 0) and 4 samples with temperature t = 1 as R_N. To construct the set of responses R in the graph, we used \|R\| = 5 or \|R\| = 10 samples, where we included those for obtaining the claims and sampled additional ones with temperature t = 1 if needed. |
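As a concrete illustration of the networkx usage noted in the Software Dependencies row, here is a minimal sketch, assuming a bipartite claim-response graph where an edge means a response entails a claim (per the graph construction quoted in the Experiment Setup row). The function `closeness_centrality_scores`, the `entails` callable, and the node-naming scheme are illustrative assumptions, not the authors' implementation; see their repository for the actual code.

```python
# Hypothetical sketch: claim-wise uncertainty via closeness centrality
# on a bipartite claim-response entailment graph. All names here are
# illustrative, not taken from the authors' repository.
import networkx as nx

def closeness_centrality_scores(claims, responses, entails):
    """Score each claim by its closeness centrality.

    claims:    list of claim strings extracted from sampled generations
    responses: list of sampled response strings
    entails:   callable (response, claim) -> bool, e.g. an LLM or NLI
               entailment judgment (assumed interface)
    """
    G = nx.Graph()
    G.add_nodes_from(f"c{i}" for i in range(len(claims)))
    G.add_nodes_from(f"r{j}" for j in range(len(responses)))
    for i, claim in enumerate(claims):
        for j, response in enumerate(responses):
            if entails(response, claim):
                G.add_edge(f"c{i}", f"r{j}")
    # networkx computes closeness centrality for every node; keep the
    # claim nodes' scores as the claim-wise confidence estimates.
    centrality = nx.closeness_centrality(G)
    return [centrality[f"c{i}"] for i in range(len(claims))]
```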
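Similarly, the supervised threshold selection quoted in the Dataset Splits row can be sketched as below: pick a percentile q over claim scores from a small training set, then keep only claims whose score clears the resulting threshold δ. The names `select_threshold` and `filter_claims` are hypothetical, and the filtering step is only a schematic reading of the paper's uncertainty-aware decoding, not the authors' procedure.

```python
# Minimal sketch, assuming claim-level centrality scores as input.
import numpy as np

def select_threshold(train_scores, q):
    """Return delta as the q-th percentile of training-claim scores."""
    return float(np.percentile(train_scores, q))

def filter_claims(claims, scores, delta):
    """Uncertainty-aware filtering: drop claims scored below delta."""
    return [claim for claim, score in zip(claims, scores) if score >= delta]

# Example usage (train_scores, claims, scores are assumed inputs):
# delta = select_threshold(train_scores, q=20)
# kept = filter_claims(claims, scores, delta)
```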