Language Models with Conformal Factuality Guarantees
Authors: Christopher Mohri, Tatsunori Hashimoto
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Evaluations of our approach on closed book QA (FActScore, Natural Questions) and reasoning tasks (MATH) show that our approach can provide 80-90% correctness guarantees while retaining the majority of the LM's original output. |
| Researcher Affiliation | Academia | Department of Computer Science, Stanford University. |
| Pseudocode | Yes | Algorithm 1 α-conformal-factuality algorithm, Algorithm 2 Score computation with Assumption 4.1, Algorithm 3 Score computation without Assumption 4.1, Algorithm 4 Inference with Ft via sub-claims, Algorithm 5 α-conformal-partial-factuality algorithm (a calibration sketch follows the table). |
| Open Source Code | Yes | We release our code at https://github.com/tatsu-lab/conformal-factual-lm. |
| Open Datasets | Yes | FActScore (Min et al., 2023), Natural Questions (NQ) (Kwiatkowski et al., 2019), MATH (Hendrycks et al., 2021). |
| Dataset Splits | Yes | We randomly split our datasets into 25 calibration examples and 25 test examples 1000 times, fitting a threshold on the calibration set and measuring the empirical factuality on the test set (see the calibration sketch after the table). |
| Hardware Specification | No | The paper mentions using GPT-4 (OpenAI, 2023) outputs, implying the use of an API, but does not specify any hardware used for the authors' own computations (e.g., for running the conformal prediction algorithm or sub-claim scoring functions locally). |
| Software Dependencies | No | The paper mentions using 'GPT-4' and setting 'max_tokens to 1000 and temperature to 0.0', but does not provide version numbers for any software dependencies such as Python, PyTorch, or other libraries used in the implementation. |
| Experiment Setup | Yes | We use the gpt-4 endpoint, set max_tokens to 1000 and temperature to 0.0 (an illustrative API sketch follows the table). |
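
For context on the Pseudocode and Dataset Splits rows, here is a minimal, hypothetical sketch of the split-conformal calibration loop those rows describe. It assumes generic per-example nonconformity scores (each playing the role of the minimal threshold that makes an example's retained sub-claims factual); the paper's actual score functions (Algorithms 2-3) and sub-claim decomposition are more involved, so this is illustrative rather than the authors' implementation:

```python
import numpy as np

def conformal_threshold(cal_scores: np.ndarray, alpha: float) -> float:
    # Standard split-conformal quantile with the (n + 1) finite-sample
    # correction; guarantees P(test score <= tau) >= 1 - alpha.
    n = len(cal_scores)
    q = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return float(np.quantile(cal_scores, q, method="higher"))

rng = np.random.default_rng(0)
scores = rng.uniform(size=50)  # placeholder scores; one per example

# Protocol quoted in the Dataset Splits row: split 50 examples into
# 25 calibration / 25 test, 1000 times, re-fitting tau each time.
coverage = []
for _ in range(1000):
    perm = rng.permutation(len(scores))
    cal, test = scores[perm[:25]], scores[perm[25:]]
    tau = conformal_threshold(cal, alpha=0.2)  # target 80% factuality
    coverage.append(np.mean(test <= tau))      # empirical factuality

print(f"mean empirical factuality: {np.mean(coverage):.3f}")  # ~>= 0.80
```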
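
For the Experiment Setup row, a hedged sketch of the quoted decoding settings using the current openai Python client. The paper predates this client, so the call shape and the example prompt are assumptions, not the authors' code:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Settings quoted in the Experiment Setup row: the gpt-4 endpoint with
# max_tokens=1000 and temperature=0.0 (near-deterministic decoding).
response = client.chat.completions.create(
    model="gpt-4",
    max_tokens=1000,
    temperature=0.0,
    messages=[
        # Hypothetical prompt; the paper's FActScore task asks for biographies.
        {"role": "user", "content": "Write a short biography of Marie Curie."}
    ],
)
print(response.choices[0].message.content)
```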