Language Models with Conformal Factuality Guarantees

Authors: Christopher Mohri, Tatsunori Hashimoto

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Evaluations of our approach on closed book QA (FActScore, Natural Questions) and reasoning tasks (MATH) show that our approach can provide 80-90% correctness guarantees while retaining the majority of the LM's original output.
Researcher Affiliation | Academia | Department of Computer Science, Stanford University.
Pseudocode | Yes | Algorithm 1: α-conformal-factuality algorithm; Algorithm 2: Score computation with Assumption 4.1; Algorithm 3: Score computation without Assumption 4.1; Algorithm 4: Inference with Ft via sub-claims; Algorithm 5: α-conformal-partial-factuality algorithm.
Open Source Code | Yes | We release our code at https://github.com/tatsu-lab/conformal-factual-lm.
Open Datasets | Yes | FActScore (Min et al., 2023); Natural Questions (NQ) (Kwiatkowski et al., 2019); MATH (Hendrycks et al., 2021).
Dataset Splits | Yes | We randomly split our datasets into 25 calibration examples and 25 test examples 1000 times, fitting a threshold on the calibration set and measuring the empirical factuality on the test set. (See the calibration sketch below.)
Hardware Specification | No | The paper mentions using GPT-4 (OpenAI, 2023) outputs, implying the use of an API, but does not specify any hardware used for the authors' own computations (e.g., for running the conformal prediction algorithm or sub-claim scoring functions locally).
Software Dependencies | No | The paper mentions using 'GPT-4' and setting 'max_tokens to 1000 and temperature to 0.0', but does not provide specific version numbers for any software dependencies such as Python, PyTorch, or other libraries used for implementation.
Experiment Setup | Yes | We use the gpt-4 endpoint, set max_tokens to 1000 and temperature to 0.0. (See the API call sketch below.)
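The Dataset Splits row describes a split-conformal evaluation loop: repeatedly draw 25 calibration and 25 test examples, fit a score threshold on the calibration split, and measure the empirical factuality of the filtered outputs on the test split. The sketch below illustrates that loop under stated assumptions; the data layout, the precomputed per-example conformal scores, and the is_factual checker are hypothetical placeholders rather than the authors' released implementation (see the linked repository for that).

    import numpy as np

    def calibrate_threshold(cal_scores, alpha=0.1):
        # Conformal threshold: the ceil((n + 1) * (1 - alpha))-th smallest
        # calibration score, i.e. a finite-sample-corrected (1 - alpha) quantile.
        # The score definition itself (the paper's Algorithms 2-3) is assumed
        # to be precomputed upstream.
        n = len(cal_scores)
        k = min(n, int(np.ceil((n + 1) * (1 - alpha))))
        return np.sort(cal_scores)[k - 1]

    def filtered_factuality(test_outputs, threshold, is_factual):
        # Keep only sub-claims scoring strictly above the threshold, then check
        # whether every retained sub-claim in each output is factual.
        kept = [[claim for claim, score in output if score > threshold]
                for output in test_outputs]
        return float(np.mean([is_factual(claims) for claims in kept]))

    def repeated_split_evaluation(outputs, cal_scores, is_factual,
                                  alpha=0.1, n_cal=25, n_test=25,
                                  n_trials=1000, seed=0):
        # outputs[i]   : list of (sub-claim, confidence score) pairs for example i
        # cal_scores[i]: per-example conformal score for example i (precomputed)
        cal_scores = np.asarray(cal_scores)
        rng = np.random.default_rng(seed)
        factuality = []
        for _ in range(n_trials):
            idx = rng.permutation(len(outputs))
            cal, test = idx[:n_cal], idx[n_cal:n_cal + n_test]
            tau = calibrate_threshold(cal_scores[cal], alpha)
            factuality.append(filtered_factuality(
                [outputs[i] for i in test], tau, is_factual))
        return float(np.mean(factuality))

The (n + 1) correction and strict score > threshold filtering are standard split-conformal choices; the paper's own sub-claim scoring and filtering procedures are the algorithms listed in the Pseudocode row.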
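The Experiment Setup row reports querying the gpt-4 endpoint with max_tokens set to 1000 and temperature 0.0. Below is a minimal, hypothetical call sketch using the current OpenAI Python SDK; the SDK version, prompt, and helper name are assumptions, and the released repository may use a different client interface.

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def query_gpt4(prompt: str) -> str:
        # Deterministic generation with the settings reported in the paper:
        # max_tokens=1000, temperature=0.0.
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=1000,
            temperature=0.0,
        )
        return response.choices[0].message.content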