reproducibilityindex.ai

Language Models with Conformal Factuality Guarantees

Authors: Christopher Mohri, Tatsunori Hashimoto

ICML 2024

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Evaluations of our approach on closed book QA (FActScore, Natural Questions) and reasoning tasks (MATH) show that our approach can provide 80-90% correctness guarantees while retaining the majority of the LM's original output.
Researcher Affiliation	Academia	Department of Computer Science, Stanford University.
Pseudocode	Yes	Algorithm 1: α-conformal-factuality algorithm; Algorithm 2: Score computation with Assumption 4.1; Algorithm 3: Score computation without Assumption 4.1; Algorithm 4: Inference with Ft via sub-claims; Algorithm 5: α-conformal-partial-factuality algorithm
Open Source Code	Yes	We release our code at https://github.com/tatsu-lab/conformal-factual-lm.
Open Datasets	Yes	FActScore (Min et al., 2023), Natural Questions (NQ) (Kwiatkowski et al., 2019), MATH (Hendrycks et al., 2021).
Dataset Splits	Yes	We randomly split our datasets into 25 calibration examples and 25 test examples 1000 times, fitting a threshold on the calibration set and measuring the empirical factuality on the test set.
Hardware Specification	No	The paper mentions using GPT-4 (OpenAI, 2023) outputs, implying the use of an API, but does not specify any hardware used for the authors' own computations (e.g., for running the conformal prediction algorithm or sub-claim scoring functions locally).
Software Dependencies	No	The paper mentions using 'GPT-4' and setting 'max_tokens to 1000 and temperature to 0.0', but does not provide specific version numbers for any software dependencies like Python, PyTorch, or other libraries used for implementation.
Experiment Setup	Yes	We use the gpt-4 endpoint, set max_tokens to 1000 and temperature to 0.0.
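
The evaluation protocol quoted under Dataset Splits (repeated 25/25 random splits, a threshold fit on the calibration half, empirical factuality measured on the test half) can be sketched roughly as follows. This is a minimal illustration, not the authors' released code: it assumes each example has already been reduced to a single conformal score, that an output counts as factual when its score is at most the fitted threshold, and that the threshold is the standard split-conformal quantile. The function names and the synthetic scores are hypothetical.

```python
import math
import random


def conformal_threshold(cal_scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score.

    Assumption: this is the usual split-conformal quantile; if the
    calibration set is too small for the requested alpha, return infinity.
    """
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return sorted(cal_scores)[k - 1]


def repeated_split_factuality(scores, alpha, n_cal=25, n_test=25,
                              n_trials=1000, seed=0):
    """Average test-set factuality over repeated random calibration/test splits.

    Assumption: `scores` holds one hypothetical per-example conformal score,
    and a test output is factual when its score is <= the fitted threshold.
    """
    rng = random.Random(seed)
    rates = []
    for _ in range(n_trials):
        # Draw a fresh random split without replacement.
        sample = rng.sample(scores, n_cal + n_test)
        cal, test = sample[:n_cal], sample[n_cal:]
        q = conformal_threshold(cal, alpha)
        rates.append(sum(s <= q for s in test) / n_test)
    return sum(rates) / n_trials


# Synthetic scores for illustration only; real scores would come from
# the sub-claim scoring functions and factuality annotations.
_rng = random.Random(1)
scores = [_rng.random() for _ in range(50)]
avg_factuality = repeated_split_factuality(scores, alpha=0.1)
print(f"average empirical factuality: {avg_factuality:.3f}")
```

Under these assumptions the average should land at or slightly above the 1 - α target, since the splits are exchangeable by construction.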