Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Language Models with Conformal Factuality Guarantees
Authors: Christopher Mohri, Tatsunori Hashimoto
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Evaluations of our approach on closed book QA (FActScore, Natural Questions) and reasoning tasks (MATH) show that our approach can provide 80-90% correctness guarantees while retaining the majority of the LM's original output. |
| Researcher Affiliation | Academia | 1Department of Computer Science, Stanford University. |
| Pseudocode | Yes | Algorithm 1 α-conformal-factuality algorithm, Algorithm 2 Score computation with Assumption 4.1, Algorithm 3 Score computation without Assumption 4.1, Algorithm 4 Inference with Ft via sub-claims, Algorithm 5 α-conformal-partial-factuality algorithm |
| Open Source Code | Yes | We release our code at https://github.com/tatsu-lab/conformal-factual-lm. |
| Open Datasets | Yes | FActScore (Min et al., 2023), Natural Questions (NQ) (Kwiatkowski et al., 2019), MATH (Hendrycks et al., 2021). |
| Dataset Splits | Yes | we randomly split our datasets into 25 calibration examples and 25 test examples 1000 times, fitting a threshold on the calibration set and measuring the empirical factuality on the test set. |
| Hardware Specification | No | The paper mentions using GPT-4 (OpenAI, 2023) outputs, implying the use of an API, but does not specify any hardware used for their own computations (e.g., for running the conformal prediction algorithm or sub-claim scoring functions locally). |
| Software Dependencies | No | The paper mentions using 'GPT-4' and setting 'max_tokens to 1000 and temperature to 0.0', but does not provide specific version numbers for any software dependencies like Python, PyTorch, or other libraries used for implementation. |
| Experiment Setup | Yes | We use the gpt-4 endpoint, set max_tokens to 1000 and temperature to 0.0. |
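
The Dataset Splits row describes a repeated evaluation: split the data into 25 calibration and 25 test examples 1000 times, fit a threshold on the calibration set, and measure empirical factuality on the test set. The following is a minimal sketch of that loop using a generic split conformal quantile. It is not the authors' released code. It assumes each example is summarized by one scalar score (hypothetically, the smallest filtering threshold at which the retained output is factual), and the names `conformal_threshold` and `repeated_split_eval` and the synthetic scores are illustrative only.

```python
import math
import random


def conformal_threshold(cal_scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score.

    Assumption: a test output counts as factual when its own score is at
    most this threshold. Returns infinity when the calibration set is too
    small to support the requested alpha.
    """
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return sorted(cal_scores)[k - 1]


def repeated_split_eval(scores, alpha, n_cal=25, n_test=25, n_trials=1000, seed=0):
    """Average test-set factuality rate over repeated random splits."""
    rng = random.Random(seed)
    rates = []
    for _ in range(n_trials):
        # Draw a random calibration/test partition without replacement.
        sample = rng.sample(scores, n_cal + n_test)
        cal, test = sample[:n_cal], sample[n_cal:]
        tau = conformal_threshold(cal, alpha)
        # Fraction of test examples whose score falls under the threshold.
        rates.append(sum(s <= tau for s in test) / n_test)
    return sum(rates) / n_trials


if __name__ == "__main__":
    # Synthetic placeholder scores; real scores would come from annotated outputs.
    data_rng = random.Random(1)
    synthetic_scores = [data_rng.random() for _ in range(50)]
    print(repeated_split_eval(synthetic_scores, alpha=0.1))
```

Under exchangeability, the average rate should land at or slightly above 1 - alpha, which is the kind of check the quoted split procedure is meant to support.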