Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
FactTest: Factuality Testing in Large Language Models with Finite-Sample and Distribution-Free Guarantees
Authors: Fan Nie, Xiaotian Hou, Shuhang Lin, James Zou, Huaxiu Yao, Linjun Zhang
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that FACTTEST effectively detects hallucinations and enables LLMs to abstain from answering unknown questions, leading to an over 40% accuracy improvement. Code is here. [...] Evaluation of our proposed framework on question-answering (QA) benchmarks demonstrates several key advantages of our approach: (1) it consistently outperforms base models by a substantial margin without requiring additional training or external data sources; (2) it surpasses fine-tuned baselines by a large margin while utilizing only half of the training data; and (3) it maintains superior performance on out-of-distribution testing data. |
| Researcher Affiliation | Academia | Fan Nie 1 Xiaotian Hou 2 Shuhang Lin 2 James Zou 1 Huaxiu Yao 3 Linjun Zhang 2 1Stanford University, USA 2Rutgers University, USA 3The University of North Carolina at Chapel Hill, USA. Correspondence to: Linjun Zhang <EMAIL>. |
| Pseudocode | No | The paper includes mathematical formulations, equations, and descriptions of procedures, but it does not contain clearly labeled pseudocode or algorithm blocks in a structured, code-like format. For example, Section 2.2 describes the calibration dataset construction and correctness predictor in narrative text and equations. |
| Open Source Code | Yes | Code is here. |
| Open Datasets | Yes | We conduct experiments on knowledge-extensive QA tasks, categorized into two generation tasks. More details are provided in Appendix C.2. Question-Answering: Given a question, the model directly predicts its answer. We include ParaRel (Elazar et al., 2021) and HotpotQA (Yang et al., 2018). [...] Multiple-Choice: Given a question with several choices, the model chooses one option among A, B and C. We include WiCE (Kamoi et al., 2023) and FEVER (Thorne et al., 2018a). |
| Dataset Splits | Yes | We randomly split our training dataset, allocating half for instruction-tuning and the remaining half to construct the calibration dataset. [...] ParaRel into two subsets: the first 15 domains serve as in-domain data, and the remaining 16 domains as out-of-domain data (13974 samples). The in-domain data is further split equally into training and testing sets, consisting of 5575 and 5584 samples. [...] We randomly split 1000 samples from ParaRel-OOD as validation samples and the remaining 12k samples as testing samples. |
| Hardware Specification | Yes | All experiments are implemented on 4 Nvidia H100-80GB GPUs. |
| Software Dependencies | No | We follow Zhang et al. (2023) to use LMFlow (Diao et al., 2023) to conduct instruction tuning, setting epoch to 1 and learning rate to 2e-5. While LMFlow is mentioned as a tool, a specific version number for LMFlow or other key software components like Python or PyTorch is not provided. |
| Experiment Setup | Yes | The temperature is set to 0 for evaluation and 0.7 for calculating score functions. We follow Zhang et al. (2023) to use LMFlow (Diao et al., 2023) to conduct instruction tuning, setting epoch to 1 and learning rate to 2e-5. All experiments are implemented on 4 Nvidia H100-80GB GPUs. [...] We set the default value of γ as 90%. |
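The data-splitting protocol quoted in the table (half of the training data for instruction tuning, half for building the calibration dataset) can be sketched as below. This is a minimal illustration of the described split, not the authors' released code; the function name and seed handling are assumptions.

```python
import random

def halve_training_data(train_data, seed=0):
    """Randomly split training data in half: one half for instruction
    tuning, the other for constructing the calibration dataset.
    Hypothetical helper illustrating the split described in the paper."""
    rng = random.Random(seed)
    shuffled = list(train_data)
    rng.shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]

# Example: split 100 dummy samples into two disjoint halves of 50.
instruction_half, calibration_half = halve_training_data(range(100))
print(len(instruction_half), len(calibration_half))
```

With an even-sized input the two halves are disjoint and equal in size; the actual in-domain ParaRel split quoted above (5575 vs. 5584 samples) shows the paper's halves need not be exactly equal.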