Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

FactTest: Factuality Testing in Large Language Models with Finite-Sample and Distribution-Free Guarantees

Authors: Fan Nie, Xiaotian Hou, Shuhang Lin, James Zou, Huaxiu Yao, Linjun Zhang

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive experiments demonstrate that FACTTEST effectively detects hallucinations and enables LLMs to abstain from answering unknown questions, leading to an over 40% accuracy improvement. Code is here. [...] Evaluation of our proposed framework on question-answering (QA) benchmarks demonstrates several key advantages of our approach: (1) it consistently outperforms base models by a substantial margin without requiring additional training or external data sources; (2) it surpasses fine-tuned baselines by a large margin while utilizing only half of the training data; and (3) it maintains superior performance on out-of-distribution testing data.
Researcher Affiliation Academia Fan Nie (1), Xiaotian Hou (2), Shuhang Lin (2), James Zou (1), Huaxiu Yao (3), Linjun Zhang (2). (1) Stanford University, USA; (2) Rutgers University, USA; (3) The University of North Carolina at Chapel Hill, USA. Correspondence to: Linjun Zhang <EMAIL>.
Pseudocode No The paper includes mathematical formulations, equations, and descriptions of procedures, but it does not contain clearly labeled pseudocode or algorithm blocks in a structured, code-like format. For example, Section 2.2 describes the calibration dataset construction and correctness predictor in narrative text and equations.
Open Source Code Yes Code is here.
Open Datasets Yes We conduct experiments on knowledge-extensive QA tasks, categorized into two generation tasks. More details are provided in Appendix C.2. Question-Answering: Given a question, the model directly predicts its answer. We include ParaRel (Elazar et al., 2021) and HotpotQA (Yang et al., 2018). [...] Multiple-Choice: Given a question with several choices, the model chooses one option among A, B and C. We include WiCE (Kamoi et al., 2023) and FEVER (Thorne et al., 2018a).
Dataset Splits Yes We randomly split our training dataset, allocating half for instruction-tuning and the remaining half to construct the calibration dataset. [...] ParaRel into two subsets: the first 15 domains serve as in-domain data, and the remaining 16 domains as out-of-domain data (13974 samples). The in-domain data is further split equally into training and testing sets, consisting of 5575 and 5584 samples. [...] We randomly split 1000 samples from ParaRel-OOD as validation samples and the remaining 12k samples as testing samples.
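The half/half split described in this row (instruction-tuning set vs. calibration set) can be sketched as follows. This is an illustrative reconstruction only, not the paper's actual splitting code; the function name and seed are assumptions.

```python
import random

def split_calibration(train_samples, seed=0):
    """Illustrative half/half split of the training data into an
    instruction-tuning set and a calibration set.
    (Sketch only; FactTest's released code may split differently.)"""
    rng = random.Random(seed)  # fixed seed for a reproducible split
    samples = list(train_samples)
    rng.shuffle(samples)
    mid = len(samples) // 2
    # First half -> instruction tuning, second half -> calibration.
    return samples[:mid], samples[mid:]

tuning_set, calibration_set = split_calibration(range(100))
```

The same pattern, applied once per dataset, also covers the ParaRel in-domain train/test split quoted above.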
Hardware Specification Yes All experiments are implemented on 4 Nvidia H100-80GB GPUs.
Software Dependencies No We follow Zhang et al. (2023) to use LMFlow (Diao et al., 2023) to conduct instruction tuning, setting epoch to 1 and learning rate to 2e-5. While LMFlow is mentioned as a tool, a specific version number for LMFlow or other key software components like Python or PyTorch is not provided.
Experiment Setup Yes The temperature is set to 0 for evaluation and 0.7 for calculating score functions. We follow Zhang et al. (2023) to use LMFlow (Diao et al., 2023) to conduct instruction tuning, setting epoch to 1 and learning rate to 2e-5. All experiments are implemented on 4 Nvidia H100-80GB GPUs. All experiments are implemented on 4 Nvidia H100-80GB GPUs. [...] We set the default value of γ as 90%.
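To make the temperature-0.7 score function and the γ = 90% threshold concrete, here is a toy frequency-based abstention rule: sample several answers, score the question by majority agreement, and abstain below γ. This is an illustrative sketch only; FactTest's actual score functions and its finite-sample, distribution-free calibration are defined in the paper, and the function name is an assumption.

```python
from collections import Counter

def abstention_decision(sampled_answers, gamma=0.90):
    """Toy score: fraction of sampled answers (drawn at temperature 0.7)
    that agree with the majority answer. Answer when score >= gamma,
    otherwise abstain (return None).
    (Illustrative only; not the paper's calibrated predictor.)"""
    counts = Counter(sampled_answers)
    answer, freq = counts.most_common(1)[0]
    score = freq / len(sampled_answers)
    return answer if score >= gamma else None
```

With γ = 0.90, ten samples must agree at least nine times for the model to answer; otherwise it abstains, mirroring the abstention behavior the paper reports.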