Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

FACT: Mitigating Inconsistent Hallucinations in LLMs via Fact-Driven Alternating Code-Text Training

Authors: Xinxin You, Qixin Sun, Chenwei Yan, Xiao Zhang, Chen Ning, Xiangling Fu, Si Liu, Guoping Hu, Shijin Wang, Ji Wu, Xien Liu

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments show that with only a small subset of Wiki-40B-en for training, FACT reduces inconsistent hallucinations by 2.7%-8.0% and improves overall performance by 2.5%-6.1% in three leading LLMs and four diverse datasets covering QA and summarization tasks.
Researcher Affiliation | Collaboration | (1) Department of Electronic Engineering, Tsinghua University, Beijing, China; (2) School of Artificial Intelligence, Beihang University, Beijing, China; (3) School of Information Technology and Management, University of International Business and Economics, Beijing, China; (4) ByteDance, Beijing, China; (5) School of Computer Science, Beijing University of Posts and Telecommunications, Beijing, China; (6) iFLYTEK Research, Hefei, China; (7) College of AI, Tsinghua University, Beijing, China
Pseudocode | No | The paper describes methods and processes in narrative text and illustrative figures (e.g., Figure 1, Figure 2), but does not contain explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | We use publicly available datasets, and we will provide an anonymized link to the code repository.
Open Datasets | Yes | We randomly sampled 10,000 entries from the Wiki-40B-en[24] dataset... For evaluation, we selected benchmark datasets for both text summarization and question answering (QA). Specifically, we used CNN/Daily Mail[35] and SAMSum[36] for summarization, and SQuAD v2[37, 38] together with HaluEval[39] for QA.
Dataset Splits | No | We randomly sampled 10,000 entries from the Wiki-40B-en[24] dataset, using only the first paragraph of each entry. After fact-based filtering, 53.74% (5,374) were retained for alternating training, while 27.85% and 18.41% were labeled as non-factual and invalid... For fair comparison, we also randomly sampled 10,748 instances from each dataset for training the baseline methods.
Hardware Specification | Yes | For the Base Model and Prompting variants of each backbone, inference was conducted using LLaMA-Factory on four NVIDIA GeForce RTX 4090 GPUs (24GB each)... FACT and SFT were also trained on the same four RTX 4090 GPUs with LLaMA-Factory
Software Dependencies | No | inference was conducted using LLaMA-Factory... For the SymbCoT and Lookback baselines, we used the official implementations and configurations, replacing their backbone models with LLaMA, Mistral, and Qwen.
Experiment Setup | Yes | FACT and SFT were also trained on the same four RTX 4090 GPUs with LLaMA-Factory for three epochs, with a learning rate of 1 × 10⁻⁴, batch size 32, and LoRA (rank 8) for consistent acceleration.
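
The figures quoted in the Dataset Splits and Experiment Setup rows can be sanity-checked with a few lines of Python. This is a minimal sketch, not code from the paper: the names SAMPLED, SHARES, REPORTED_SETUP, and counts_from_shares are hypothetical, and the values are copied from the quoted responses above. Reading the learning rate as 1e-4 is an assumption about the garbled figure in the extraction. The check that 10,748 is exactly twice 5,374 is arithmetic only; the quoted text does not say why the baseline sample has that size.

```python
# Values copied from the quoted rows above; all names are illustrative.
SAMPLED = 10_000  # Wiki-40B-en entries sampled before filtering

# Shares reported after fact-based filtering.
SHARES = {"retained": 0.5374, "non_factual": 0.2785, "invalid": 0.1841}

# Training settings as quoted; keys are hypothetical, not LLaMA-Factory options.
REPORTED_SETUP = {
    "gpus": 4,              # NVIDIA GeForce RTX 4090, 24GB each
    "epochs": 3,
    "learning_rate": 1e-4,  # assumed reading of the reported rate
    "batch_size": 32,
    "lora_rank": 8,
}

BASELINE_INSTANCES = 10_748  # per-dataset sample used for baseline training


def counts_from_shares(total, shares):
    """Convert fractional shares into whole-entry counts, rounding away float noise."""
    return {name: round(total * share) for name, share in shares.items()}


counts = counts_from_shares(SAMPLED, SHARES)

# True if the three categories account for every sampled entry.
shares_cover_sample = sum(counts.values()) == SAMPLED

# Ratio of baseline instances to retained entries (an observation, not an explanation).
baseline_ratio = BASELINE_INSTANCES / counts["retained"]
```

Under these inputs the three categories account for all sampled entries, which is consistent with the percentages being shares of the same 10,000-entry sample.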