Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Do Automatic Factuality Metrics Measure Factuality? A Critical Evaluation
Authors: Sanjana Ramprasad, Byron Wallace
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we stress test a range of automatic factuality metrics, including specialized models and LLM-based prompting methods, to probe what they actually capture. Using a shallow classifier to separate easy examples for factual evaluation where surface features suffice from hard cases requiring deeper reasoning, we find that all metrics show substantial performance drops on the latter. Furthermore, some metrics are more sensitive to benign, fact-preserving edits than to factual corrections. Building on this observation, we demonstrate that most automatic factuality metrics can be gamed, i.e., their scores can be artificially inflated by appending innocuous, content-free sentences to summaries. Among the metrics tested, the LLM-based ChatGPT-DA approach is the most robust and reliable. |
| Researcher Affiliation | Academia | Sanjana Ramprasad (Northeastern University); Byron C. Wallace (Northeastern University) |
| Pseudocode | No | The paper describes methods in paragraph text and refers to model architectures (e.g., MLP, T5, RoBERTa) but does not include any explicit pseudocode or algorithm blocks with structured steps. |
| Open Source Code | No | Answer: [No] Justification: We plan to release full code and will strive to prior to the conference. |
| Open Datasets | Yes | For our analysis we use all of the above benchmarks to capture a wide range of error types. For fine-tuned model summaries, we use AggreFact for news and FacEval for dialogues. For LLM-generated summaries, we rely on LLM-AggreFact, GenAudit, and LLM-dialogue. We note that each benchmark consolidates multiple datasets and to ensure clean separation of distributions, we avoid overlapping datasets between our test and development splits. |
| Dataset Splits | Yes | Specifically, our dev set includes summaries from the AggreFact dev split, as well as XSUM and CNNDM examples from GenAudit, ensuring no overlap with test data. All remaining datasets are evaluated using their respective test splits. We provide a detailed breakdown of our dev and test splits in Appendix A. |
| Hardware Specification | No | Answer: [NA] Justification: All results include an analysis of popular and easy to run metrics appropriately cited. |
| Software Dependencies | No | The paper mentions using specific models like T5 [Raffel et al., 2020], RoBERTa [Liu, 2019], Flan-T5 [Chung et al., 2022], and GPT-4o-mini for evaluation. It also mentions BERT [Devlin, 2018] for embeddings in Appendix B. However, it does not provide specific version numbers for these software components or libraries. |
| Experiment Setup | Yes | To investigate the extent to which shallow features explain metric behavior, we train an MLP classifier to predict binary human factuality labels on a development set using only surface-level features. We then apply the trained model to an evaluation set and categorize summaries into three difficulty levels (easy, medium, and hard) based on prediction accuracy and confidence. Confidence is measured as the absolute deviation of the predicted probability from 0.5; lower values indicate greater uncertainty. ... We train an MLP classification model to predict factuality labels using shallow features, with two hidden layers of dimensions 100 and 50 with a learning rate of 0.001. We also use GPT-4o-mini to score the factual consistency of summaries based on a direct assessment (DA) prompt template from Wang et al. [2023a]. |
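
The Experiment Setup row describes bucketing summaries by the shallow classifier's accuracy and confidence, where confidence is the absolute deviation of the predicted probability from 0.5. The sketch below illustrates one plausible reading of that step. The cutoff `CONFIDENCE_THRESHOLD` and the exact bucket rule (confident and correct is easy, confident and wrong is hard, everything else is medium) are assumptions made for illustration; the quoted text does not give them, and all function names are hypothetical.

```python
from typing import List

# Hypothetical cutoff: the quoted setup does not state the threshold used.
CONFIDENCE_THRESHOLD = 0.25


def confidence(prob: float) -> float:
    """Absolute deviation of the predicted probability from 0.5."""
    return abs(prob - 0.5)


def difficulty(prob: float, label: int,
               threshold: float = CONFIDENCE_THRESHOLD) -> str:
    """Assign a difficulty bucket under an assumed rule (not the paper's)."""
    predicted = 1 if prob >= 0.5 else 0
    if confidence(prob) < threshold:
        # Low confidence counts as medium regardless of correctness.
        return "medium"
    # Confident predictions: correct ones are easy, wrong ones are hard.
    return "easy" if predicted == label else "hard"


def bucket(probs: List[float], labels: List[int]) -> List[str]:
    """Bucket each (predicted probability, human label) pair."""
    return [difficulty(p, y) for p, y in zip(probs, labels)]
```

For example, `bucket([0.9, 0.1, 0.6], [1, 1, 1])` yields one summary per bucket under these assumptions. The probabilities themselves would come from the shallow classifier; assuming scikit-learn (the source does not name a library), `MLPClassifier(hidden_layer_sizes=(100, 50), learning_rate_init=0.001)` would match the stated layer sizes and learning rate.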