Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Conformal Linguistic Calibration: Trading-off between Factuality and Specificity

Authors: Zhengping Jiang, Anqi Liu, Ben Van Durme

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We conduct three sets of experiments to validate key claims in the paper. Section 5.1 shows that by carefully controlling the LLMs to respond less specifically, we can greatly improve their factuality and maintain valid guarantee over factual errors.
Researcher Affiliation	Academia	Zhengping Jiang (Johns Hopkins University), Anqi Liu (Johns Hopkins University), Benjamin Van Durme (Johns Hopkins University)
Pseudocode	No	The paper describes the procedure in textual format in Section 4 and illustrates it with a diagram in Figure 2, but does not include a dedicated pseudocode or algorithm block.
Open Source Code	Yes	Code available at: https://github.com/zipJiang/CLC.
Open Datasets	Yes	The SimpleQA dataset [74] is released under the MIT license. We use the version released with the simple-eval repository. The Natural Question dataset [33] is released under the Apache-2.0 license. We use the open subset.
Dataset Splits	No	For our experiments, we create calibration set and test set making sure that each answer type is split evenly.
Hardware Specification	Yes	We compare the following four claim rewriters trained on 2 A100 80G
Software Dependencies	No	The paper mentions several LLMs used (e.g., Llama3-8B-Instruct, GPT-4o, Qwen2.5-72B-instruct) but does not provide specific version numbers for software dependencies like programming languages or libraries used for implementation.
Experiment Setup	Yes	"We use Llama3-8B-Instruct [15] as base model L, generating K = 100 responses per question. We then produce progressively less precise claims as described in Section 4.1, using GPT-4o while targeting multiplicities τ ∈ {20, 30, 40, 50, 60, 70, 80, 90, 100}." and "fine-tune Llama-3-8B-Instruct [15] on the synthetic data generated in Section 5.1, which consists of 2,042 instances from SimpleQA and 1,728 instances from NQ with various level of back-off generation."
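
The Dataset Splits row above is marked "No" because the quoted text only says that the calibration and test sets are built so that each answer type is split evenly. As a rough illustration of what such a split could look like, here is a minimal sketch. It is not the authors' code: the function name `split_by_answer_type`, the `answer_type` field, the 50/50 ratio, and the seed are all assumptions made for this example.

```python
import random
from collections import defaultdict


def split_by_answer_type(examples, calib_fraction=0.5, seed=0):
    """Split examples into calibration and test sets, stratified by answer type.

    Hypothetical helper: the field name "answer_type", the default fraction,
    and the seed are illustrative assumptions, not values from the paper.
    """
    rng = random.Random(seed)

    # Group examples by their answer type.
    by_type = defaultdict(list)
    for example in examples:
        by_type[example["answer_type"]].append(example)

    calib, test = [], []
    # Iterate in sorted order so the result is deterministic for a given seed.
    for answer_type in sorted(by_type):
        group = list(by_type[answer_type])
        rng.shuffle(group)
        cut = int(len(group) * calib_fraction)
        calib.extend(group[:cut])
        test.extend(group[cut:])
    return calib, test
```

With a fixed seed the split is repeatable, which is the kind of detail (ratio, seed, per-type counts) a paper would need to report for this row to be marked "Yes".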