Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Conformal Linguistic Calibration: Trading-off between Factuality and Specificity

Authors: Zhengping Jiang, Anqi Liu, Ben Van Durme

NeurIPS 2025 | Venue PDF | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | We conduct three sets of experiments to validate key claims in the paper. Section 5.1 shows that by carefully controlling the LLMs to respond less specifically, we can greatly improve their factuality and maintain valid guarantee over factual errors. |
| Researcher Affiliation | Academia | Zhengping Jiang (Johns Hopkins University); Anqi Liu (Johns Hopkins University); Benjamin Van Durme (Johns Hopkins University) |
| Pseudocode | No | The paper describes the procedure in textual format in Section 4 and illustrates it with a diagram in Figure 2, but does not include a dedicated pseudocode or algorithm block. |
| Open Source Code | Yes | Code available at: https://github.com/zipJiang/CLC. |
| Open Datasets | Yes | The SimpleQA dataset [74] is released under the MIT license. We use the version released with the simple-eval repository. The Natural Question dataset [33] is released under the Apache-2.0 license. We use the open subset. |
| Dataset Splits | No | For our experiments, we create calibration set and test set making sure that each answer type is split evenly. |
| Hardware Specification | Yes | We compare the following four claim rewriters trained on 2 A100 80G |
| Software Dependencies | No | The paper mentions several LLMs used (e.g., Llama3-8B-Instruct, GPT-4o, Qwen2.5-72B-instruct) but does not provide specific version numbers for software dependencies like programming languages or libraries used for implementation. |
| Experiment Setup | Yes | "We use Llama3-8B-Instruct [15] as base model L, generating K = 100 responses per question. We then produce progressively less precise claims as described in Section 4.1, using GPT-4o while targeting multiplicities τ ∈ {20, 30, 40, 50, 60, 70, 80, 90, 100}." and "fine-tune Llama-3-8B-Instruct [15] on the synthetic data generated in Section 5.1, which consists of 2,042 instances from SimpleQA and 1,728 instances from NQ with various level of back-off generation." |
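
The Dataset Splits entry above quotes the paper as creating a calibration set and a test set in which each answer type is split evenly, without reporting ratios or counts. As a rough illustration of what such a split could look like, here is a minimal Python sketch. It is not the authors' code: the `answer_type` field name, the 50/50 fraction, the seed, and the toy examples are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def stratified_split(examples, key, calib_fraction=0.5, seed=0):
    """Split examples into (calibration, test), dividing each group separately.

    `key` maps an example to its group label (here: a hypothetical answer type).
    `calib_fraction` and `seed` are illustrative defaults, not values from the paper.
    """
    rng = random.Random(seed)

    # Bucket the examples by group label.
    groups = defaultdict(list)
    for ex in examples:
        groups[key(ex)].append(ex)

    calib, test = [], []
    # Iterate over labels in sorted order so the result is deterministic.
    for label in sorted(groups):
        members = list(groups[label])
        rng.shuffle(members)
        cut = int(len(members) * calib_fraction)
        calib.extend(members[:cut])
        test.extend(members[cut:])
    return calib, test


# Toy data: 4 questions with a "date" answer and 6 with a "person" answer.
examples = [
    {"question": f"q{i}", "answer_type": "date" if i < 4 else "person"}
    for i in range(10)
]

calib, test = stratified_split(examples, key=lambda ex: ex["answer_type"])
calib_counts = Counter(ex["answer_type"] for ex in calib)
test_counts = Counter(ex["answer_type"] for ex in test)
```

With the toy data, each answer type contributes half of its questions to each side, so the calibration and test sets have the same answer-type mix. The actual proportions used in the paper are not stated in the quoted text.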