Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Conformal Language Model Reasoning with Coherent Factuality
Authors: Maxon Rubin-Toles, Maya Gambhir, Keshav Ramji, Aaron Roth, Surbhi Goel
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our method on mathematical reasoning problems from the MATH and FELM datasets and find that our algorithm consistently produces correct and substantiated orderings of claims, achieving coherent factuality across target coverage levels. |
| Researcher Affiliation | Collaboration | ¹University of Pennsylvania, ²IBM Research AI |
| Pseudocode | Yes | Algorithm 1: Subgraph Generator; Algorithm 2: Ideal Graph Assembly; Algorithm 3: Coherent Calibration |
| Open Source Code | Yes | Code is available at https://github.com/maxrubintoles/Conformal_LM_Reasoning |
| Open Datasets | Yes | Our experiments make use of the MATH dataset (Hendrycks et al., 2021), which spans various branches of mathematics. ... We also use the FELM dataset (Chen et al., 2023a). |
| Dataset Splits | Yes | For each example in the calibration and test set, the algorithm requires 8 queries comprising at most 16k tokens; for our calibration set of 50 examples, this cost less than $5.00 using GPT and less than $0.70 using Llama. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory amounts used for running experiments. It mentions using GPT and Llama models, but not the underlying hardware. |
| Software Dependencies | No | The paper mentions using specific language models (GPT-4, GPT-4o, Llama-3.1-70B-Instruct) but does not provide details on other ancillary software dependencies like programming language versions or library versions (e.g., Python 3.8, PyTorch 1.9). |
| Experiment Setup | Yes | A temperature of 1.0 was used to generate alternate responses for frequency scoring; a temperature of 0.0 was used for all other API calls. ... For each v ∈ V, define σ(v) = (1 − β)·σ_ind(v) + β·median{σ_ind(v′) : v′ is a descendant of v}, where β is a hyperparameter. ... We swept β values in [0, 1] and chose β = 0.5 for its good performance. |
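The score combination quoted in the experiment setup, σ(v) = (1 − β)·σ_ind(v) + β·median{σ_ind(v′) : v′ is a descendant of v}, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the adjacency-list DAG representation, the function names, and the fallback to σ_ind(v) when v has no descendants are all assumptions.

```python
from statistics import median

def descendants(graph, v):
    """Collect all descendants of node v in a DAG given as adjacency lists."""
    seen = set()
    stack = list(graph.get(v, []))
    while stack:
        u = stack.pop()
        if u not in seen:
            seen.add(u)
            stack.extend(graph.get(u, []))
    return seen

def combined_score(graph, sigma_ind, v, beta=0.5):
    """Blend a node's independent score with the median of its descendants'
    independent scores:
        sigma(v) = (1 - beta) * sigma_ind(v)
                   + beta * median{sigma_ind(v') : v' descendant of v}.
    Falls back to sigma_ind(v) for leaves (an assumed edge-case choice;
    the quoted text does not specify it)."""
    desc = descendants(graph, v)
    if not desc:
        return sigma_ind[v]
    return (1 - beta) * sigma_ind[v] + beta * median(sigma_ind[u] for u in desc)

# Toy claim graph: claim 0 has descendant claims 1 and 2.
graph = {0: [1, 2], 1: [], 2: []}
sigma_ind = {0: 0.8, 1: 0.4, 2: 0.6}
print(combined_score(graph, sigma_ind, 0))  # (0.5)(0.8) + (0.5)(0.5) = 0.65
```

With β = 0.5, as chosen in the paper's sweep, the node's own score and its descendants' median contribute equally.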