Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Conformal Language Model Reasoning with Coherent Factuality

Authors: Maxon Rubin-Toles, Maya Gambhir, Keshav Ramji, Aaron Roth, Surbhi Goel

ICLR 2025 | Venue PDF | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our method on mathematical reasoning problems from the MATH and FELM datasets and find that our algorithm consistently produces correct and substantiated orderings of claims, achieving coherent factuality across target coverage levels. |
| Researcher Affiliation | Collaboration | University of Pennsylvania; IBM Research AI |
| Pseudocode | Yes | Algorithm 1: Subgraph Generator; Algorithm 2: Ideal Graph Assembly; Algorithm 3: Coherent Calibration |
| Open Source Code | Yes | Code is available at https://github.com/maxrubintoles/Conformal_LM_Reasoning |
| Open Datasets | Yes | Our experiments make use of the MATH dataset (Hendrycks et al., 2021), which spans various branches of mathematics. ... We also use the FELM dataset (Chen et al., 2023a). |
| Dataset Splits | Yes | For each example in the calibration and test set, the algorithm requires 8 queries comprising at most 16k tokens; for our calibration set of 50 examples, this cost less than $5.00 using GPT and less than $0.70 using Llama. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory amounts used for running experiments. It mentions using GPT and Llama models, but not the underlying hardware. |
| Software Dependencies | No | The paper names the specific language models used (GPT-4, GPT-4o, Llama-3.1-70B-Instruct) but does not list ancillary software dependencies such as programming-language or library versions (e.g., Python 3.8, PyTorch 1.9). |
| Experiment Setup | Yes | A temperature of 1.0 was used to generate alternate responses for frequency scoring; a temperature of 0.0 was used for all other API calls. ... For each v ∈ V, define σ(v) = (1 − β)·σ_ind(v) + β·median{σ_ind(v′) : v′ is a descendant of v}, where β is a hyperparameter. ... We swept β values in [0, 1] and chose 0.5 for its good performance. |
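The descendant-aware score in the Experiment Setup row can be sketched in a few lines of Python. This is an illustrative reading of the formula only, not the paper's released implementation: the function name `aggregate_score`, the `children` adjacency map, and the `sigma_ind` dictionary are assumptions introduced here.

```python
from statistics import median

def aggregate_score(v, children, sigma_ind, beta=0.5):
    """Illustrative sketch of sigma(v) = (1 - beta) * sigma_ind(v)
    + beta * median{sigma_ind(v') : v' a descendant of v}.

    `children` maps each node to its direct children; descendants are
    gathered transitively via DFS. All names here are hypothetical.
    """
    # Collect every descendant of v (children, grandchildren, ...).
    stack, descendants = list(children.get(v, [])), []
    while stack:
        u = stack.pop()
        descendants.append(u)
        stack.extend(children.get(u, []))
    if not descendants:
        # Leaf claim: no descendants to take a median over,
        # so fall back to the individual score.
        return sigma_ind[v]
    return (1 - beta) * sigma_ind[v] + beta * median(
        sigma_ind[u] for u in descendants
    )
```

For example, a claim with individual score 0.8 whose two descendants score 0.4 and 0.6 would receive 0.5 · 0.8 + 0.5 · 0.5 = 0.65 at the swept value β = 0.5.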