Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Conformal Language Model Reasoning with Coherent Factuality
Authors: Maxon Rubin-Toles, Maya Gambhir, Keshav Ramji, Aaron Roth, Surbhi Goel
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our method on mathematical reasoning problems from the MATH and FELM datasets and find that our algorithm consistently produces correct and substantiated orderings of claims, achieving coherent factuality across target coverage levels. |
| Researcher Affiliation | Collaboration | ¹University of Pennsylvania, ²IBM Research AI |
| Pseudocode | Yes | Algorithm 1: Subgraph Generator; Algorithm 2: Ideal Graph Assembly; Algorithm 3: Coherent Calibration |
| Open Source Code | Yes | Code is available at https://github.com/maxrubintoles/Conformal_LM_Reasoning |
| Open Datasets | Yes | Our experiments make use of the MATH dataset (Hendrycks et al., 2021), which spans various branches of mathematics. ... We also use the FELM dataset (Chen et al., 2023a). |
| Dataset Splits | Yes | For each example in the calibration and test set, the algorithm requires 8 queries comprising at most 16k tokens; for our calibration set of 50 examples, this cost less than $5.00 using GPT and less than $0.70 using Llama. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory amounts used for running experiments. It mentions using GPT and Llama models, but not the underlying hardware. |
| Software Dependencies | No | The paper mentions using specific language models (GPT-4, GPT-4o, Llama-3.1-70B-Instruct) but does not provide details on other ancillary software dependencies like programming language versions or library versions (e.g., Python 3.8, PyTorch 1.9). |
| Experiment Setup | Yes | A temperature of 1.0 was used to generate alternate responses for frequency scoring; a temperature of 0.0 was used for all other API calls. ... For each v ∈ V, define σ(v) = (1 − β)·σ_ind(v) + β·median{σ_ind(v′) : v′ is a descendant of v}, where β is a hyperparameter. ... We swept β values in [0, 1] and chose β = 0.5 for its good performance. |
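The score combination quoted in the experiment setup, σ(v) = (1 − β)·σ_ind(v) + β·median{σ_ind(v′) : v′ is a descendant of v}, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the adjacency-list DAG representation, the function names, and the fallback to σ_ind(v) when v has no descendants are all assumptions.

```python
from statistics import median

def descendants(graph, v):
    """Collect all descendants of node v in a DAG given as adjacency lists."""
    seen = set()
    stack = list(graph.get(v, []))
    while stack:
        u = stack.pop()
        if u not in seen:
            seen.add(u)
            stack.extend(graph.get(u, []))
    return seen

def combined_score(graph, sigma_ind, v, beta=0.5):
    """Blend a node's independent score with the median of its descendants'
    independent scores:
        sigma(v) = (1 - beta) * sigma_ind(v)
                   + beta * median{sigma_ind(v') : v' descendant of v}.
    Falls back to sigma_ind(v) for leaves (an assumed edge-case choice;
    the quoted text does not specify it)."""
    desc = descendants(graph, v)
    if not desc:
        return sigma_ind[v]
    return (1 - beta) * sigma_ind[v] + beta * median(sigma_ind[u] for u in desc)

# Toy claim graph: claim 0 has descendant claims 1 and 2.
graph = {0: [1, 2], 1: [], 2: []}
sigma_ind = {0: 0.8, 1: 0.4, 2: 0.6}
print(combined_score(graph, sigma_ind, 0))  # (0.5)(0.8) + (0.5)(0.5) = 0.65
```

With β = 0.5, as chosen in the paper's sweep, the node's own score and its descendants' median contribute equally.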