Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Conformal Alignment: Knowing When to Trust Foundation Models with Guarantees
Authors: Yu Gui, Ying Jin, Zhimei Ren
NeurIPS 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through applications to question answering and radiology report generation, we demonstrate that our method is able to accurately identify units with trustworthy outputs via lightweight training over a moderate amount of reference data. |
| Researcher Affiliation | Academia | Yu Gui¹, Ying Jin², Zhimei Ren³; ¹Department of Statistics, University of Chicago; ²Data Science Initiative, Harvard University; ³Department of Statistics and Data Science, University of Pennsylvania |
| Pseudocode | Yes | Algorithm 1 Conformal Alignment |
| Open Source Code | Yes | The code is available at https://github.com/yugjerry/conformal-alignment. |
| Open Datasets | Yes | In this section, we implement and evaluate Conformal Alignment in question-answering tasks, where we consider a conversational question answering dataset TriviaQA [21] and a closed-book reading comprehension dataset CoQA [39]. ... Following the pipeline in Figure 1, we apply our method to (a subset of) the MIMIC-CXR dataset [20]. |
| Dataset Splits | Yes | Randomly split D into two disjoint sets: the training set Dtr and the calibration set Dcal. ... Fixing γ1, γ2 ∈ (0, 1), γ1 + γ2 < 1, we randomly sample (γ1 + γ2)·\|D\| instances without replacement from D as Dtr ... For the results presented in this section, γ1 = 0.2, γ2 = 0.5. |
| Hardware Specification | Yes | The training process takes about 10 hours on one NVIDIA A100 GPU. |
| Software Dependencies | No | The implemented OPT-13B model is from Hugging Face https://huggingface.co/facebook/opt-13b and the implemented LLaMA-2-13B-chat is from https://llama.meta.com. We utilize an off-the-shelf DeBERTa-large model [13] as the NLI classifier to calculate similarities. |
| Experiment Setup | Yes | For each QA dataset, we use language models OPT-13B [57] and LLaMA-2-13B-chat [47] without finetuning to generate an answer f(Xi) via top-p sampling for each input Xi following the default configuration. ... in specific, we use num_beams=1, do_sample=True, top_p=1.0, top_k=0, temperature=1.0. ... In particular, each raw image is resized to 224 × 224 pixels. We then fine-tune the model on a hold-out dataset with a sample size of 43,300 for 10 epochs with a batch size of 8, and other hyperparameters are set to default values. |
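
The split quoted under Dataset Splits and the sampling settings quoted under Experiment Setup can be illustrated with a minimal Python sketch. It is not taken from the authors' repository: the function name `split_reference_data`, the seed, the rounding of (γ1 + γ2)·|D| to the nearest integer, and the constant name `GENERATION_KWARGS` are assumptions made for illustration. The excerpt does not say how Dtr is divided between its γ1 and γ2 portions, so the sketch leaves Dtr whole.

```python
import random


def split_reference_data(n, gamma1=0.2, gamma2=0.5, seed=0):
    """Split indices 0..n-1 into disjoint training and calibration index lists.

    A fraction (gamma1 + gamma2) of the indices is sampled without
    replacement as the training set; the remainder is the calibration set.
    Rounding to the nearest integer is an assumption of this sketch.
    """
    if not (0 < gamma1 < 1 and 0 < gamma2 < 1 and gamma1 + gamma2 < 1):
        raise ValueError("need gamma1, gamma2 in (0, 1) with gamma1 + gamma2 < 1")
    rng = random.Random(seed)  # hypothetical seed, fixed only for repeatability
    n_train = int(round((gamma1 + gamma2) * n))
    indices = list(range(n))
    train_idx = sorted(rng.sample(indices, n_train))  # sampling without replacement
    train_set = set(train_idx)
    cal_idx = [i for i in indices if i not in train_set]
    return train_idx, cal_idx


# Sampling settings as quoted in the table. The keys match keyword arguments
# accepted by Hugging Face `generate`; collecting them in a dict is this
# sketch's choice, not something the source describes.
GENERATION_KWARGS = {
    "num_beams": 1,
    "do_sample": True,
    "top_p": 1.0,
    "top_k": 0,
    "temperature": 1.0,
}

if __name__ == "__main__":
    train_idx, cal_idx = split_reference_data(1000)
    print(len(train_idx), len(cal_idx))
```

With the reported values γ1 = 0.2 and γ2 = 0.5, 70% of the reference data goes to Dtr and the remaining 30% to Dcal.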