reproducibilityindex.ai

Conformal Alignment: Knowing When to Trust Foundation Models with Guarantees

Authors: Yu Gui, Ying Jin, Zhimei Ren

NeurIPS 2024

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Through applications to question answering and radiology report generation, we demonstrate that our method is able to accurately identify units with trustworthy outputs via lightweight training over a moderate amount of reference data.
Researcher Affiliation	Academia	Yu Gui (1), Ying Jin (2), Zhimei Ren (3). (1) Department of Statistics, University of Chicago; (2) Data Science Initiative, Harvard University; (3) Department of Statistics and Data Science, University of Pennsylvania
Pseudocode	Yes	Algorithm 1 Conformal Alignment
Open Source Code	Yes	The code is available at https://github.com/yugjerry/conformal-alignment.
Open Datasets	Yes	In this section, we implement and evaluate Conformal Alignment in question-answering tasks, where we consider a conversational question answering dataset TriviaQA [21] and a closed-book reading comprehension dataset CoQA [39]. ... Following the pipeline in Figure 1, we apply our method to (a subset of) the MIMIC-CXR dataset [20].
Dataset Splits	Yes	Randomly split D into two disjoint sets: the training set Dtr and the calibration set Dcal. ... Fixing γ1, γ2 ∈ (0, 1), γ1 + γ2 < 1, we randomly sample (γ1 + γ2)|D| instances without replacement from D as Dtr ... For the results presented in this section, γ1 = 0.2, γ2 = 0.5.
Hardware Specification	Yes	The training process takes about 10 hours on one NVIDIA A100 GPU.
Software Dependencies	No	The implemented OPT-13B model is from Hugging Face https://huggingface.co/facebook/opt-13b and the implemented LLaMA-2-13B-chat is from https://llama.meta.com. We utilize an off-the-shelf DeBERTa-large model [13] as the NLI classifier to calculate similarities.
Experiment Setup	Yes	For each QA dataset, we use language models OPT-13B [57] and LLaMA-2-13B-chat [47] without finetuning to generate an answer f(Xi) via top-p sampling for each input Xi following the default configuration. ... in specific, we use num_beams=1, do_sample=True, top_p=1.0, top_k=0, temperature=1.0. ... In particular, each raw image is resized to 224 × 224 pixels. We then fine-tune the model on a hold-out dataset with a sample size of 43,300 for 10 epochs with a batch size of 8, and other hyperparameters are set to default values.
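
The split described in the Dataset Splits row can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `split_reference_data`, the seed, and the rounding-down of (γ1 + γ2)|D| are assumptions, and treating the remaining instances as Dcal is inferred from the "two disjoint sets" wording. The excerpt elides how the γ1 and γ2 portions are used inside Dtr, so the sketch does not separate them.

```python
import random


def split_reference_data(n, gamma1=0.2, gamma2=0.5, seed=0):
    """Split indices 0..n-1 into disjoint training and calibration index lists.

    Hypothetical helper for illustration; not taken from the paper's repository.
    """
    if not (0 < gamma1 < 1 and 0 < gamma2 < 1 and gamma1 + gamma2 < 1):
        raise ValueError("need gamma1, gamma2 in (0, 1) with gamma1 + gamma2 < 1")
    rng = random.Random(seed)
    # Assumption: the training size (gamma1 + gamma2) * |D| is rounded down.
    n_train = int((gamma1 + gamma2) * n)
    # random.sample draws without replacement.
    train_idx = rng.sample(range(n), n_train)
    train_set = set(train_idx)
    # Assumption: every instance not drawn for training goes to calibration.
    cal_idx = [i for i in range(n) if i not in train_set]
    return sorted(train_idx), cal_idx


train_idx, cal_idx = split_reference_data(1000)
print(len(train_idx), len(cal_idx))
```

With the reported γ1 = 0.2 and γ2 = 0.5, roughly 70% of the reference data goes to training and the remaining 30% to calibration under these assumptions.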