Conformal Alignment: Knowing When to Trust Foundation Models with Guarantees

Authors: Yu Gui, Ying Jin, Zhimei Ren

NeurIPS 2024

Reproducibility checklist: each entry lists the variable, the assessed result, and the supporting excerpt from the paper (the LLM response).

Research Type: Experimental
  LLM response: "Through applications to question answering and radiology report generation, we demonstrate that our method is able to accurately identify units with trustworthy outputs via lightweight training over a moderate amount of reference data."

Researcher Affiliation: Academia
  LLM response: "Yu Gui, Department of Statistics, University of Chicago; Ying Jin, Data Science Initiative, Harvard University; Zhimei Ren, Department of Statistics and Data Science, University of Pennsylvania"

Pseudocode: Yes
  LLM response: "Algorithm 1: Conformal Alignment"

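For orientation, the following is a minimal Python sketch of the conformalized-selection recipe that Algorithm 1 follows: fit an alignment-score predictor on the reference training set, compute conformal p-values against calibration units whose true alignment falls below a threshold, and apply Benjamini-Hochberg to control the false discovery rate of the selection. The predictor (a random forest here), the threshold c, the level alpha, and the exact p-value form are illustrative assumptions rather than the paper's specific choices.

# Hedged sketch of the selection recipe behind Algorithm 1 (not the paper's exact code).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def conformal_alignment_select(X_tr, A_tr, X_cal, A_cal, X_test, c=0.5, alpha=0.1):
    """Return indices of test units selected as trustworthy (alignment >= c)."""
    # 1) Lightweight alignment predictor g fitted on the reference training set D_tr.
    g = RandomForestRegressor().fit(X_tr, A_tr)
    g_cal, g_test = g.predict(X_cal), g.predict(X_test)

    # 2) Conformal p-values: compare each test unit's predicted alignment with the
    #    calibration units whose true alignment is below the threshold c.
    null_scores = g_cal[np.asarray(A_cal) < c]
    pvals = np.array([(1 + np.sum(null_scores >= t)) / (len(null_scores) + 1)
                      for t in g_test])

    # 3) Benjamini-Hochberg step-up at level alpha for false discovery rate control.
    m = len(pvals)
    order = np.argsort(pvals)
    passed = np.nonzero(pvals[order] <= alpha * np.arange(1, m + 1) / m)[0]
    return order[:passed[-1] + 1] if passed.size else np.array([], dtype=int)
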
Open Source Code: Yes
  LLM response: "The code is available at https://github.com/yugjerry/conformal-alignment."

Open Datasets: Yes
  LLM response: "In this section, we implement and evaluate Conformal Alignment in question-answering tasks, where we consider a conversational question answering dataset TriviaQA [21] and a closed-book reading comprehension dataset CoQA [39]. ... Following the pipeline in Figure 1, we apply our method to (a subset of) the MIMIC-CXR dataset [20]."

Dataset Splits: Yes
  LLM response: "Randomly split D into two disjoint sets: the training set Dtr and the calibration set Dcal. ... Fixing γ1, γ2 ∈ (0, 1) with γ1 + γ2 < 1, we randomly sample (γ1 + γ2)|D| instances without replacement from D as Dtr ... For the results presented in this section, γ1 = 0.2, γ2 = 0.5."

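A minimal sketch of this split, assuming D is an indexable reference set and reading the elided text as assigning a γ1 fraction of the drawn sample to Dtr and a γ2 fraction to Dcal; the function name and seed are hypothetical.

# Illustrative reference-data split with gamma1 = 0.2 and gamma2 = 0.5 as reported.
import numpy as np

def split_reference_data(D, gamma1=0.2, gamma2=0.5, seed=0):
    """Sample (gamma1 + gamma2)*|D| units without replacement, then divide them into
    a training index set D_tr and a calibration index set D_cal (assumed split)."""
    rng = np.random.default_rng(seed)
    n = len(D)
    drawn = rng.choice(n, size=int((gamma1 + gamma2) * n), replace=False)
    n_tr = int(gamma1 * n)
    return drawn[:n_tr], drawn[n_tr:]  # indices of D_tr and D_cal within D
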
Hardware Specification: Yes
  LLM response: "The training process takes about 10 hours on one NVIDIA A100 GPU."

Software Dependencies: No
  LLM response: "The implemented OPT-13B model is from Hugging Face https://huggingface.co/facebook/opt-13b and the implemented LLaMA-2-13B-chat is from https://llama.meta.com. We utilize an off-the-shelf DeBERTa-large model [13] as the NLI classifier to calculate similarities."

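As a side note to the entry above, here is a hedged sketch of NLI-based similarity scoring between generated answers. It assumes the off-the-shelf microsoft/deberta-large-mnli checkpoint on Hugging Face and a symmetrized entailment score; both the checkpoint and the scoring rule are assumptions, not details confirmed by the excerpt.

# Hypothetical NLI similarity between two generated answers using a DeBERTa-large NLI model.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

nli_name = "microsoft/deberta-large-mnli"   # assumed checkpoint
nli_tok = AutoTokenizer.from_pretrained(nli_name)
nli_model = AutoModelForSequenceClassification.from_pretrained(nli_name).eval()

def entailment_prob(premise: str, hypothesis: str) -> float:
    """Probability that `premise` entails `hypothesis` under the NLI classifier."""
    inputs = nli_tok(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = nli_model(**inputs).logits
    # Label order for this checkpoint: 0 = contradiction, 1 = neutral, 2 = entailment.
    return logits.softmax(dim=-1)[0, 2].item()

def similarity(answer_a: str, answer_b: str) -> float:
    """Symmetrized entailment score used as a similarity between two answers."""
    return 0.5 * (entailment_prob(answer_a, answer_b) + entailment_prob(answer_b, answer_a))
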
Experiment Setup: Yes
  LLM response: "For each QA dataset, we use language models OPT-13B [57] and LLaMA-2-13B-chat [47] without finetuning to generate an answer f(X_i) via top-p sampling for each input X_i following the default configuration. ... in specific, we use num_beams=1, do_sample=True, top_p=1.0, top_k=0, temperature=1.0. ... In particular, each raw image is resized to 224 × 224 pixels. We then fine-tune the model on a hold-out dataset with a sample size of 43,300 for 10 epochs with a batch size of 8, and other hyperparameters are set to default values."

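A hedged sketch of the answer-generation step with the sampling flags quoted above, assuming the facebook/opt-13b checkpoint listed under Software Dependencies; the prompt template and max_new_tokens are illustrative choices, not taken from the paper.

# Illustrative top-p sampling call mirroring the quoted configuration.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "facebook/opt-13b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")

def generate_answer(question: str) -> str:
    prompt = f"Question: {question}\nAnswer:"   # hypothetical prompt template
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(
        **inputs,
        num_beams=1,        # no beam search
        do_sample=True,     # sample rather than decode greedily
        top_p=1.0,
        top_k=0,
        temperature=1.0,
        max_new_tokens=64,  # illustrative; not stated in the excerpt
    )
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)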