Conformal Alignment: Knowing When to Trust Foundation Models with Guarantees
Authors: Yu Gui, Ying Jin, Zhimei Ren
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through applications to question answering and radiology report generation, we demonstrate that our method is able to accurately identify units with trustworthy outputs via lightweight training over a moderate amount of reference data. |
| Researcher Affiliation | Academia | Yu Gui (1), Ying Jin (2), Zhimei Ren (3); (1) Department of Statistics, University of Chicago; (2) Data Science Initiative, Harvard University; (3) Department of Statistics and Data Science, University of Pennsylvania |
| Pseudocode | Yes | Algorithm 1 Conformal Alignment |
| Open Source Code | Yes | The code is available at https://github.com/yugjerry/conformal-alignment. |
| Open Datasets | Yes | In this section, we implement and evaluate Conformal Alignment in question-answering tasks, where we consider a conversational question answering dataset TriviaQA [21] and a closed-book reading comprehension dataset CoQA [39]. ... Following the pipeline in Figure 1, we apply our method to (a subset of) the MIMIC-CXR dataset [20]. |
| Dataset Splits | Yes | Randomly split D into two disjoint sets: the training set Dtr and the calibration set Dcal. ... Fixing γ1, γ2 ∈ (0, 1), γ1 + γ2 < 1, we randomly sample (γ1 + γ2)|D| instances without replacement from D as Dtr ... For the results presented in this section, γ1 = 0.2, γ2 = 0.5. |
| Hardware Specification | Yes | The training process takes about 10 hours on one NVIDIA A100 GPU. |
| Software Dependencies | No | The implemented OPT-13B model is from Hugging Face https://huggingface.co/facebook/opt-13b and the implemented LLaMA-2-13B-chat is from https://llama.meta.com. We utilize an off-the-shelf DeBERTa-large model [13] as the NLI classifier to calculate similarities. |
| Experiment Setup | Yes | For each QA dataset, we use language models OPT-13B [57] and LLaMA-2-13B-chat [47] without finetuning to generate an answer f(Xi) via top-p sampling for each input Xi following the default configuration. ... in specific, we use num_beams=1, do_sample=True, top_p=1.0, top_k=0, temperature=1.0. ... In particular, each raw image is resized to 224 × 224 pixels. We then fine-tune the model on a hold-out dataset with a sample size of 43,300 for 10 epochs with a batch size of 8, and other hyperparameters are set to default values. |
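
The Dataset Splits row quotes a random partition of the reference data D into a training set Dtr and a calibration set Dcal, with fractions γ1 = 0.2 and γ2 = 0.5 and sampling without replacement. A minimal sketch of such a split is shown below; the function and variable names are illustrative and are not taken from the released code.

```python
import numpy as np

def split_reference_data(n, gamma1=0.2, gamma2=0.5, seed=0):
    """Randomly partition indices 0..n-1 into D_tr, D_cal, and a held-out test pool."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)              # sampling without replacement
    n_tr = int(gamma1 * n)                 # gamma1 * |D| training instances
    n_cal = int(gamma2 * n)                # gamma2 * |D| calibration instances
    D_tr = perm[:n_tr]
    D_cal = perm[n_tr:n_tr + n_cal]
    D_test = perm[n_tr + n_cal:]           # remaining units to be screened
    return D_tr, D_cal, D_test
```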
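The Pseudocode row cites Algorithm 1 (Conformal Alignment) without reproducing it. As a rough illustration of the conformal-selection style workflow described in the paper's abstract (fit a lightweight alignment predictor on Dtr, calibrate on Dcal, then select test units deemed trustworthy with false discovery rate control), here is a schematic sketch. The choice of predictor, the score convention, the p-value formula, and the multiple-testing step are assumptions for illustration and may differ from the authors' Algorithm 1.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def conformal_alignment_select(X_tr, A_tr, X_cal, A_cal, X_test, alpha=0.1):
    """Schematic conformal-selection procedure (not the paper's exact algorithm).

    A_* are binary alignment labels on reference data (1 = aligned output).
    Returns indices of test units selected as trustworthy at nominal FDR level alpha.
    """
    # 1. Fit a lightweight alignment predictor on the training split.
    g = LogisticRegression(max_iter=1000).fit(X_tr, A_tr)
    s_cal = g.predict_proba(X_cal)[:, 1]    # predicted alignment scores, calibration
    s_test = g.predict_proba(X_test)[:, 1]  # predicted alignment scores, test

    # 2. Conformal p-values: compare each test score against calibration units
    #    whose true label is "not aligned" (the null units).
    null_scores = s_cal[np.asarray(A_cal) == 0]
    m = len(null_scores)
    pvals = np.array([(1 + np.sum(null_scores >= s)) / (m + 1) for s in s_test])

    # 3. Benjamini-Hochberg over the test p-values to control the FDR.
    order = np.argsort(pvals)
    cutoffs = alpha * np.arange(1, len(pvals) + 1) / len(pvals)
    passing = np.nonzero(pvals[order] <= cutoffs)[0]
    if passing.size == 0:
        return np.array([], dtype=int)
    k = passing.max() + 1                   # largest index meeting the BH criterion
    return np.sort(order[:k])
```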
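The Experiment Setup row lists the exact decoding configuration used for answer generation (num_beams=1, do_sample=True, top_p=1.0, top_k=0, temperature=1.0). A hedged sketch of invoking OPT-13B with those settings through Hugging Face transformers follows; the model id comes from the Software Dependencies row, while the prompt and the max_new_tokens budget are illustrative assumptions not stated in the table.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "facebook/opt-13b"  # from the Software Dependencies row
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

prompt = "Question: Who wrote 'Pride and Prejudice'?\nAnswer:"  # illustrative prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Decoding configuration quoted in the Experiment Setup row.
outputs = model.generate(
    **inputs,
    num_beams=1,
    do_sample=True,
    top_p=1.0,
    top_k=0,
    temperature=1.0,
    max_new_tokens=32,   # token budget is an assumption
)
answer = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(answer)
```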