Large language model validity via enhanced conformal prediction methods

Authors: John Cherian, Isaac Gibbs, Emmanuel Candès

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the efficacy of our approach on biography and medical question-answering datasets.
Researcher Affiliation | Academia | John J. Cherian, Department of Statistics, Stanford University, jcherian@stanford.edu; Isaac Gibbs, Department of Statistics, Stanford University, igibbs@stanford.edu; Emmanuel J. Candès, Department of Statistics and Department of Mathematics, Stanford University, candes@stanford.edu
Pseudocode | Yes | Algorithm 1, which presents a complete description of the procedure, can be found in Appendix C.
Open Source Code | Yes | We release a filtered version of the MedLFQA benchmark that removes some non-health-related prompts, the generated and parsed text used to run our experiments, as well as the notebooks used to produce the figures in this paper at github.com/jjcherian/conformal-safety. We also update our Python package for conditional conformal inference to support level-adaptive conformal prediction. This package is available to download at github.com/jjcherian/conditional-conformal and can be installed from PyPI.
Open Datasets | Yes | Our experiment considers the long-form medical question-answering dataset (MedLFQA) published in Jeong et al. [13]. It combines several previously established benchmarks in the medical question-answering literature: HealthSearchQA (n = 3047), K-QA (n = 1077), LiveQA (n = 100), and MedicationQA (n = 627). Each prompt in these datasets is also accompanied by either an LLM- or human-generated response [6, 1, 19, 26].
Dataset Splits | Yes | To do this, we split our training data into two folds: one is used to estimate α(·) and the other is used to run the calibration method described above. After making these splits, α(·) can be learned using any regression method. In our experiments, we will aim to learn the smallest possible values of α(·) that meet our target quality criterion. A detailed description of our procedure for doing so is given in Appendix B.1.
Hardware Specification | No | The majority of the experiments performed in this paper were run on a MacBook Pro with a total runtime of less than a few hours. Since this is not a significant use of computational resources, we do not include this information in the paper. The remaining computational burden of this paper consists of calls to the OpenAI GPT API, which again are relatively low-cost and take only a few hours to run.
Software Dependencies | No | The paper mentions a 'Python package' and installation from 'PyPI' but does not provide specific version numbers for Python or any other libraries used in the experiments.
Experiment Setup | Yes | The function class F is defined by the linear combination of an intercept, the number of characters in the prompt, the number of characters in the response, the mean frequency score assigned to the claims, the standard deviation of the frequency scores assigned to the claims, and group indicators corresponding to the source dataset. The results shown in Figure 2 are obtained by running the conditional boosting algorithm for 1000 steps using the Adam optimizer with learning rate set to 0.001.
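
The Open Datasets row above lists the four MedLFQA sub-benchmarks and their sizes. A minimal tabulation sketch follows, with the counts taken from the quoted response; the dictionary layout is ours, not the authors' code, and the total is the count before their health-relevance filtering:

```python
# Sub-benchmark sizes as quoted from the paper; the dict layout is ours.
MEDLFQA_SUBSETS = {
    "HealthSearchQA": 3047,
    "K-QA": 1077,
    "LiveQA": 100,
    "MedicationQA": 627,
}

# Total prompt count before the authors' filtering of
# non-health-related prompts.
total_prompts = sum(MEDLFQA_SUBSETS.values())
print(total_prompts)  # 4851
```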
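The Dataset Splits row describes a two-fold scheme: one fold is used to learn the adaptive level α(·) by regression, the other to run the calibration step. Below is a minimal sketch under stated assumptions; the 50/50 split, the gradient-boosting regressor, and the synthetic targets are placeholders, and the authors' actual procedure for finding the smallest feasible α(·) is given in their Appendix B.1:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                      # stand-in prompt/response features
alpha_targets = rng.uniform(0.05, 0.3, size=1000)   # stand-in quality-driven targets

# Fold 1: fit any regression model for the adaptive level alpha(x).
X_fit, X_cal, y_fit, y_cal = train_test_split(
    X, alpha_targets, test_size=0.5, random_state=0
)
alpha_model = GradientBoostingRegressor().fit(X_fit, y_fit)

# Fold 2: reserved for the conditional conformal calibration step,
# run at the learned levels alpha_model.predict(X_cal).
alpha_hat = alpha_model.predict(X_cal)
```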
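The Experiment Setup row specifies the feature class and the optimizer settings (1000 Adam steps at learning rate 0.001). The sketch below mirrors those settings but substitutes a placeholder least-squares objective for the paper's conditional boosting loss; the feature-builder name, the example values, and the synthetic data are all assumptions:

```python
import torch

def build_features(prompt_chars, response_chars, freq_scores, group_onehot):
    """Assemble one feature row: intercept, prompt/response character
    counts, mean and std of the per-claim frequency scores, and
    source-dataset indicators, as listed in the quoted setup."""
    scalars = torch.stack([
        torch.tensor(1.0),                    # intercept
        torch.tensor(float(prompt_chars)),    # characters in the prompt
        torch.tensor(float(response_chars)),  # characters in the response
        freq_scores.mean(),                   # mean frequency score
        freq_scores.std(),                    # std of the frequency scores
    ])
    return torch.cat([scalars, group_onehot])

# Example row with made-up values and four source-dataset groups.
x_row = build_features(420, 1337, torch.rand(12), torch.eye(4)[0])

# Stand-in data with the same dimensionality as x_row.
X = torch.randn(512, x_row.numel())
y = torch.randn(512)

# 1000 Adam steps at lr=0.001, matching the quoted settings; the
# squared-error loss is a placeholder for the paper's conditional
# boosting objective (their Algorithm 1).
theta = torch.zeros(X.shape[1], requires_grad=True)
opt = torch.optim.Adam([theta], lr=0.001)
for _ in range(1000):
    opt.zero_grad()
    loss = torch.mean((X @ theta - y) ** 2)
    loss.backward()
    opt.step()
```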