Efficient Conformal Prediction via Cascaded Inference with Expanded Admission
Authors: Adam Fisch, Tal Schuster, Tommi S. Jaakkola, Regina Barzilay
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the empirical effectiveness of our approach for multiple applications in natural language processing and computational chemistry for drug discovery. We empirically validate our approach on information retrieval for fact verification, open-domain question answering, and in-silico screening for drug discovery. We empirically evaluate our method on three different tasks with standard, publicly available datasets. |
| Researcher Affiliation | Academia | Adam Fisch, Tal Schuster, Tommi Jaakkola, Regina Barzilay. Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology. {fisch,tals,tommi,regina}@csail.mit.edu |
| Pseudocode | Yes | Algorithm 1: Cascaded inductive conformal prediction with distribution-free marginal coverage. (A minimal sketch of this cascade appears after the table.) |
| Open Source Code | Yes | Our code is available at https://github.com/ajfisch/conformal-cascades. |
| Open Datasets | Yes | We use the open-domain setting of the Natural Questions dataset (Kwiatkowski et al., 2019). We use the FEVER dataset (Thorne et al., 2018). Using the ChEMBL database (Mayr et al., 2018). |
| Dataset Splits | Yes | For each task, we use a proper training, validation, and test set. We retain 6750/8757 questions from the validation set and 2895/3610 from the test set. We follow the dataset splits of the ERASER benchmark (DeYoung et al., 2020), which contain 97,957 claims for training, 6,122 claims for validation, and 6,111 claims for test. We split the ChEMBL dataset 60-20-20 over molecules: 60% of molecules go to a train set, 20% to a validation set, and 20% to a test set. (A split sketch appears after the table.) |
| Hardware Specification | No | The paper states: 'In this work we do not measure wall-clock times as these are hardware-specific, and depend heavily on optimized implementations.' It also mentions 'even on a single CPU' for the RF model, but no specific CPU or GPU models are provided. |
| Software Dependencies | No | The paper mentions software like 'Gensim library', 'ALBERT-Base', 'chemprop repository', and 'Scikit library' but does not provide specific version numbers for these dependencies, which are required for reproducibility. |
| Experiment Setup | Yes | We perform model selection specifically for CP on the validation set, and report final numbers on the test set. The QA and IR cascades use the Simes correction for MHT, while the DR cascade uses the Bonferroni correction (both combinations are sketched after the table). For each token, the model outputs independent scores for being the start or end of the answer span. We also follow Karpukhin et al. (2020) by using the output of the [CLS] token to get a passage selection score from the reader model. We collect 10 negative pairs for each positive one by randomly selecting other sentences from the same article as the correct evidence. We limit the number of negative samples (incorrect answers) to the top 64 incorrect predictions of the EXT model. The final prediction is based on an ensemble of 5 models trained with different random seeds. |
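
The cascade in Algorithm 1 filters a candidate set through progressively more expensive models, keeping only candidates whose conformal p-value survives a corrected per-stage threshold. Below is a minimal Python sketch of that idea; the names `stage_scorers` and `stage_calib_scores`, and the equal Bonferroni-style split of the miscoverage budget `epsilon` across stages, are illustrative assumptions rather than the paper's exact implementation.

```python
import numpy as np

def conformal_pvalue(score, calibration_scores):
    """Inductive conformal p-value: fraction of calibration nonconformity
    scores at least as extreme as the candidate's score (higher = worse)."""
    calib = np.asarray(calibration_scores)
    return (np.sum(calib >= score) + 1) / (len(calib) + 1)

def cascaded_icp(candidates, stage_scorers, stage_calib_scores, epsilon):
    """Sketch of a conformal cascade in the spirit of Algorithm 1.

    stage_scorers[j] is a (hypothetical) nonconformity scorer for stage j,
    ordered cheapest to most expensive; stage_calib_scores[j] holds that
    stage's calibration scores. Splitting epsilon equally across the m
    stages is a Bonferroni-style correction, so the surviving set keeps
    marginal coverage at level 1 - epsilon.
    """
    m = len(stage_scorers)
    survivors = list(candidates)
    for scorer, calib in zip(stage_scorers, stage_calib_scores):
        survivors = [
            x for x in survivors
            if conformal_pvalue(scorer(x), calib) > epsilon / m
        ]
        if not survivors:  # early exit: nothing left to pass downstream
            break
    return survivors
```

Because cheap stages prune most candidates, the expensive final model only scores the few survivors, which is the source of the efficiency gain.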
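The Simes and Bonferroni corrections referenced in the experiment setup combine the m per-stage p-values of a candidate into a single value that is compared against epsilon. The sketch below uses the standard definitions of both tests; the function names are ours.

```python
import numpy as np

def bonferroni_pvalue(pvals):
    """Bonferroni-combined p-value over m stages: m * min(p), capped at 1."""
    return min(1.0, len(pvals) * min(pvals))

def simes_pvalue(pvals):
    """Simes-combined p-value: min over sorted p-values of m * p_(k) / k."""
    p = np.sort(np.asarray(pvals, dtype=float))
    m = len(p)
    return float(min(1.0, np.min(m * p / np.arange(1, m + 1))))

# A candidate stays in the prediction set while its combined p-value
# exceeds epsilon. Simes is never more conservative than Bonferroni:
print(bonferroni_pvalue([0.04, 0.05]))  # 2 * 0.04 = 0.08
print(simes_pvalue([0.04, 0.05]))       # min(2*0.04/1, 2*0.05/2) = 0.05
```

Simes yields tighter sets when the per-stage p-values are positively dependent, which is consistent with its use for the QA and IR cascades, while Bonferroni is valid under arbitrary dependence.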
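For the ChEMBL 60-20-20 split, the key detail is that the split is taken over molecules, so no molecule appears in more than one split. A hypothetical sketch of such a split (the function name and seed are ours):

```python
import random

def split_molecules(molecule_ids, seed=0):
    """60/20/20 train/validation/test split at the molecule level, so no
    molecule is shared across splits."""
    ids = sorted(set(molecule_ids))          # deduplicate, fix ordering
    random.Random(seed).shuffle(ids)         # seeded shuffle for repeatability
    n_train = int(0.6 * len(ids))
    n_val = int(0.2 * len(ids))
    return (ids[:n_train],
            ids[n_train:n_train + n_val],
            ids[n_train + n_val:])
```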