Conformal Language Modeling

Authors: Victor Quach, Adam Fisch, Tal Schuster, Adam Yala, Jae Ho Sohn, Tommi S. Jaakkola, Regina Barzilay

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Furthermore, we empirically demonstrate that we can achieve many desired coverage levels within a limited number of total samples when applying our method to multiple tasks in open-domain question answering, text summarization, and radiology report generation using different LM variants.
Researcher Affiliation | Collaboration | Victor Quach (1), Adam Fisch (1), Tal Schuster (2), Adam Yala (3,4), Jae Ho Sohn (4), Tommi Jaakkola (1), Regina Barzilay (1). Affiliations: 1 CSAIL, MIT; 2 Google Research; 3 UC Berkeley; 4 UCSF.
Pseudocode | Yes | Algorithm 1: Conformal sampling with rejection (a hedged sketch of this procedure appears after the table).
Open Source Code | Yes | Reproducibility Statement: Code is available at https://github.com/Varal7/conformal-language-modeling.
Open Datasets | Yes | For the radiology report generation experiment, we utilized the labeled MIMIC-CXR and MIMIC-CXR-JPG datasets (Johnson et al., 2019). The MIMIC-CXR dataset can be accessed at https://physionet.org/content/mimic-cxr/2.0.0/ under the PhysioNet Credentialed Health Data License 1.5.0. Similarly, the MIMIC-CXR-JPG dataset is available at https://physionet.org/content/mimic-cxr-jpg/2.0.0/ under the same license. We use the CNN/DM dataset (Hermann et al., 2015; See et al., 2017), which includes news articles from CNN and the Daily Mail paired with their human-written summaries, and is available at https://github.com/abisee/cnn-dailymail under the MIT License. We use the TriviaQA dataset (Joshi et al., 2017), available at https://nlp.cs.washington.edu/triviaqa/ under the Apache License Version 2.0.
Dataset Splits | Yes | We start with the standard splits prescribed in MIMIC-CXR-JPG. However, we further divide the training set into a train set and a dev set using a 0.9/0.1 ratio. The train set is used for training the model, with the validation set used for early stopping. We then exclusively use the dev set for conformal prediction experiments. Table F.1: Dataset statistics for preprocessed MIMIC-CXR. The splits and preprocessing scripts are available within our code release. The train and validation splits are used to train the encoder-decoder model with early stopping. The dev set is used for conformal prediction. The test set is unused.
Hardware Specification | Yes | We trained the model with a batch size of 128 distributed over 8 GPUs, resulting in a batch size of 16 per GPU. The total training time on 8 RTX A6000 GPUs was approximately 11 hours. We finetune the model on the train set for 200k steps with a batch size of 128 using 64 TPUv4 chips for approximately 40 hours.
Software Dependencies | No | The paper mentions software such as the "Transformers library (Wolf et al., 2019)", "GPT2-small (gpt2 on Hugging Face)", "T5-XL (Raffel et al., 2020)", "LLaMA-13B (Touvron et al., 2023)", and "CheXbert (Smit et al., 2020)". While it refers to specific models and libraries, it does not provide explicit version numbers for these software components (e.g., PyTorch 1.9, CUDA 11.1).
Experiment Setup | Yes | We set k_max = 20 for all experiments. We trained the model with a batch size of 128 distributed over 8 GPUs, resulting in a batch size of 16 per GPU. The AdamW optimizer was employed with β1 = 0.9, β2 = 0.999, and ε = 10^-8. The learning rate was set to 5 × 10^-5. The training process consisted of 10 epochs. To generate candidate responses, we use nucleus sampling (Holtzman et al., 2020) with top-p set to 0.95, temperature 0.7, and maximum output length set to 256 tokens. (An illustrative decoding snippet using these settings appears after the table.)
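
The Pseudocode row cites Algorithm 1, "Conformal sampling with rejection." The sketch below is a minimal, hedged reading of such a sample-then-reject loop, not the authors' implementation: the helper functions (sample_fn, quality_fn, set_confidence_fn) and the thresholds (lambda_reject, lambda_stop) are illustrative names, and in the paper the thresholds are calibrated to give conformal risk-control guarantees rather than chosen by hand.

```python
# Minimal sketch of a "conformal sampling with rejection" loop.
# All names below (sample_fn, quality_fn, set_confidence_fn, lambda_reject,
# lambda_stop) are illustrative placeholders, not the paper's API; the
# thresholds are assumed to have been calibrated beforehand.

def conformal_sample_with_rejection(
    sample_fn,          # () -> str: draws one candidate response from the LM
    quality_fn,         # (str) -> float: per-candidate quality estimate
    set_confidence_fn,  # (list[str]) -> float: confidence in the current set
    lambda_reject,      # reject candidates scoring below this threshold
    lambda_stop,        # stop once the set confidence reaches this threshold
    k_max=20,           # the paper sets k_max = 20 in all experiments
):
    output_set = []
    for _ in range(k_max):
        candidate = sample_fn()
        # Rejection step: keep only sufficiently good, non-duplicate candidates
        # (exact-match dedup here stands in for any similarity-based filter).
        if quality_fn(candidate) >= lambda_reject and candidate not in output_set:
            output_set.append(candidate)
        # Stopping rule: halt early once the accumulated set is confident enough.
        if output_set and set_confidence_fn(output_set) >= lambda_stop:
            break
    return output_set
```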
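
The decoding settings quoted in the Experiment Setup row (nucleus sampling with top-p = 0.95, temperature 0.7, and a 256-token output cap) can be approximated with the Hugging Face Transformers generate API, as sketched below. The model name ("gpt2"), the prompt, and the number of returned sequences are assumptions for illustration, not the paper's exact configuration.

```python
# Rough reproduction of the quoted decoding settings with Hugging Face
# Transformers. Model, prompt, and num_return_sequences are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Q: Who wrote 'The Old Man and the Sea'?\nA:", return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True,           # stochastic decoding
    top_p=0.95,               # nucleus sampling threshold from the paper
    temperature=0.7,
    max_new_tokens=256,       # maximum output length
    num_return_sequences=5,   # draw several candidates per prompt
    pad_token_id=tokenizer.eos_token_id,
)
candidates = tokenizer.batch_decode(outputs, skip_special_tokens=True)
```

Candidates drawn this way would feed the rejection loop sketched above, which caps the total number of draws at k_max = 20.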