Conformal Autoregressive Generation: Beam Search with Coverage Guarantees

Authors: Nicolas Deutschmann, Marvin Alberts, María Rodríguez Martínez

AAAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We provide marginal coverage bounds for each method, and evaluate them empirically on a selection of tasks drawing from natural language processing and chemistry.
Researcher Affiliation | Industry | Nicolas Deutschmann, Marvin Alberts, María Rodríguez Martínez (IBM Research); deu@zurich.ibm.com, marvin.alberts@ibm.com, mrm@zurich.ibm.com
Pseudocode | Yes | Calibration Algorithm: We consider N_0 + 1 exchangeable pairs {(X_i, S_i)} and a family of conformal scores σ_l that can be evaluated on length-l sequences. Selecting the first N_0 samples as C^(0)_{N_0}, we specify a per-step confidence level 1 − α and calibrate iteratively as follows: at the l-th step, 1. Define k^(l)_α = ⌊(N_{l−1} + 1) α⌋. 2. Order the calibration set by increasing length-l scores σ_l(X_1, S_{1|l}) ≤ ... ≤ σ_l(X_{N_{l−1}}, S_{N_{l−1}|l}), where S_{i|l} is the length-l truncation of S_i. 3. Define t^(l)_{α, N_{l−1}} = σ_l(X_{k^(l)_α}, S_{k^(l)_α|l}). 4. Set N_l = N_{l−1} − k^(l)_α and C^(l)_{N_l} = {(X_i, S_i)}_{i > k^(l)_α}.
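The calibration loop quoted in this row can be sketched in Python. The function name, the `scores_per_step` layout, and the list-based bookkeeping are assumptions for illustration, not the authors' implementation:

```python
import math

def calibrate_thresholds(scores_per_step, alpha):
    """Sketch of the iterative conformal calibration described above.

    scores_per_step[l][i] is the length-(l+1) conformal score
    sigma_l(X_i, S_i|l) of calibration sample i (hypothetical layout).
    alpha is the per-step miscoverage level (confidence 1 - alpha).
    Returns the list of per-step thresholds t_alpha^(l).
    """
    active = list(range(len(scores_per_step[0])))  # C^(0): all N_0 samples
    thresholds = []
    for step_scores in scores_per_step:
        n_prev = len(active)                       # N_{l-1}
        k = math.floor((n_prev + 1) * alpha)       # k_alpha^(l)
        assert k >= 1, "alpha too small for the calibration set size"
        active.sort(key=lambda i: step_scores[i])  # order by increasing score
        thresholds.append(step_scores[active[k - 1]])  # k-th smallest score
        active = active[k:]                        # drop the k lowest-scoring pairs
    return thresholds
```

At each step the k lowest-scoring calibration pairs are discarded, so the surviving set C^(l) shrinks as decoding proceeds, mirroring the N_l = N_{l−1} − k^(l)_α update in the quoted algorithm.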
Open Source Code | No | The paper does not provide any statement about releasing source code, nor does it include links to a code repository.
Open Datasets | Yes | We use the USPTO-MIT dataset (Jin et al. 2017) and the tokenization scheme by Schwaller et al. (2019) for training and evaluation, holding out 30k samples with length lower than 50 for calibration and testing.
Dataset Splits | No | For the Integer Additions task, the paper states: "We sample 130k such additions which we split into 100k training examples and 30k held out samples. On the validation set, this model has a mean coverage of 96% using 5-sequence beam search." For the Chemical Reaction Product Prediction task: "holding out 30k samples with length lower than 50 for calibration and testing. On the validation set, this model has a mean coverage of 64% using 5-sequence beam search." Although validation sets are mentioned, the paper does not specify their size or how they are derived from the 30k held-out samples (e.g., what fraction forms the validation set), which limits reproducibility.
Hardware Specification | No | The paper mentions transformer models (T5-base, T5-small), which typically run on GPUs, but it does not specify any details about the hardware used, such as GPU models, CPU types, or memory.
Software Dependencies | No | The paper mentions using "an off-the-shelf t5-base sequence-to-sequence model from Hugging Face" and training "a t5-small from scratch", but it does not specify version numbers for these models, libraries, or any other software dependencies.
Experiment Setup | Yes | For all experiments, we use the length-normalized sequence probability under the model, π(S|X)/|S|, as the conformal confidence score. For our benchmark tasks, we exploit a simplification of our dataset definitions: we know the maximum sequence length in advance, respectively 5 and 50 tokens for additions and chemical reactions. We use this knowledge to set the maximum number of decoding steps to 5 and 50 and avoid discussing rare long sequences. We also specify token-wise confidence levels (1 − α) ∈ {0.99, 0.98, 0.95} for additions and (1 − α) ∈ {0.995, 0.99} for reactions.
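As a sketch, the length-normalized conformal score quoted in this row could be computed from per-token log-probabilities as follows; the helper name and the input format are illustrative assumptions, not the authors' code:

```python
import math

def length_normalized_score(token_logprobs):
    """Length-normalized sequence probability pi(S|X) / |S|.

    token_logprobs: log-probabilities of each token of candidate S
    under the model pi (illustrative input format).
    """
    seq_prob = math.exp(sum(token_logprobs))  # pi(S|X): product of token probs
    return seq_prob / len(token_logprobs)     # normalize by sequence length |S|
```

Working in log space before the final `exp` avoids underflow when sequences are long, which matters for the 50-token reaction outputs.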