Confident Adaptive Language Modeling

Authors: Tal Schuster, Adam Fisch, Jai Gupta, Mostafa Dehghani, Dara Bahri, Vinh Tran, Yi Tay, Donald Metzler

NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through theoretical analysis and empirical experiments on three diverse text generation tasks, we demonstrate the efficacy of our framework in reducing compute (potential speedup of up to ×3) while provably maintaining high performance.
Researcher Affiliation | Collaboration | Tal Schuster1, Adam Fisch2, Jai Gupta1, Mostafa Dehghani1, Dara Bahri1, Vinh Q. Tran1, Yi Tay1, Donald Metzler1; 1Google Research, 2CSAIL, MIT
Pseudocode | Yes | An algorithm of the full procedure is provided in Appendix E.
Open Source Code | Yes | Code: https://github.com/google-research/t5x/tree/main/t5x/contrib/calm
Open Datasets | Yes | We empirically evaluate our methods on three popular text generation tasks that vary in their target generation length and extractive degree against the input. CNN/DM [31] is a collection of news articles to be summarized in a few sentences. WMT15 EN-FR [13] contains English sentences (one per example) to be machine translated to French. Open-book SQuAD 1.1 [54] is a QA dataset with Wikipedia paragraphs paired with questions, where the target answer is a text span from the input.
Dataset Splits | Yes | For each task, we use the validation and test sets to evaluate our calibration method (§4) (for SQuAD we only use the validation set, as the test answers are hidden). We run 50 random trials per target tolerance δ and consistency objective (textual or risk), where we partition the data into 80% calibration (S_cal) and 20% test (P_test). A sketch of this repeated-split protocol is given after the table.
Hardware Specification | Yes | Also, we compute an estimated speedup of the whole encoder-decoder model for generating the full sequence, based on TPUv3 benchmarking with 200 examples in Colab (see App. C for details).
Software Dependencies | No | The paper mentions using the 'T5X framework [55]' and 'JAX: composable transformations of Python+NumPy programs', but does not specify version numbers for these software dependencies or other key libraries.
Experiment Setup | Yes | We use the 8-layer T5 1.1 model that doesn't share input and output embeddings. We share all output embeddings for the softmax predictions, and the early-exit classifier across all decoder layers. Based on validation results, we set the temperature of our decaying threshold to τ = 4 for the softmax and classifier measures of CNN/DM and WMT. In other settings, we use τ = 0. An illustrative sketch of such a decaying threshold is given below.
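
The repeated calibration/test splits quoted under Dataset Splits can be sketched as follows. This is a minimal illustration, not the authors' released code: choose_threshold and evaluate_consistency are hypothetical callables standing in for the paper's threshold calibration and its textual/risk consistency measures, and must be supplied by the caller.

```python
import numpy as np

def calibration_trials(examples, deltas, choose_threshold, evaluate_consistency,
                       n_trials=50, cal_frac=0.8, seed=0):
    """Repeat a random 80%/20% calibration/test partition for each tolerance delta.

    choose_threshold(cal_set, delta) and evaluate_consistency(test_set, threshold)
    are hypothetical stand-ins for the paper's calibration procedure and its
    textual/risk consistency evaluation.
    """
    rng = np.random.default_rng(seed)
    results = {}
    for delta in deltas:
        per_trial = []
        for _ in range(n_trials):
            perm = rng.permutation(len(examples))
            n_cal = int(cal_frac * len(examples))
            s_cal = [examples[i] for i in perm[:n_cal]]    # 80% calibration set (S_cal)
            p_test = [examples[i] for i in perm[n_cal:]]   # 20% held-out test set (P_test)
            lam = choose_threshold(s_cal, delta)           # calibrated exit threshold
            per_trial.append(evaluate_consistency(p_test, lam))
        results[delta] = float(np.mean(per_trial))         # average over the 50 trials
    return results
```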
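The decaying confidence threshold with temperature τ mentioned under Experiment Setup can be illustrated with a simple exponential decay over decoding positions. The functional form, lambda_init, and max_len below are assumptions for illustration only (the exact decay function is defined in the paper); the intent is that earlier tokens require higher confidence to exit the decoder early, while τ = 0 reduces to a constant threshold.

```python
import math

def decaying_threshold(lambda_init, t, max_len, tau=4.0):
    """Per-position early-exit threshold (assumed illustrative form).

    Decays exponentially from lambda_init as the decoding position t grows,
    controlled by the temperature tau; tau = 0 yields a constant threshold.
    """
    return lambda_init * math.exp(-tau * t / max_len)

def should_exit_early(confidence, lambda_init, t, max_len, tau=4.0):
    """Exit the decoder stack at the current layer once the per-token confidence
    (e.g., a softmax-based measure or the early-exit classifier score) clears
    the decayed threshold for position t."""
    return confidence >= decaying_threshold(lambda_init, t, max_len, tau)

# Example: with lambda_init = 0.9 and tau = 4, the first token requires
# confidence 0.9, while a token near the end of a 64-token sequence only
# needs roughly 0.9 * exp(-4) ≈ 0.02 to exit early.
```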