Confident Adaptive Language Modeling
Authors: Tal Schuster, Adam Fisch, Jai Gupta, Mostafa Dehghani, Dara Bahri, Vinh Tran, Yi Tay, Donald Metzler
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through theoretical analysis and empirical experiments on three diverse text generation tasks, we demonstrate the efficacy of our framework in reducing compute (potential speedup of up to ×3) while provably maintaining high performance. |
| Researcher Affiliation | Collaboration | Tal Schuster¹, Adam Fisch², Jai Gupta¹, Mostafa Dehghani¹, Dara Bahri¹, Vinh Q. Tran¹, Yi Tay¹, Donald Metzler¹ (¹Google Research; ²CSAIL, MIT) |
| Pseudocode | Yes | An algorithm of the full procedure is provided in Appendix E. |
| Open Source Code | Yes | 1Code: https://github.com/google-research/t5x/tree/main/t5x/contrib/calm |
| Open Datasets | Yes | We empirically evaluate our methods on three popular text generation tasks that vary in their target generation length and extractive degrees against the input. CNN/DM [31] is a collection of news articles to be summarized in a few sentences. WMT15 EN-FR [13] contains English sentences (one per example) to be machine translated to French. Open-book SQuAD 1.1 [54] is a QA dataset with Wikipedia paragraphs paired with questions, where the target answer is a text span from the input. |
| Dataset Splits | Yes | For each task, we use the validation and test sets to evaluate our calibration method (§4) (for SQuAD we only use the validation set as the test answers are hidden). We run 50 random trials per target tolerance δ and consistency objective (textual or risk), where we partition the data into 80% calibration (S_cal) and 20% test (P_test). |
| Hardware Specification | Yes | Also, we compute an estimated speedup of the whole encoder-decoder model for generating the full sequence, based on TPUv3 benchmarking with 200 examples in Colab (see App. C for details). |
| Software Dependencies | No | The paper mentions using the 'T5x framework [55]' and 'JAX: composable transformations of Python+NumPy programs' but does not specify version numbers for these software dependencies or other key libraries. |
| Experiment Setup | Yes | We use the 8-layer T5 1.1 model that doesn't share input and output embeddings. We share all output embeddings for the softmax predictions, and the early-exit classifier, across all decoder layers. Based on validation results, we set the temperature of our decaying threshold to τ = 4 for the softmax and classifier measures of CNN/DM and WMT. In other settings, we use τ = 0. |
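The repeated calibration/test partition described in the Dataset Splits row (50 random trials, each splitting the data 80% calibration / 20% test) can be sketched as below. This is a minimal illustration; the function name `calibration_trials` and its parameters are hypothetical, not taken from the paper's released code.

```python
import random

def calibration_trials(examples, num_trials=50, cal_frac=0.8, seed=0):
    """Yield repeated random (calibration, test) partitions of the data.

    examples:   list of evaluation examples
    num_trials: number of random trials (the paper uses 50)
    cal_frac:   fraction assigned to the calibration set S_cal (80%);
                the remainder forms the test set P_test (20%)
    """
    rng = random.Random(seed)
    n_cal = int(len(examples) * cal_frac)
    for _ in range(num_trials):
        shuffled = examples[:]          # copy so the input stays untouched
        rng.shuffle(shuffled)
        yield shuffled[:n_cal], shuffled[n_cal:]

# Usage: run the calibration procedure once per trial and aggregate
# metrics over all trials for a given tolerance delta.
data = list(range(100))                 # stand-in for evaluation examples
splits = list(calibration_trials(data))
```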
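The Experiment Setup row refers to a decaying early-exit confidence threshold controlled by a temperature τ. The paper defines the exact schedule; as a hypothetical illustration only, a temperature-controlled exponential decay that is consistent with the quoted settings (τ = 0 leaves the threshold constant) could look like:

```python
import math

def decaying_threshold(base_lambda, t, seq_len, tau):
    """Hypothetical temperature-controlled decay of an early-exit
    confidence threshold across decoding steps.

    base_lambda: initial confidence threshold in [0, 1]
    t:           current decoding step (0-indexed)
    seq_len:     maximum target sequence length
    tau:         decay temperature; tau = 0 keeps the threshold constant,
                 larger tau relaxes the threshold faster late in decoding
    """
    return base_lambda * math.exp(-tau * t / seq_len)
```

At each decoder layer, the model would exit early once its confidence measure (softmax-based or classifier-based) exceeds this per-step threshold; see the paper and the linked code for the actual schedule used.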