Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.
Confident Adaptive Language Modeling
Authors: Tal Schuster, Adam Fisch, Jai Gupta, Mostafa Dehghani, Dara Bahri, Vinh Tran, Yi Tay, Donald Metzler
NeurIPS 2022 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through theoretical analysis and empirical experiments on three diverse text generation tasks, we demonstrate the efficacy of our framework in reducing compute (speedup of up to ×3) while provably maintaining high performance. |
| Researcher Affiliation | Collaboration | Tal Schuster¹, Adam Fisch², Jai Gupta¹, Mostafa Dehghani¹, Dara Bahri¹, Vinh Q. Tran¹, Yi Tay¹, Donald Metzler¹; ¹Google Research, ²CSAIL, MIT |
| Pseudocode | Yes | An Algorithm of the full procedure is provided in Appendix E. |
| Open Source Code | Yes | Code: https://github.com/google-research/t5x/tree/main/t5x/contrib/calm |
| Open Datasets | Yes | We empirically evaluate our methods on three popular text generation tasks that vary in their target generation length and extractive degrees against the input. CNN/DM [31] is a collection of news articles to be summarized in few sentences. WMT15 EN-FR [13] contains English sentences (one per example) to be machine translated to French. Open-book SQUAD 1.1 [54] is a QA dataset with Wikipedia paragraphs paired with questions, where the target answer is a text span from the input. |
| Dataset Splits | Yes | For each task, we use the validation and test sets to evaluate our calibration method (§4) (for SQUAD we only use the validation set as the test answers are hidden). We run 50 random trials per target tolerance δ and consistency objective (textual or risk), where we partition the data to 80% calibration (S_cal) and 20% test (P_test). |
| Hardware Specification | Yes | Also, we compute an estimated speedup of the whole encoder-decoder model for generating the full sequence, based on TPUv3 benchmarking with 200 examples in Colab (see App. C for details). |
| Software Dependencies | No | The paper mentions using the 'T5x framework [55]' and 'JAX: composable transformations of Python+NumPy programs' but does not specify version numbers for these software dependencies or other key libraries. |
| Experiment Setup | Yes | We use the 8 layers T5 1.1 model that doesn't share input and output embeddings. We share all output embeddings for the softmax predictions, and the early-exit classifier across all decoder layers. Based on validation results, we set the temperature of our decaying threshold to τ = 4 for the softmax and classifier measures of CNN/DM and WMT. In other settings, we use τ = 0. |
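
The split protocol quoted in the Dataset Splits row can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function name `split_trials`, the fixed seed, and the example count of 1000 are hypothetical, and only the 50 random trials and the 80%/20% calibration/test partition come from the quoted text.

```python
import random


def split_trials(n_examples, n_trials=50, cal_fraction=0.8, seed=0):
    """Yield (calibration_indices, test_indices) once per random trial.

    n_trials and cal_fraction follow the quoted setup (50 trials, 80% calibration);
    the seed and the index-based representation are assumptions for illustration.
    """
    rng = random.Random(seed)
    n_cal = int(round(cal_fraction * n_examples))
    for _ in range(n_trials):
        indices = list(range(n_examples))
        rng.shuffle(indices)  # a fresh random permutation for every trial
        yield indices[:n_cal], indices[n_cal:]


# Hypothetical dataset size, used only to show the shape of the output.
trials = list(split_trials(n_examples=1000))
first_cal, first_test = trials[0]
print(len(trials), len(first_cal), len(first_test))
```

In this sketch each trial reshuffles the full index set, so the calibration and test portions are disjoint within a trial but overlap across trials. The quoted text runs such trials separately for each target tolerance δ and each consistency objective.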