Tandem Transformers for Inference-Efficient LLMs
Authors: Aishwarya P S, Pranav Ajit Nair, Yashas Samaga B L, Toby James Boyd, Sanjiv Kumar, Prateek Jain, Praneeth Netrapalli
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive latency evaluation on TPUv5e for both standalone and SPEED versions of Tandem (PaLM2-Bison, PaLM2-Gecko), with PaLM2-Bison and PaLM2-Gecko being the primary ML and secondary MS model, respectively. In particular, on multiple datasets, we observe that Tandem + SPEED with distillation can be at least 2.19× faster than the baseline PaLM2-Bison model while ensuring the same output quality. |
| Researcher Affiliation | Industry | 1Google DeepMind, 2Google Research, New York City. |
| Pseudocode | No | The paper provides mathematical equations and figures describing the model's operation but does not include a dedicated 'Pseudocode' or 'Algorithm' block. |
| Open Source Code | No | The paper does not provide any statement or link regarding the release of open-source code for the described methodology. |
| Open Datasets | Yes | On the PaLM2 pretraining dataset, a Tandem of PaLM2-Bison and PaLM2-Gecko demonstrates a 3.3% improvement in next-token prediction accuracy over a standalone PaLM2-Gecko, offering a 1.16x speedup compared to a PaLM2-Otter model with comparable downstream performance. ... For downstream task evaluation, we compare on SuperGLUE (Wang et al., 2019), TydiQA (Clark et al., 2020), a large collection of generation tasks, which we call Gen-tasks (comprising SQuADv2 (Rajpurkar et al., 2018), Natural Questions (Kwiatkowski et al., 2019), TriviaQA (Joshi et al., 2017), WebQuestions (Berant et al., 2013) and Lambada (Paperno et al., 2016)), MBPP (Austin et al., 2021), and WMT22 (Zerva et al., 2022). |
| Dataset Splits | No | The paper mentions using pretrained checkpoints and continuing pretraining, and for evaluation, it references standard benchmarks and settings from previous work (Anil et al., 2023) for SuperGLUE and Gen-tasks, and uses 1-shot evaluations. However, it does not explicitly state the train/validation/test splits with percentages or sample counts for its own experiments within the paper's text. |
| Hardware Specification | Yes | All the evaluations are performed on TPUv5e (Cloud). |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions, CUDA versions) needed to replicate the experiment. |
| Experiment Setup | Yes | Both the Tandem models, Tandem-CE and Tandem-Distil, are trained with a block length of γ = 2. ... We set τ = 0.8 as the threshold to determine if MS can continue generating more tokens. (A schematic sketch of this drafting/verification loop follows the table.) |
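
To make the quoted setup concrete, below is a minimal Python sketch of a Tandem + SPEED style draft-and-verify loop. The `ms_draft` and `ml_verify` interfaces, the `max_draft` cap, and the exact control flow are illustrative assumptions, not the authors' implementation; `gamma` and `tau` correspond to the block length γ = 2 and threshold τ = 0.8 reported above, and the sketch omits the tandem-specific detail that MS attends to ML's richer representations.

```python
def tandem_speed_decode(ml_verify, ms_draft, prompt_ids, max_new_tokens,
                        gamma=2, tau=0.8, max_draft=8):
    """Sketch of Tandem + SPEED greedy decoding (hypothetical interfaces).

    ms_draft(ids) -> (token, confidence): one greedy step of the small
        model MS, returning its top token and that token's probability.
    ml_verify(ids, k) -> list of k tokens: the large model ML's greedy
        predictions at the last k drafted positions, computed in one
        parallel verification pass.
    """
    ids = list(prompt_ids)
    target_len = len(prompt_ids) + max_new_tokens
    while len(ids) < target_len:
        # Draft phase: MS proposes tokens in blocks of `gamma` and keeps
        # drafting past a block boundary only while its confidence on the
        # latest token stays at or above `tau`.
        draft = []
        while len(draft) < max_draft:
            token, conf = ms_draft(ids + draft)
            draft.append(token)
            if len(draft) % gamma == 0 and conf < tau:
                break
        # Verify phase: ML scores all drafted positions in parallel and
        # accepts the longest prefix matching its own greedy choices.
        targets = ml_verify(ids + draft, len(draft))
        accepted = 0
        for drafted, target in zip(draft, targets):
            if drafted != target:
                break
            accepted += 1
        ids.extend(draft[:accepted])
        # On the first mismatch, fall back to ML's token so the output is
        # identical to what ML alone would have generated greedily.
        if accepted < len(draft):
            ids.append(targets[accepted])
    return ids[:target_len]
```

Because verification is a single parallel pass over the drafted block, each loop iteration costs roughly one ML forward pass but can emit several tokens, which is the source of the latency gains the report cites.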