Tandem Transformers for Inference Efficient LLMs

Authors: Aishwarya P S, Pranav Ajit Nair, Yashas Samaga B L, Toby James Boyd, Sanjiv Kumar, Prateek Jain, Praneeth Netrapalli

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct extensive latency evaluations on TPUv5e for both the standalone and SPEED versions of Tandem (PaLM2-Bison, PaLM2-Gecko), with PaLM2-Bison and PaLM2-Gecko serving as the primary large model (ML) and secondary small model (MS), respectively. In particular, on multiple datasets, we observe that Tandem + SPEED with distillation can be at least 2.19x faster than the baseline PaLM2-Bison model while ensuring the same output quality. (A schematic sketch of this decode loop appears after this table.)
Researcher Affiliation | Industry | 1: Google DeepMind; 2: Google Research, New York City.
Pseudocode | No | The paper provides mathematical equations and figures describing the model's operation but does not include a dedicated 'Pseudocode' or 'Algorithm' block.
Open Source Code | No | The paper does not provide any statement or link regarding the release of open-source code for the described methodology.
Open Datasets | Yes | On the PaLM2 pretraining dataset, a Tandem of PaLM2-Bison and PaLM2-Gecko demonstrates a 3.3% improvement in next-token prediction accuracy over a standalone PaLM2-Gecko, offering a 1.16x speedup compared to a PaLM2-Otter model with comparable downstream performance. ... For downstream task evaluation, we compare on SuperGLUE (Wang et al., 2019), TydiQA (Clark et al., 2020), a large collection of generation tasks, which we call Gen-tasks (comprising SQuADv2 (Rajpurkar et al., 2018), Natural Questions (Kwiatkowski et al., 2019), TriviaQA (Joshi et al., 2017), WebQuestions (Berant et al., 2013), and LAMBADA (Paperno et al., 2016)), MBPP (Austin et al., 2021), and WMT22 (Zerva et al., 2022).
Dataset Splits | No | The paper mentions using pretrained checkpoints and continuing pretraining; for evaluation, it follows the benchmarks and settings of prior work (Anil et al., 2023) for SuperGLUE and Gen-tasks and uses 1-shot evaluation. However, it does not explicitly state train/validation/test splits (as percentages or sample counts) for its own experiments.
Hardware Specification | Yes | All the evaluations are performed on TPUv5e (Cloud).
Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow, or CUDA versions) needed to replicate the experiments.
Experiment Setup | Yes | Both Tandem models, Tandem-CE and Tandem-Distil, are trained with a block length of γ = 2. ... We set τ = 0.8 as the threshold to determine if MS can continue generating more tokens. (The sketches below illustrate how γ and τ enter the inference loop.)
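
Since the paper describes its mechanism only through equations and figures (see the 'Pseudocode' row above), the following is a minimal, hypothetical sketch of Tandem's block-wise inference as described in the paper: the primary model ML runs one forward pass per block of γ tokens to produce representations of the prefix, and the cheaper secondary model MS generates the tokens inside each block conditioned on those representations. The function names (`ml_encode`, `ms_next_token`) and the deterministic toy internals are illustrative assumptions, not the authors' (unreleased) implementation.

```python
import numpy as np

VOCAB = 32  # toy vocabulary; the real models use the PaLM2 vocabulary


def _det_rng(*key):
    """Deterministic RNG keyed on its inputs (toy stand-in for a model)."""
    return np.random.default_rng(abs(hash(key)) % (2**32))


def ml_encode(tokens):
    """Toy stand-in for the primary model ML (PaLM2-Bison role):
    one expensive forward pass producing a representation of the prefix."""
    return _det_rng("ML", tuple(tokens)).normal(size=16)


def ms_next_token(ml_rep, block):
    """Toy stand-in for the secondary model MS (PaLM2-Gecko role):
    predicts the next token from ML's prefix representation plus the
    tokens MS has already generated inside the current block."""
    logits = _det_rng("MS", ml_rep.tobytes(), tuple(block)).normal(size=VOCAB)
    return int(logits.argmax())


def tandem_generate(prompt, steps, gamma=2):
    """Block-wise Tandem inference: ML refreshes its representations only
    once every `gamma` tokens; MS fills in the tokens within each block."""
    tokens = list(prompt)
    ml_rep = ml_encode(tokens)  # ML pass over the prompt
    block = []                  # tokens generated since the last ML pass
    for _ in range(steps):
        tok = ms_next_token(ml_rep, block)
        tokens.append(tok)
        block.append(tok)
        if len(block) == gamma:          # block boundary reached:
            ml_rep = ml_encode(tokens)   # one ML pass per gamma tokens
            block = []
    return tokens


print(tandem_generate(prompt=[1, 2, 3], steps=8))
```

The point of the structure is the cost profile: the expensive ML forward pass amortizes over γ generated tokens, while MS handles every decoding step.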
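The 'Experiment Setup' row quotes two inference hyperparameters: the block length γ = 2 and the confidence threshold τ = 0.8 that decides whether MS may keep generating. Below is a minimal sketch of how a SPEED-style speculative loop could plausibly use them: MS drafts up to γ tokens, stopping early when its top-token probability falls below τ, and ML then verifies the draft. The greedy accept/reject rule, the toy distributions, and all names here are simplifying assumptions; the paper's actual verification procedure and code are not public.

```python
import numpy as np

VOCAB = 32  # toy vocabulary size


def _toy_probs(salt, prefix):
    """Deterministic toy next-token distribution; MS and ML share a common
    component so the drafter usually agrees with the verifier."""
    base = np.random.default_rng(abs(hash(("base", tuple(prefix)))) % 2**32)
    own = np.random.default_rng(abs(hash((salt, tuple(prefix)))) % 2**32)
    logits = 2.0 * base.normal(size=VOCAB) + 0.5 * own.normal(size=VOCAB)
    e = np.exp(logits - logits.max())
    return e / e.sum()


def ms_probs(prefix):  # cheap drafter (PaLM2-Gecko / Tandem role)
    return _toy_probs("MS", prefix)


def ml_probs(prefix):  # expensive verifier (PaLM2-Bison role)
    return _toy_probs("ML", prefix)


def speed_decode(prompt, max_new=16, gamma=2, tau=0.8):
    """SPEED-style loop: MS drafts up to `gamma` tokens, stopping early if
    its top-token probability falls below `tau`; ML then checks the draft
    (here with a simple greedy match) and keeps the agreeing prefix."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new:
        # Drafting phase: the secondary model proposes a short block.
        draft = []
        while len(draft) < gamma:
            p = ms_probs(tokens + draft)
            tok = int(p.argmax())
            draft.append(tok)
            if p[tok] < tau:  # low confidence: hand off to the verifier
                break
        # Verification phase: the primary model scores the draft. A real
        # implementation scores all draft positions in ONE batched ML
        # forward pass, which is where the latency win comes from.
        for i, tok in enumerate(draft):
            ml_tok = int(ml_probs(tokens + draft[:i]).argmax())
            if ml_tok != tok:
                tokens.append(ml_tok)  # take ML's correction, drop the rest
                break
            tokens.append(tok)
    return tokens


print(speed_decode(prompt=[1, 2, 3]))
```

With γ = 2 the drafter risks at most two tokens per verification round, and τ = 0.8 cuts a draft short whenever MS is unsure, trading a little drafting throughput for a higher acceptance rate.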