Flextron: Many-in-One Flexible Large Language Model

Authors: Ruisi Cai, Saurav Muralidharan, Greg Heinrich, Hongxu Yin, Zhangyang Wang, Jan Kautz, Pavlo Molchanov

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate FLEXTRON on the GPT-3 and Llama-2 families of LLMs, and demonstrate superior performance over multiple end-to-end trained variants and other state-of-the-art elastic networks, all with a single pretraining run that consumes a mere 7.63% of the tokens used in the original pretraining.
Researcher Affiliation | Collaboration | Ruisi Cai (NVIDIA; The University of Texas at Austin), Saurav Muralidharan (NVIDIA), Greg Heinrich (NVIDIA), Hongxu Yin (NVIDIA), Zhangyang Wang (The University of Texas at Austin), Jan Kautz (NVIDIA), Pavlo Molchanov (NVIDIA).
Pseudocode | No | The paper does not contain any sections or figures explicitly labeled as 'Pseudocode' or 'Algorithm'.
Open Source Code | No | The paper mentions TensorRT-LLM with a GitHub link, but this is a third-party tool used for measurement, not the authors' open-source code for FLEXTRON. No explicit statement about making FLEXTRON's code available was found.
Open Datasets | Yes | We perform our evaluation on the GPT-3 and Llama-2 (Touvron et al., 2023) model families. GPT-3 ... trained on 1.1 trillion tokens, where data is obtained from publicly available data sources, comprising 53 languages and code. ... We further validate our approach using the Llama2-7B model (Touvron et al., 2023)... We additionally compare our method with representative open-source model families, including Pythia (Biderman et al., 2023) and OpenLLaMA (Geng & Liu, 2023)... Foundation, W. Wikimedia downloads. URL https://dumps.wikimedia.org.
Dataset Splits | No | The paper mentions a 'validation loss' in Appendix A and a 'validation step' in Section 4.1, implying that a validation set was used, but it does not specify explicit train/validation/test splits by percentage or count in the main text or experimental settings.
Hardware Specification | Yes | All results are tested on an NVIDIA A100 80GB GPU, with latency measured when the prompt length and generation length are set to 8 and 512, respectively. We use a batch size of 1. (A minimal timing sketch of this protocol appears after the table.)
Software Dependencies | No | The paper mentions the NeMo framework and TensorRT-LLM but does not provide specific version numbers for these or any other software dependencies.
Experiment Setup | Yes | As described in Section 3.1, during elastic network pretraining, we first perform importance sorting of each head/neuron in the MHA/MLP layers using a tiny fraction (512 samples) of the full training set. We then train the sorted and permuted elastic model, using a batch size of 256 and tuning the model for 80,000 steps. At each step, we randomly construct 3 sub-models together with the full model and perform gradient accumulation over all 4 models for a single update. For automatic network selection, we perform lightweight tuning: we freeze the backbone parameters and tune only the routers and surrogate models for 1,000 steps with a batch size of 256.
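The elastic-pretraining step quoted in the Experiment Setup row (sample 3 random sub-models plus the full model on each batch, then accumulate their gradients into one update) maps onto a short training-loop skeleton. The sketch below is an illustrative PyTorch-style outline, not the Flextron implementation: `sample_submodel_config`, the `elastic_config` forward argument, the candidate widths, and the example dimensions are all assumptions standing in for the paper's importance-sorted elastic MHA/MLP selection.

```python
# Illustrative skeleton of the quoted elastic-pretraining step: at each
# optimizer step, 3 randomly sampled sub-models plus the full model run on the
# same batch and their gradients are accumulated into a single update.
# `sample_submodel_config` and the `elastic_config` forward argument are
# hypothetical stand-ins for Flextron's importance-sorted head/neuron
# selection; this is not the authors' code.
import random


def sample_submodel_config(num_heads: int, mlp_dim: int) -> dict:
    """Pick a random attention-head count and MLP width from an assumed grid."""
    return {
        "heads": random.choice([num_heads // 4, num_heads // 2, 3 * num_heads // 4]),
        "mlp": random.choice([mlp_dim // 4, mlp_dim // 2, 3 * mlp_dim // 4]),
    }


def elastic_train_step(model, batch, optimizer, num_heads=32, mlp_dim=11008):
    # Dimensions are roughly Llama-2-7B-like and purely illustrative.
    optimizer.zero_grad()
    configs = [sample_submodel_config(num_heads, mlp_dim) for _ in range(3)]
    configs.append({"heads": num_heads, "mlp": mlp_dim})  # plus the full model

    total_loss = 0.0
    for cfg in configs:
        # Hypothetical interface: the elastic model keeps only the first
        # cfg["heads"] heads and cfg["mlp"] neurons (importance-sorted) per layer.
        loss = model(batch, elastic_config=cfg).loss
        (loss / len(configs)).backward()  # accumulate gradients over all 4 runs
        total_loss += loss.item()

    optimizer.step()  # single parameter update for the accumulated gradients
    return total_loss / len(configs)
```

Dividing each loss by the number of sampled configurations keeps the accumulated gradient on the scale of a single-model step; the paper does not state how the per-model losses are weighted, so this normalization is an assumption.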
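The latency protocol quoted in the Hardware Specification row (batch size 1, 8-token prompt, 512 generated tokens on an A100 80GB) can be approximated with a generic timing loop. The sketch below uses Hugging Face Transformers rather than the TensorRT-LLM pipeline the paper relies on, and the checkpoint name is a placeholder; it illustrates the measurement setup and is not the authors' benchmarking code.

```python
# Minimal sketch of the quoted latency protocol (batch size 1, 8-token prompt,
# 512 generated tokens). This approximates, but is NOT, the paper's
# TensorRT-LLM measurement pipeline; the model name is a placeholder.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="cuda"
).eval()

# Batch size 1, prompt of exactly 8 tokens (random ids are fine for timing).
prompt_ids = torch.randint(0, tokenizer.vocab_size, (1, 8), device="cuda")

with torch.inference_mode():
    # Warm-up run so CUDA kernels and cache allocations are initialized.
    model.generate(prompt_ids, max_new_tokens=512, min_new_tokens=512, do_sample=False)
    torch.cuda.synchronize()
    start = time.perf_counter()
    model.generate(prompt_ids, max_new_tokens=512, min_new_tokens=512, do_sample=False)
    torch.cuda.synchronize()
    latency_s = time.perf_counter() - start

print(f"End-to-end latency for 512 generated tokens: {latency_s:.2f} s")
```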