Flextron: Many-in-One Flexible Large Language Model
Authors: Ruisi Cai, Saurav Muralidharan, Greg Heinrich, Hongxu Yin, Zhangyang Wang, Jan Kautz, Pavlo Molchanov
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate FLEXTRON on the GPT-3 and Llama-2 family of LLMs, and demonstrate superior performance over multiple end-to-end trained variants and other state-of-the-art elastic networks, all with a single pretraining run that consumes a mere 7.63% of the tokens used in the original pretraining. |
| Researcher Affiliation | Collaboration | Ruisi Cai (NVIDIA, The University of Texas at Austin); Saurav Muralidharan, Greg Heinrich, Hongxu Yin, Jan Kautz, Pavlo Molchanov (NVIDIA); Zhangyang Wang (The University of Texas at Austin). |
| Pseudocode | No | The paper does not contain any sections or figures explicitly labeled as 'Pseudocode' or 'Algorithm'. |
| Open Source Code | No | The paper mentions TensorRT-LLM with a GitHub link, but this is a third-party tool used for latency measurement, not the authors' open-source code for FLEXTRON. No explicit statement about releasing FLEXTRON's code was found. |
| Open Datasets | Yes | We perform our evaluation on the GPT-3 and Llama-2 (Touvron et al., 2023) model families. GPT-3 ... trained on 1.1 trillion tokens, where data is obtained from publicly available data sources, comprising 53 languages and code. ... We further validate our approach using the Llama2-7B model (Touvron et al., 2023)... We additionally compare our method with representative open-source model families, including Pythia (Biderman et al., 2023), OpenLLaMA (Geng & Liu, 2023)... Foundation, W. Wikimedia downloads. URL: https://dumps.wikimedia.org. |
| Dataset Splits | No | The paper mentions using a 'validation loss' in Appendix A and 'validation step' in Section 4.1, implying a validation set was used, but it does not specify the explicit train/validation/test dataset splits by percentage or count in the main text or experimental settings. |
| Hardware Specification | Yes | All results are tested on the NVIDIA A100 80GB GPU, with latency measured when the prompting length and generation length are set to 8 and 512, respectively. We use a batch size of 1. (A rough timing sketch of this protocol follows the table.) |
| Software Dependencies | No | The paper mentions the 'NeMo framework' and 'TensorRT-LLM' but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | As described in Section 3.1, during elastic network pretraining, we first perform importance sorting of each head/neuron in MHA/MLP layers using a tiny fraction (512 samples) of the full training set. We then train the sorted and permuted elastic model. We use a batch size of 256 and tune the model for 80,000 steps. At each step, we randomly construct 3 sub-models together with the full model and perform gradient accumulation over all 4 models for a single update (see the training-step sketch after the table). We perform lightweight tuning for automatic network selection: we freeze the backbone parameters and only tune the routers and surrogate models for 1,000 steps using a batch size of 256. |
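
The latency protocol quoted in the Hardware Specification row (A100 80GB, prompt length 8, 512 generated tokens, batch size 1) can be approximated with a short timing script. The paper measures latency with TensorRT-LLM; the sketch below is only a rough Hugging Face Transformers stand-in under assumed settings (greedy decoding, a placeholder Llama-2 checkpoint, random prompt token IDs), so absolute numbers will not match the paper's.

```python
# Rough approximation of the quoted latency protocol: batch size 1, an
# 8-token prompt, 512 generated tokens, on a single A100 80GB GPU. The paper
# uses TensorRT-LLM; this Hugging Face Transformers version is illustrative only.
import time
import torch
from transformers import AutoModelForCausalLM

MODEL_NAME = "meta-llama/Llama-2-7b-hf"  # placeholder checkpoint

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16
).cuda().eval()

# Random token IDs are sufficient for a latency measurement.
prompt_ids = torch.randint(0, model.config.vocab_size, (1, 8), device="cuda")

torch.cuda.synchronize()
start = time.perf_counter()
model.generate(prompt_ids, max_new_tokens=512, min_new_tokens=512, do_sample=False)
torch.cuda.synchronize()
print(f"End-to-end latency: {time.perf_counter() - start:.2f} s")
```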
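
The Experiment Setup row describes the core elastic-pretraining update: at every step, three randomly sampled sub-models plus the full model share one gradient-accumulated optimizer update. The sketch below is a minimal PyTorch illustration of that loop, not the authors' NeMo implementation; `set_active_config` and `sample_random_config` are assumed (hypothetical) hooks for switching the active elastic sub-network.

```python
# Minimal sketch of one Flextron-style elastic training step (not the
# authors' code). Assumes a hypothetical elastic model exposing
# `set_active_config(cfg)` to activate a sub-network, plus a hypothetical
# `sample_random_config()` that draws a random width configuration.
import torch


def elastic_training_step(elastic_model, optimizer, batch,
                          full_config, sample_random_config,
                          num_random_submodels=3):
    """3 random sub-models + the full model, one accumulated update."""
    optimizer.zero_grad()
    configs = [sample_random_config() for _ in range(num_random_submodels)]
    configs.append(full_config)  # always include the full model

    for cfg in configs:
        elastic_model.set_active_config(cfg)   # hypothetical hook
        loss = elastic_model(**batch).loss     # standard LM loss on the batch
        (loss / len(configs)).backward()       # accumulate gradients

    optimizer.step()  # single parameter update for all 4 forward/backward passes
```

The later router/surrogate tuning stage quoted in the same row (1,000 steps with the backbone frozen) would reuse the same loop, with only the router and surrogate parameters passed to the optimizer.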