Flextron: Many-in-One Flexible Large Language Model

Authors: Ruisi Cai, Saurav Muralidharan, Greg Heinrich, Hongxu Yin, Zhangyang Wang, Jan Kautz, Pavlo Molchanov

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate FLEXTRON on the GPT-3 and Llama-2 families of LLMs, and demonstrate superior performance over multiple end-to-end trained variants and other state-of-the-art elastic networks, all with a single pretraining run that consumes a mere 7.63% of the tokens used in the original pretraining.
Researcher Affiliation | Collaboration | Ruisi Cai (NVIDIA; The University of Texas at Austin), Saurav Muralidharan (NVIDIA), Greg Heinrich (NVIDIA), Hongxu Yin (NVIDIA), Zhangyang Wang (The University of Texas at Austin), Jan Kautz (NVIDIA), Pavlo Molchanov (NVIDIA).
Pseudocode | No | The paper does not contain any sections or figures explicitly labeled as 'Pseudocode' or 'Algorithm'.
Open Source Code | No | The paper mentions TensorRT-LLM with a GitHub link, but this is a third-party tool used for measurement, not the authors' open-source code for FLEXTRON. No explicit statement about making FLEXTRON's code available was found.
Open Datasets | Yes | We perform our evaluation on the GPT-3 and Llama-2 (Touvron et al., 2023) model families. GPT-3 ... trained on 1.1 trillion tokens, where data is obtained from publicly available data sources, comprising 53 languages and code. ... We further validate our approach using the Llama2-7B model (Touvron et al., 2023)... We additionally compare our method with representative open-source model families, including Pythia (Biderman et al., 2023) and OpenLLaMA (Geng & Liu, 2023)... Foundation, W. Wikimedia downloads. URL https://dumps.wikimedia.org.
Dataset Splits | No | The paper mentions a 'validation loss' in Appendix A and a 'validation step' in Section 4.1, implying that a validation set was used, but it does not specify explicit train/validation/test splits by percentage or count in the main text or experimental settings.
Hardware Specification | Yes | All results are tested on an NVIDIA A100 80GB GPU, with latency measured when the prompt length and generation length are set to 8 and 512, respectively. We use a batch size of 1. (A minimal timing sketch of this protocol appears after the table.)
Software Dependencies | No | The paper mentions the NeMo framework and TensorRT-LLM but does not provide specific version numbers for these or any other software dependencies.
Experiment Setup | Yes | As described in Section 3.1, during elastic network pretraining, we first perform importance sorting of each head/neuron in the MHA/MLP layers using a tiny fraction (512 samples) of the full training set. We then train the sorted and permuted elastic model, using a batch size of 256 and tuning the model for 80,000 steps. At each step, we randomly construct 3 sub-models together with the full model and perform gradient accumulation over all 4 models for a single update. For automatic network selection, we perform lightweight tuning: we freeze the backbone parameters and tune only the routers and surrogate models for 1,000 steps with a batch size of 256.
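The elastic-pretraining step quoted in the Experiment Setup row (sample 3 random sub-models plus the full model on each batch, then accumulate their gradients into one update) maps onto a short training-loop skeleton. The sketch below is an illustrative PyTorch-style outline, not the Flextron implementation: `sample_submodel_config`, the `elastic_config` forward argument, the candidate widths, and the example dimensions are all assumptions standing in for the paper's importance-sorted elastic MHA/MLP selection.

```python
# Illustrative skeleton of the quoted elastic-pretraining step: at each
# optimizer step, 3 randomly sampled sub-models plus the full model run on the
# same batch and their gradients are accumulated into a single update.
# `sample_submodel_config` and the `elastic_config` forward argument are
# hypothetical stand-ins for Flextron's importance-sorted head/neuron
# selection; this is not the authors' code.
import random


def sample_submodel_config(num_heads: int, mlp_dim: int) -> dict:
    """Pick a random attention-head count and MLP width from an assumed grid."""
    return {
        "heads": random.choice([num_heads // 4, num_heads // 2, 3 * num_heads // 4]),
        "mlp": random.choice([mlp_dim // 4, mlp_dim // 2, 3 * mlp_dim // 4]),
    }


def elastic_train_step(model, batch, optimizer, num_heads=32, mlp_dim=11008):
    # Dimensions are roughly Llama-2-7B-like and purely illustrative.
    optimizer.zero_grad()
    configs = [sample_submodel_config(num_heads, mlp_dim) for _ in range(3)]
    configs.append({"heads": num_heads, "mlp": mlp_dim})  # plus the full model

    total_loss = 0.0
    for cfg in configs:
        # Hypothetical interface: the elastic model keeps only the first
        # cfg["heads"] heads and cfg["mlp"] neurons (importance-sorted) per layer.
        loss = model(batch, elastic_config=cfg).loss
        (loss / len(configs)).backward()  # accumulate gradients over all 4 runs
        total_loss += loss.item()

    optimizer.step()  # single parameter update for the accumulated gradients
    return total_loss / len(configs)
```

Dividing each loss by the number of sampled configurations keeps the accumulated gradient on the scale of a single-model step; the paper does not state how the per-model losses are weighted, so this normalization is an assumption.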
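The latency protocol quoted in the Hardware Specification row (batch size 1, 8-token prompt, 512 generated tokens on an A100 80GB) can be approximated with a generic timing loop. The sketch below uses Hugging Face Transformers rather than the TensorRT-LLM pipeline the paper relies on, and the checkpoint name is a placeholder; it illustrates the measurement setup and is not the authors' benchmarking code.

```python
# Minimal sketch of the quoted latency protocol (batch size 1, 8-token prompt,
# 512 generated tokens). This approximates, but is NOT, the paper's
# TensorRT-LLM measurement pipeline; the model name is a placeholder.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="cuda"
).eval()

# Batch size 1, prompt of exactly 8 tokens (random ids are fine for timing).
prompt_ids = torch.randint(0, tokenizer.vocab_size, (1, 8), device="cuda")

with torch.inference_mode():
    # Warm-up run so CUDA kernels and cache allocations are initialized.
    model.generate(prompt_ids, max_new_tokens=512, min_new_tokens=512, do_sample=False)
    torch.cuda.synchronize()
    start = time.perf_counter()
    model.generate(prompt_ids, max_new_tokens=512, min_new_tokens=512, do_sample=False)
    torch.cuda.synchronize()
    latency_s = time.perf_counter() - start

print(f"End-to-end latency for 512 generated tokens: {latency_s:.2f} s")
```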