Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Mixed Sparsity Training: Achieving 4$\times$ FLOP Reduction for Transformer Pretraining
Authors: Pihe Hu, Shaolong Li, Xun Wang, Longbo Huang
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiment on GPT-2 showcases a FLOP reduction of 4$\times$ without compromising performance. ... We conducted extensive experiments to assess the performance of our MST method against various baseline approaches across a range of zero-shot tasks from Brown et al. (2020) and few-shot tasks from GLUE (Wang et al., 2018). ... The experiment results of zero-shot tasks are summarized in Table 3, where we also present the pretraining FLOPs of different models. ... To gain insights into the contributions of each component in MST, we conducted an ablation study on sparsity variation, topology evolution schema, and hybrid sparse attention from the training pipeline. |
| Researcher Affiliation | Academia | Pihe Hu, Institute for Interdisciplinary Information Sciences, Tsinghua University; Shaolong Li, Central South University; Xun Wang, Institute for Interdisciplinary Information Sciences, Tsinghua University; Longbo Huang, Institute for Interdisciplinary Information Sciences, Tsinghua University |
| Pseudocode | Yes | Algorithm 1 Topology Evolution by MG. |
| Open Source Code | Yes | The code is available at https://github.com/hupihe/Mixed-Sparsity-Training. |
| Open Datasets | Yes | In our performance evaluation, i.e. Table 12 in Section 4.1, we perform zero-shot evaluation of the models on 5 datasets and few-shot evaluation on 2 subtasks of GLUE. ... LAMBADA dataset (Paperno et al., 2016) ... The English Penn Treebank (PTB) corpus (Marcus et al., 1993) ... The WikiText language modeling dataset (Merity et al., 2016) ... The One Billion Words (1BW) dataset (Chelba et al., 2013) ... General Language Understanding Evaluation (GLUE) (Wang et al., 2018) benchmark ... RTE datasets come from a series of annual textual entailment challenges. ... The Microsoft Research Paraphrase Corpus (MRPC) (Dolan & Brockett, 2005). |
| Dataset Splits | Yes | The training set comprises 9 billion tokens, and the validation set contains 4.4 million tokens. ... In the conventional split of the corpus, sections 0 to 18 serve as the training set (comprising 38,219 sentences and 912,344 tokens), sections 19 to 21 function as the validation set (consisting of 5,527 sentences and 131,768 tokens), and sections 22 to 24 serve as the test set (comprising 5,462 sentences and 129,654 tokens). |
| Hardware Specification | Yes | Our experiments are implemented with PyTorch 2.1.1 (Paszke et al., 2017), CUDA 12.1 (Kirk et al., 2007), run on 4 NVIDIA A100 Tensor Core GPUs. ... Training is conducted with bfloat16 precision on machines with 4 A100 GPUs. |
| Software Dependencies | Yes | Our experiments are implemented with PyTorch 2.1.1 (Paszke et al., 2017), CUDA 12.1 (Kirk et al., 2007). |
| Experiment Setup | Yes | We begin by detailing the hyperparameter settings for the dense model in Table 4, sourced from NanoGPT. ... Following this, we provide the private hyperparameter settings specific to MST, RigL, SS, and Tiny in Tables 5, 6, 7, and 8, respectively. Table 4: Public hyperparameters. Table 5: Private Hyperparameters of MST. Table 6: Private Hyperparameters of RigL. Table 7: Private Hyperparameters of SS. Table 8: Private Hyperparameters of Tiny. |
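As a quick sanity check (ours, not from the paper), the Penn Treebank token counts quoted in the "Dataset Splits" row imply roughly a 78/11/11 train/validation/test split:

```python
# Token counts for the conventional PTB split, as quoted in the
# "Dataset Splits" row (sections 0-18 / 19-21 / 22-24).
splits = {"train": 912_344, "valid": 131_768, "test": 129_654}

total = sum(splits.values())
fractions = {name: count / total for name, count in splits.items()}

for name, frac in fractions.items():
    print(f"{name}: {frac:.1%}")
# → train: 77.7%  valid: 11.2%  test: 11.0%
```

This matches the conventional PTB language-modeling split, supporting the "Yes" classification for Dataset Splits.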