Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Mixed Sparsity Training: Achieving 4$\times$ FLOP Reduction for Transformer Pretraining

Authors: Pihe Hu, Shaolong Li, Xun Wang, Longbo Huang

TMLR 2025

Reproducibility variables, their classified results, and the supporting LLM responses:

Research Type: Experimental
  LLM Response: "Our experiment on GPT-2 showcases a FLOP reduction of 4× without compromising performance. ... We conducted extensive experiments to assess the performance of our MST method against various baseline approaches across a range of zero-shot tasks from Brown et al. (2020) and few-shot tasks from GLUE (Wang et al., 2018). ... The experiment results of zero-shot tasks are summarized in Table 3, where we also present the pretraining FLOPs of different models. ... To gain insights into the contributions of each component in MST, we conducted an ablation study on sparsity variation, topology evolution schema, and hybrid sparse attention from the training pipeline."

Researcher Affiliation: Academia
  LLM Response: "Pihe Hu EMAIL Institute for Interdisciplinary Information Sciences, Tsinghua University; Shaolong Li EMAIL Central South University; Xun Wang EMAIL Institute for Interdisciplinary Information Sciences, Tsinghua University; Longbo Huang EMAIL Institute for Interdisciplinary Information Sciences, Tsinghua University"

Pseudocode: Yes
  LLM Response: "Algorithm 1 Topology Evolution by MG."

Open Source Code: Yes
  LLM Response: "The code is available at https://github.com/hupihe/Mixed-Sparsity-Training."

Open Datasets: Yes
  LLM Response: "In our performance evaluation, i.e. Table 12 in Section 4.1, we perform zero-shot evaluation of the models on 5 datasets and few-shot evaluation on 2 subtasks of GLUE. ... LAMBADA dataset (Paperno et al., 2016) ... The English Penn Treebank (PTB) corpus (Marcus et al., 1993) ... The WikiText language modeling dataset (Merity et al., 2016) ... The One Billion Words (1BW) dataset (Chelba et al., 2013) ... General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2018) ... The RTE datasets come from a series of annual textual entailment challenges. ... The Microsoft Research Paraphrase Corpus (MRPC) (Dolan & Brockett, 2005)."

Dataset Splits: Yes
  LLM Response: "The training set comprises 9 billion tokens, and the validation set contains 4.4 million tokens. ... In the conventional split of the corpus, sections 0 to 18 serve as the training set (comprising 38,219 sentences and 912,344 tokens), sections 19 to 21 function as the validation set (consisting of 5,527 sentences and 131,768 tokens), and sections 22 to 24 serve as the test set (comprising 5,462 sentences and 129,654 tokens)."

Hardware Specification: Yes
  LLM Response: "Our experiments are implemented with PyTorch 2.1.1 (Paszke et al., 2017) and CUDA 12.1 (Kirk et al., 2007), and run on 4 NVIDIA A100 Tensor Core GPUs. ... Training is conducted with bfloat16 precision on machines with 4 A100 GPUs."

Software Dependencies: Yes
  LLM Response: "Our experiments are implemented with PyTorch 2.1.1 (Paszke et al., 2017), CUDA 12.1 (Kirk et al., 2007)."

Experiment Setup: Yes
  LLM Response: "We begin by detailing the hyperparameter settings for the dense model in Table 4, sourced from NanoGPT. ... Following this, we provide the private hyperparameter settings specific to MST, RigL, SS, and Tiny in Tables 5, 6, 7, and 8, respectively. Table 4: Public hyperparameters. Table 5: Private Hyperparameters of MST. Table 6: Private Hyperparameters of RigL. Table 7: Private Hyperparameters of SS. Table 8: Private Hyperparameters of Tiny."
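The conventional Penn Treebank split quoted in the Dataset Splits excerpt (WSJ sections 0–18 train, 19–21 validation, 22–24 test) can be sketched as a small helper. This is an illustrative sketch only: the function name `ptb_split` and the `PTB_COUNTS` dictionary are ours, with sentence and token counts copied from the excerpt.

```python
def ptb_split(section: int) -> str:
    """Map a WSJ section number to its conventional Penn Treebank split,
    per the boundaries quoted in the Dataset Splits excerpt."""
    if 0 <= section <= 18:
        return "train"
    if 19 <= section <= 21:
        return "validation"
    if 22 <= section <= 24:
        return "test"
    raise ValueError(f"WSJ section out of range: {section}")

# Sentence and token counts per split, copied verbatim from the excerpt.
PTB_COUNTS = {
    "train":      {"sentences": 38_219, "tokens": 912_344},
    "validation": {"sentences": 5_527,  "tokens": 131_768},
    "test":       {"sentences": 5_462,  "tokens": 129_654},
}

# Totals across the three splits, derived from the quoted figures.
total_tokens = sum(c["tokens"] for c in PTB_COUNTS.values())
```

Summing the quoted per-split figures gives 1,173,766 tokens across the 25 WSJ sections.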