Accelerating Transformer Pre-training with 2:4 Sparsity

Authors: Yuezhou Hu, Kang Zhao, Weiyu Huang, Jianfei Chen, Jun Zhu

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments show that our 2:4 sparse training algorithm achieves convergence similar to that of dense training on several transformer pre-training tasks, while actual acceleration is observed on transformer blocks of different shapes.
Researcher Affiliation | Academia | Dept. of Comp. Sci. & Tech., Institute for AI, BNRist Center, Tsinghua-Bosch Joint ML Center, THBI Lab, Tsinghua University. Correspondence to: Jianfei Chen <jianfeic@tsinghua.edu.cn>.
Pseudocode | Yes | Algorithm 1 (transposable mask search). Input: mask pattern m, weight matrix W. 1. W = abs(W); 2. out = conv2d(W, m, stride=4, padding=0); 3. index = argmax(out, dim=2); return index. (A runnable PyTorch sketch of this search is given after the table.)
Open Source Code | Yes | Our toolkit is available at https://github.com/huyz2023/2by4-pretrain.
Open Datasets | Yes | For BERT, we use Cramming (Geiping & Goldstein, 2022) to pre-train a 16-layer BERT model with a sequence length of 512 on the C4 dataset (Raffel et al., 2019). For GPT-2, we use nanoGPT (Karpathy, 2023) to pre-train GPT-2 124M, 355M, 774M, and 1.5B on OpenWebText (Gokaslan & Cohen, 2019). Both BERT and GPT-2 models are evaluated on GLUE (Wang et al., 2018). For DeiT (Touvron et al., 2021a), we pre-train DeiT-tiny on the ImageNet-1K dataset (Deng et al., 2009). Besides, we use fairseq (Ott et al., 2019) to train Transformer-base on the WMT 14 En-De dataset (Bojar et al., 2014) and measure the BLEU (Papineni et al., 2002) score of the trained model.
Dataset Splits | No | The paper reports validation loss and metrics (e.g., VAL LOSS in Tables 1, 6, and 9), implying the use of validation sets, but it does not explicitly state the dataset split percentages, sample counts, or the methodology used to generate these splits for reproducibility.
Hardware Specification | Yes | The training acceleration techniques proposed in Section 5 are evaluated using GPT-2 models and RTX3090 GPUs.
Software Dependencies | No | For 2:4-spMMs, we use CUTLASS (Thakkar et al., 2023). Other GPU kernels are implemented in Triton, including the transposable mask search kernel, pruning kernel, MVUE kernel, GEGLU kernel, and masked decay kernel. (A plain-PyTorch reference for the basic 2:4 pruning step follows the table.)
Experiment Setup | Yes | For BERT, we use Cramming (Geiping & Goldstein, 2022) to pre-train a 16-layer BERT model with the sequence length of 512 on the C4 dataset (Raffel et al., 2019). ... FP16 mixed precision training is used on all models. ... We usually take l = 40 in practice. ... determine that our dense fine-tuning takes up the last 1/6 of total steps.
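
The conv2d trick quoted in the Pseudocode row can be prototyped in a few lines of PyTorch: treating |W| as a single-channel image and the candidate transposable 4x4 masks as convolution filters, a stride-4 convolution scores how much weight magnitude each candidate mask would preserve in every 4x4 block. The sketch below is a minimal illustration under that reading; the function name, the `patterns` tensor, and the argmax over the pattern dimension are assumptions about how the quoted steps fit together, not the authors' Triton kernel.

```python
import torch
import torch.nn.functional as F

def transposable_mask_search(W: torch.Tensor, patterns: torch.Tensor) -> torch.Tensor:
    """Pick, for every non-overlapping 4x4 block of W, the candidate
    transposable mask that preserves the largest total magnitude.

    W:        (H, C) weight matrix with H and C divisible by 4.
    patterns: (P, 1, 4, 4) binary tensor of candidate transposable masks
              (each pattern is 2:4 sparse along both its rows and columns).
    Returns:  (H // 4, C // 4) tensor of winning pattern indices.
    """
    w_abs = W.abs().unsqueeze(0).unsqueeze(0)                 # (1, 1, H, C)
    # A stride-4 convolution with a 0/1 pattern sums the |w| values that the
    # pattern would keep in each 4x4 block; one output channel per pattern.
    scores = F.conv2d(w_abs, patterns.to(W.dtype), stride=4)  # (1, P, H/4, C/4)
    return scores.argmax(dim=1).squeeze(0)                    # best pattern per block
```

Gathering the winning pattern for each block then yields a mask under which both W and its transpose stay 2:4 sparse, which is what allows the backward pass to use sparse matrix multiplications as well.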
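
The Software Dependencies row names Triton kernels (pruning, MVUE, GEGLU, masked decay) without listing versions or code. As a reference point only, the basic 2:4 magnitude-pruning step those kernels accelerate can be written in plain PyTorch. This is a hedged sketch of vanilla per-group pruning (the name `prune_2to4` is ours), not the authors' kernel, and it omits the transposable constraint, the MVUE estimator, and masked decay:

```python
import torch

def prune_2to4(W: torch.Tensor) -> torch.Tensor:
    """Keep the 2 largest-magnitude weights in every group of 4 consecutive
    elements along the last dimension and zero out the other 2."""
    rows, cols = W.shape
    assert cols % 4 == 0, "2:4 sparsity groups the columns in blocks of 4"
    groups = W.reshape(rows, cols // 4, 4)
    keep = groups.abs().topk(2, dim=-1).indices               # 2 survivors per group
    mask = torch.zeros_like(groups).scatter_(-1, keep, 1.0)   # 0/1 mask per group
    return (groups * mask).reshape(rows, cols)

# Example: every 4-element group of the result has exactly 2 nonzeros,
# the layout that 2:4 sparse tensor cores (and CUTLASS 2:4-spMM) require.
W = torch.randn(8, 16)
W_sparse = prune_2to4(W)
```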