Parameter-Efficient Sparsity for Large Language Models Fine-Tuning

Authors: Yuchao Li, Fuli Luo, Chuanqi Tan, Mengdi Wang, Songfang Huang, Shen Li, Junjie Bai

IJCAI 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments with diverse networks (i.e., BERT, RoBERTa and GPT-2) on dozens of datasets demonstrate PST performs on par or better than previous sparsity methods, despite only training a small number of parameters.
Researcher Affiliation | Industry | Alibaba Group; {laiyin.lyc, lfl259702, chuanqi.tcq, didou.wmd, songfang.hsf, litan.ls, j.bai}@alibaba-inc.com
Pseudocode | No | The paper describes the proposed method mathematically and conceptually but does not provide any pseudocode or a clearly labeled algorithm block.
Open Source Code | Yes | Our code is available at https://github.com/alibaba/AliceMind/tree/main/S4/PST and https://github.com/yuchaoli/PST.
Open Datasets | Yes | For BERT and RoBERTa, we use GLUE benchmarks [Wang et al., 2018] for evaluation. For GPT-2, we evaluate it on E2E, DART, and WebNLG. (A sketch of loading one GLUE task appears below the table.)
Dataset Splits | Yes | For BERT and RoBERTa, we use GLUE benchmarks [Wang et al., 2018] for evaluation; the GLUE tasks ship with standard train/validation/test splits.
Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU model, CPU type) used for running the experiments.
Software Dependencies | No | The paper mentions using the "AdamW optimizer and a linear learning rate scheduler" but does not specify software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions). (A hedged optimizer/scheduler sketch appears below the table.)
Experiment Setup | Yes | For BERT-base, we set batch size = 32 and perform a hyperparameter search over learning rate {3e-5, 5e-5, 1e-4, 5e-4} and epochs {20, 40} on QNLI, SST-2, CoLA, STS-B, MRPC, and RTE, and epochs {10, 20} on MNLI and QQP. For RoBERTa, we use a batch size of 16 and a hyperparameter search over learning rate {1e-5, 2e-5, 3e-5, 5e-5}; the epoch search space is the same as for BERT-base. For GPT-2, we train the model for 5 epochs using a batch size of 8 and an initial learning rate of 1e-4. At training time, we use the AdamW optimizer and a linear learning rate scheduler. All models are initialized with the pre-trained weights. We follow [Zhu and Gupta, 2018] and use cubic sparsity scheduling. We also add a few steps of warm-up at the beginning of training (10% of training steps) and cool-down at the end of training (30% of training steps), which empirically improves performance, especially in high-sparsity regimes. For PST, we set β = α1 = α2 = 1 and r1 = r2 = 8. (A hedged sketch of this sparsity schedule appears below the table.)
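
The GLUE tasks cited in the Open Datasets row are publicly distributed. Below is a minimal sketch of fetching one of them with the Hugging Face datasets library; the choice of library is an assumption, since the paper does not name its data tooling, and "sst2" is just one of the GLUE configurations the paper evaluates on.

# Minimal sketch: load one GLUE task used in the paper.
# The Hugging Face `datasets` library is an assumed choice; the paper
# does not state how the data were obtained.
from datasets import load_dataset

# Other GLUE configurations used in the paper: "mnli", "qqp", "qnli",
# "cola", "stsb", "mrpc", "rte".
sst2 = load_dataset("glue", "sst2")

print(sst2)              # DatasetDict with train / validation / test splits
print(sst2["train"][0])  # e.g. {'sentence': ..., 'label': ..., 'idx': ...}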
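
The Software Dependencies row quotes "AdamW optimizer and a linear learning rate scheduler" without versions. The following is a minimal sketch of that setup using PyTorch and the Transformers scheduler helper; the specific libraries, the model checkpoint, and the step counts are assumptions and placeholders, not details taken from the paper.

# Hedged sketch of the stated optimizer/scheduler choice (AdamW + linear decay).
# Libraries, checkpoint, and step counts are placeholders; the paper does not pin them.
import torch
from transformers import AutoModelForSequenceClassification, get_linear_schedule_with_warmup

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

learning_rate = 5e-5          # one value from the paper's BERT-base search space
num_training_steps = 10_000   # placeholder: epochs * batches_per_epoch

optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
)

# Typical step inside the training loop:
#   loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()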
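
The Experiment Setup row references cubic sparsity scheduling [Zhu and Gupta, 2018] with 10% warm-up and 30% cool-down. The sketch below is one plausible reading of that schedule, not code from the paper: sparsity is held at its initial value during warm-up, ramped cubically to the target, and held at the target during cool-down. The function name and its defaults are illustrative.

# Hedged sketch of a cubic sparsity schedule (Zhu & Gupta, 2018) with the
# warm-up/cool-down fractions quoted above; the exact placement of warm-up
# and cool-down is our reading of the text, not the authors' code.
def sparsity_at_step(step: int,
                     total_steps: int,
                     final_sparsity: float,
                     initial_sparsity: float = 0.0,
                     warmup_frac: float = 0.10,
                     cooldown_frac: float = 0.30) -> float:
    """Return the target sparsity for a given training step."""
    ramp_start = int(warmup_frac * total_steps)          # sparsity frozen before this step
    ramp_end = int((1.0 - cooldown_frac) * total_steps)  # sparsity frozen after this step
    if step <= ramp_start:
        return initial_sparsity
    if step >= ramp_end:
        return final_sparsity
    progress = (step - ramp_start) / (ramp_end - ramp_start)
    # Cubic interpolation from the initial sparsity to the final sparsity.
    return final_sparsity + (initial_sparsity - final_sparsity) * (1.0 - progress) ** 3

# Example: ramping to 90% sparsity over 10,000 steps.
for s in (0, 1_000, 4_000, 7_000, 10_000):
    print(s, round(sparsity_at_step(s, 10_000, 0.9), 3))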