Parameter-Efficient Sparsity for Large Language Models Fine-Tuning
Authors: Yuchao Li, Fuli Luo, Chuanqi Tan, Mengdi Wang, Songfang Huang, Shen Li, Junjie Bai
IJCAI 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments with diverse networks (i.e., BERT, RoBERTa, and GPT-2) on dozens of datasets demonstrate PST performs on par or better than previous sparsity methods, despite only training a small number of parameters. |
| Researcher Affiliation | Industry | Alibaba Group {laiyin.lyc, lfl259702, chuanqi.tcq, didou.wmd, songfang.hsf, litan.ls, j.bai}@alibaba-inc.com |
| Pseudocode | No | The paper describes the proposed method mathematically and conceptually but does not provide any pseudocode or a clearly labeled algorithm block. |
| Open Source Code | Yes | Our code is available at https://github.com/alibaba/AliceMind/tree/main/S4/PST and https://github.com/yuchaoli/PST. |
| Open Datasets | Yes | For BERT and RoBERTa, we use the GLUE benchmark [Wang et al., 2018] for evaluation. For GPT-2, we evaluate it on E2E, DART, and WebNLG. |
| Dataset Splits | Yes | For BERT and RoBERTa, we use the GLUE benchmark [Wang et al., 2018] for evaluation. GLUE provides standard, publicly defined train/development/test splits, so the splits used are reproducible. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU model, CPU type) used for running the experiments. |
| Software Dependencies | No | The paper mentions using the "AdamW optimizer and a linear learning rate scheduler" but does not specify software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | For BERT-base, we set batch size = 32 and perform a hyperparameter search over learning rate {3e-5, 5e-5, 1e-4, 5e-4} and epochs {20, 40} on QNLI, SST-2, CoLA, STS-B, MRPC, RTE, and epochs {10, 20} on MNLI, QQP. Moreover, we use a batch size of 16 for RoBERTa, as well as a hyperparameter search over learning rate {1e-5, 2e-5, 3e-5, 5e-5}; the epoch search space is the same as for BERT-base. For GPT-2, we train the model for 5 epochs using a batch size of 8 and an initial learning rate of 1e-4. At training time, we use the AdamW optimizer and a linear learning rate scheduler. All models are initialized with the pre-trained weights. We follow [Zhu and Gupta, 2018] in using a cubic sparsity schedule. We also add a few steps of warm-up at the beginning of training (10% of training steps) and cool-down at the end of training (30% of training steps), which empirically improves performance, especially in high-sparsity regimes. For PST, we set β = α1 = α2 = 1 and r1 = r2 = 8. (Sketches of the sparsity schedule and the optimizer setup appear below the table.) |
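
The cubic sparsity schedule with warm-up and cool-down described in the experiment setup can be summarized as a single function. This is a minimal sketch, not the paper's released code: the function name and argument names are illustrative assumptions, and the ramp follows the cubic schedule of [Zhu and Gupta, 2018].

```python
def cubic_sparsity(step, total_steps, final_sparsity,
                   initial_sparsity=0.0, warmup_frac=0.10, cooldown_frac=0.30):
    """Target sparsity at a given training step.

    - During the first `warmup_frac` of steps, sparsity stays at `initial_sparsity`.
    - During the last `cooldown_frac` of steps, sparsity is held at `final_sparsity`.
    - In between, sparsity ramps cubically from initial to final
      (s_t = s_f + (s_i - s_f) * (1 - progress)^3, as in Zhu and Gupta, 2018).
    """
    warmup_steps = int(warmup_frac * total_steps)
    cooldown_steps = int(cooldown_frac * total_steps)
    ramp_steps = max(total_steps - warmup_steps - cooldown_steps, 1)

    if step < warmup_steps:
        return initial_sparsity
    if step >= total_steps - cooldown_steps:
        return final_sparsity

    progress = (step - warmup_steps) / ramp_steps  # in [0, 1)
    return final_sparsity + (initial_sparsity - final_sparsity) * (1.0 - progress) ** 3
```

For example, `cubic_sparsity(step=5000, total_steps=10000, final_sparsity=0.9)` returns the sparsity target at the halfway point of training; holding the final sparsity fixed during the cool-down phase is what the quoted setup attributes the gains in high-sparsity regimes to.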
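For the optimizer and learning-rate schedule, the following is a minimal sketch of the reported configuration, assuming PyTorch and Hugging Face Transformers (the paper does not name its framework or versions, and this is not the authors' code). The BERT-base values shown (learning rate 5e-5, 20 epochs) are one point from the paper's search grid, not a recommended single setting.

```python
import torch
from transformers import AutoModelForSequenceClassification, get_linear_schedule_with_warmup

# Initialize from pre-trained weights, as stated in the experiment setup.
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

num_epochs = 20
steps_per_epoch = 1000  # placeholder: len(train_dataloader) in a real run
total_steps = num_epochs * steps_per_epoch

# AdamW optimizer with a linearly decaying learning rate, as reported.
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,  # the 10% warm-up quoted above refers to the sparsity schedule
    num_training_steps=total_steps,
)
```

In a training loop, `scheduler.step()` would be called after each `optimizer.step()` so the learning rate decays linearly over `total_steps`.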