Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Parameter-Efficient Sparsity for Large Language Models Fine-Tuning
Authors: Yuchao Li, Fuli Luo, Chuanqi Tan, Mengdi Wang, Songfang Huang, Shen Li, Junjie Bai
IJCAI 2022 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments with diverse networks (i.e., BERT, RoBERTa and GPT-2) on dozens of datasets demonstrate PST performs on par or better than previous sparsity methods, despite only training a small number of parameters. |
| Researcher Affiliation | Industry | Alibaba Group EMAIL |
| Pseudocode | No | The paper describes the proposed method mathematically and conceptually but does not provide any pseudocode or a clearly labeled algorithm block. |
| Open Source Code | Yes | Our code is available at https://github.com/alibaba/AliceMind/tree/main/S4/PST and https://github.com/yuchaoli/PST. |
| Open Datasets | Yes | For BERT and RoBERTa, we use GLUE benchmarks [Wang et al., 2018] for evaluation. For GPT-2, we evaluate it on the E2E, DART, and WebNLG. |
| Dataset Splits | Yes | For BERT and RoBERTa, we use GLUE benchmarks [Wang et al., 2018] for evaluation. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU model, CPU type) used for running the experiments. |
| Software Dependencies | No | The paper mentions using "AdamW optimizer and a linear learning rate scheduler" but does not specify software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | For BERTbase, we set batch size = 32 and perform a hyperparameter search over learning rate {3e-5, 5e-5, 1e-4, 5e-4} and epoch {20, 40} on QNLI, SST-2, CoLA, STS-B, MRPC, RTE and epoch {10, 20} on MNLI, QQP. Moreover, we use a batch size of 16 for RoBERTa, as well as a hyperparameter search over learning rate {1e-5, 2e-5, 3e-5, 5e-5}. Epoch search space is the same as BERTbase. For GPT-2, we train the model for 5 epochs using a batch size of 8 and an initial learning rate of 1e-4. At training time, we use the AdamW optimizer and a linear learning rate scheduler. All models are initialized with the pre-trained weights. We follow the [Zhu and Gupta, 2018] to use a cubic sparsity scheduling. We also add a few steps of warm-up at the beginning of training (10% training steps) and cool-down at the end of training (30% training steps), which empirically improve the performance especially in high sparsity regimes. For PST, we set β = α1 = α2 = 1 and r1 = r2 = 8. |
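
The cubic sparsity scheduling with warm-up and cool-down quoted in the Experiment Setup row can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `cubic_sparsity` and its parameters are hypothetical, and it assumes that sparsity is held at its initial value during warm-up (first 10% of steps), held at its final value during cool-down (last 30% of steps), and follows the cubic ramp of Zhu and Gupta (2018) in between.

```python
def cubic_sparsity(step, total_steps, final_sparsity,
                   initial_sparsity=0.0, warmup_frac=0.1, cooldown_frac=0.3):
    """Return the target sparsity at a given training step.

    Hypothetical helper. Assumption: warm-up keeps the initial sparsity and
    cool-down keeps the final sparsity; the quoted text only gives the
    fractions (10% and 30%), not this exact behavior.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")

    warmup_end = round(warmup_frac * total_steps)
    cooldown_start = total_steps - round(cooldown_frac * total_steps)

    if step < warmup_end:
        return initial_sparsity
    if step >= cooldown_start:
        return final_sparsity

    # Fraction of the ramp phase completed, in [0, 1).
    progress = (step - warmup_end) / (cooldown_start - warmup_end)
    # Cubic interpolation: fast pruning early, slower as sparsity grows.
    return final_sparsity + (initial_sparsity - final_sparsity) * (1.0 - progress) ** 3


if __name__ == "__main__":
    # Example: 1000 training steps, 90% target sparsity.
    for s in (0, 100, 400, 700, 999):
        print(s, cubic_sparsity(s, total_steps=1000, final_sparsity=0.9))
```

Under these assumptions the ramp occupies the middle 60% of training, so the model has the final 30% of steps to recover accuracy at the target sparsity.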