Sparse is Enough in Fine-tuning Pre-trained Large Language Models

Authors: Weixi Song, Zuchao Li, Lefei Zhang, Hai Zhao, Bo Du

ICML 2024

Each reproducibility variable below is listed with its assessed result and the LLM response that supports it.
Research Type: Experimental
  "We validate its effectiveness on a range of tasks including the GLUE Benchmark and Instruction-tuning. The code is accessible at https://github.com/song-wx/SIFT."
Researcher Affiliation: Academia
  "1. National Engineering Research Center for Multimedia Software, School of Computer Science, Wuhan University, Wuhan, 430072, P. R. China; 2. Hubei Luojia Laboratory, Wuhan, 430072, P. R. China; 3. Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, 200240, P. R. China."
Pseudocode: No
  The paper describes the SIFT method in Section 4 and illustrates it in Figure 5, but it does not include a formal pseudocode block or algorithm listing.
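For a concrete picture of what the paper describes in prose, the sketch below illustrates the general idea of sparse-increment fine-tuning: only a small, gradient-selected fraction of each weight matrix receives updates. This is an illustrative assumption, not the authors' implementation (see https://github.com/song-wx/SIFT); the helper names, the magnitude-based selection, and the plain SGD update are ours.

```python
import torch

def build_sparse_mask(grad: torch.Tensor, sparsity_rate: float = 0.008) -> torch.Tensor:
    """Mark the top `sparsity_rate` fraction of entries by gradient magnitude.
    0.8% matches the sparsity rate reported for the GLUE experiments."""
    k = max(1, int(grad.numel() * sparsity_rate))
    idx = grad.abs().flatten().topk(k).indices
    mask = torch.zeros(grad.numel(), dtype=torch.bool, device=grad.device)
    mask[idx] = True
    return mask.view_as(grad)

@torch.no_grad()
def sparse_update(param: torch.Tensor, mask: torch.Tensor, lr: float = 7e-5) -> None:
    """Apply the increment only where the mask is set. Plain SGD for brevity;
    the actual method keeps optimizer state only for the masked entries."""
    param -= lr * param.grad * mask
```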
Open Source Code: Yes
  "The code is accessible at https://github.com/song-wx/SIFT."
Open Datasets: Yes
  "We validate its effectiveness on a range of tasks including the GLUE Benchmark and Instruction-tuning... selecting the GLUE Benchmark (Wang et al., 2018) as our evaluation dataset... We adopt Llama (Touvron et al., 2023) as our backbone models, use the alpaca dataset (Taori et al., 2023) for instruction-tuning, and conduct evaluations on benchmarks such as MMLU (Hendrycks et al., 2020) and HumanEval (Chen et al., 2021)."
Dataset Splits: Yes
  "We evaluate SIFT on the GLUE benchmark (Wang et al., 2018). We select RoBERTa (Liu et al., 2019) as the backbone model for testing. The general experimental setup is consistent with (Hu et al., 2021)."
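The quoted setup relies on the standard GLUE splits. As a hedged illustration (not code from the paper), loading a task with the Hugging Face datasets library yields those splits directly; the choice of MRPC here is arbitrary:

```python
from datasets import load_dataset

# Load one GLUE task; "mrpc" is an arbitrary example, not a choice from the paper.
glue = load_dataset("glue", "mrpc")
train_set = glue["train"]        # standard GLUE train split
val_set = glue["validation"]     # standard GLUE validation split
print(len(train_set), len(val_set))
```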
Hardware Specification: Yes
  "...it is possible to fine-tune a 7B model on a single RTX 3090 24GB."
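A rough back-of-envelope calculation (our assumption-laden estimate, not a figure from the paper) shows why the 24 GB claim is plausible when only 0.8% of parameters carry optimizer state:

```python
# Assumes fp16 weights and fp32 AdamW state; activations and buffers excluded.
params = 7e9
weights_fp16 = params * 2 / 2**30            # ~13 GiB of fp16 weights

sparsity_rate = 0.008                        # 0.8% trainable, as in Table 5
trainable = params * sparsity_rate           # ~56M parameters
# fp32 master copy + AdamW m and v states + fp32 gradient, sparse entries only
sparse_states = trainable * 4 * 4 / 2**30    # ~0.8 GiB

# Dense AdamW would instead need these states for every parameter:
dense_states = params * 4 * 4 / 2**30        # ~104 GiB, far beyond 24 GB
print(f"{weights_fp16:.1f} + {sparse_states:.1f} GiB sparse vs {dense_states:.0f} GiB dense")
```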
Software Dependencies: No
  The paper mentions "PyTorch (Paszke et al., 2017)" but does not give version numbers for its software dependencies, which reproducibility requires.
Experiment Setup: Yes
  "Table 5 shows our hyper-parameter settings for the GLUE Benchmark experiments..." Optimizer: AdamW. Warmup ratio: 0.06. LR schedule: linear. Batch size: 32. Epochs: 10 / 15 / 20 / 20 / 10 / 20 / 20 / 30 (one value per GLUE task). Learning rate: 7e-5 / 7e-5 / 7e-5 / 7e-5 / 5e-5 / 7e-5 / 7e-5 / 8e-5 (one value per GLUE task). Weight decay: 0.1. Max seq. length: 512. Sparsity rate: 0.8%. SIFT modules: Wq, Wk, Wv, Wo.
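As a sketch only, the reported optimizer settings map onto standard PyTorch and transformers APIs as follows; sparse_params is a hypothetical stand-in for the 0.8% of parameters SIFT actually updates:

```python
import torch
from transformers import get_linear_schedule_with_warmup

def make_optimizer(sparse_params, num_training_steps: int, lr: float = 7e-5):
    """AdamW with weight decay 0.1 and a linear schedule with 6% warmup,
    matching the Table 5 settings quoted above."""
    optimizer = torch.optim.AdamW(sparse_params, lr=lr, weight_decay=0.1)
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(0.06 * num_training_steps),
        num_training_steps=num_training_steps,
    )
    return optimizer, scheduler
```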