Sparse is Enough in Fine-tuning Pre-trained Large Language Models

Authors: Weixi Song, Zuchao Li, Lefei Zhang, Hai Zhao, Bo Du

ICML 2024

Each reproducibility variable below is listed with its assessed result and the LLM response that supports it.
Research Type: Experimental
  "We validate its effectiveness on a range of tasks including the GLUE Benchmark and Instruction-tuning. The code is accessible at https://github.com/song-wx/SIFT."
Researcher Affiliation: Academia
  "1. National Engineering Research Center for Multimedia Software, School of Computer Science, Wuhan University, Wuhan, 430072, P. R. China; 2. Hubei Luojia Laboratory, Wuhan, 430072, P. R. China; 3. Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, 200240, P. R. China."
Pseudocode: No
  The paper describes the SIFT method in Section 4 and illustrates it in Figure 5, but it does not include a formal pseudocode block or algorithm listing.
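For a concrete picture of what the paper describes in prose, the sketch below illustrates the general idea of sparse-increment fine-tuning: only a small, gradient-selected fraction of each weight matrix receives updates. This is an illustrative assumption, not the authors' implementation (see https://github.com/song-wx/SIFT); the helper names, the magnitude-based selection, and the plain SGD update are ours.

```python
import torch

def build_sparse_mask(grad: torch.Tensor, sparsity_rate: float = 0.008) -> torch.Tensor:
    """Mark the top `sparsity_rate` fraction of entries by gradient magnitude.
    0.8% matches the sparsity rate reported for the GLUE experiments."""
    k = max(1, int(grad.numel() * sparsity_rate))
    idx = grad.abs().flatten().topk(k).indices
    mask = torch.zeros(grad.numel(), dtype=torch.bool, device=grad.device)
    mask[idx] = True
    return mask.view_as(grad)

@torch.no_grad()
def sparse_update(param: torch.Tensor, mask: torch.Tensor, lr: float = 7e-5) -> None:
    """Apply the increment only where the mask is set. Plain SGD for brevity;
    the actual method keeps optimizer state only for the masked entries."""
    param -= lr * param.grad * mask
```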
Open Source Code: Yes
  "The code is accessible at https://github.com/song-wx/SIFT."
Open Datasets: Yes
  "We validate its effectiveness on a range of tasks including the GLUE Benchmark and Instruction-tuning... selecting the GLUE Benchmark (Wang et al., 2018) as our evaluation dataset... We adopt Llama (Touvron et al., 2023) as our backbone models, use the alpaca dataset (Taori et al., 2023) for instruction-tuning, and conduct evaluations on benchmarks such as MMLU (Hendrycks et al., 2020) and HumanEval (Chen et al., 2021)."
Dataset Splits: Yes
  "We evaluate SIFT on the GLUE benchmark (Wang et al., 2018). We select RoBERTa (Liu et al., 2019) as the backbone model for testing. The general experimental setup is consistent with (Hu et al., 2021)."
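The quoted setup relies on the standard GLUE splits. As a hedged illustration (not code from the paper), loading a task with the Hugging Face datasets library yields those splits directly; the choice of MRPC here is arbitrary:

```python
from datasets import load_dataset

# Load one GLUE task; "mrpc" is an arbitrary example, not a choice from the paper.
glue = load_dataset("glue", "mrpc")
train_set = glue["train"]        # standard GLUE train split
val_set = glue["validation"]     # standard GLUE validation split
print(len(train_set), len(val_set))
```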
Hardware Specification: Yes
  "...it is possible to fine-tune a 7B model on a single RTX 3090 24GB."
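A rough back-of-envelope calculation (our assumption-laden estimate, not a figure from the paper) shows why the 24 GB claim is plausible when only 0.8% of parameters carry optimizer state:

```python
# Assumes fp16 weights and fp32 AdamW state; activations and buffers excluded.
params = 7e9
weights_fp16 = params * 2 / 2**30            # ~13 GiB of fp16 weights

sparsity_rate = 0.008                        # 0.8% trainable, as in Table 5
trainable = params * sparsity_rate           # ~56M parameters
# fp32 master copy + AdamW m and v states + fp32 gradient, sparse entries only
sparse_states = trainable * 4 * 4 / 2**30    # ~0.8 GiB

# Dense AdamW would instead need these states for every parameter:
dense_states = params * 4 * 4 / 2**30        # ~104 GiB, far beyond 24 GB
print(f"{weights_fp16:.1f} + {sparse_states:.1f} GiB sparse vs {dense_states:.0f} GiB dense")
```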
Software Dependencies: No
  The paper mentions "PyTorch (Paszke et al., 2017)" but does not give version numbers for its software dependencies, which reproducibility requires.
Experiment Setup: Yes
  "Table 5 shows our hyper-parameter settings for the GLUE Benchmark experiments..." Optimizer: AdamW. Warmup ratio: 0.06. LR schedule: linear. Batch size: 32. Epochs: 10 / 15 / 20 / 20 / 10 / 20 / 20 / 30 (one value per GLUE task). Learning rate: 7e-5 / 7e-5 / 7e-5 / 7e-5 / 5e-5 / 7e-5 / 7e-5 / 8e-5 (one value per GLUE task). Weight decay: 0.1. Max seq. length: 512. Sparsity rate: 0.8%. SIFT modules: Wq, Wk, Wv, Wo.
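As a sketch only, the reported optimizer settings map onto standard PyTorch and transformers APIs as follows; sparse_params is a hypothetical stand-in for the 0.8% of parameters SIFT actually updates:

```python
import torch
from transformers import get_linear_schedule_with_warmup

def make_optimizer(sparse_params, num_training_steps: int, lr: float = 7e-5):
    """AdamW with weight decay 0.1 and a linear schedule with 6% warmup,
    matching the Table 5 settings quoted above."""
    optimizer = torch.optim.AdamW(sparse_params, lr=lr, weight_decay=0.1)
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(0.06 * num_training_steps),
        num_training_steps=num_training_steps,
    )
    return optimizer, scheduler
```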