Exploring the Benefit of Activation Sparsity in Pre-training
Authors: Zhengyan Zhang, Chaojun Xiao, Qiujieli Qin, Yankai Lin, Zhiyuan Zeng, Xu Han, Zhiyuan Liu, Ruobing Xie, Maosong Sun, Jie Zhou
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments on three representative text models, including GPT (Radford et al., 2019), BERT (Devlin et al., 2019), and T5 (Raffel et al., 2020), with different architectures and pre-training objectives. |
| Researcher Affiliation | Collaboration | 1) NLP Group, DCST, IAI, BNRIST, Tsinghua University; 2) Gaoling School of Artificial Intelligence, Renmin University of China; 3) Tencent; 4) Jiangsu Collaborative Innovation Center for Language Ability, Xuzhou, China. |
| Pseudocode | No | No pseudocode or algorithm blocks are present in the paper. |
| Open Source Code | Yes | Codes are available at https://github.com/thunlp/moefication. |
| Open Datasets | Yes | Pre-training corpus: We use the Pile dataset (Gao et al., 2021a) as the pre-training corpus. |
| Dataset Splits | Yes | We save the model checkpoints every 4,000 steps and calculate the activation sparsity of each checkpoint on the validation corpus. |
| Hardware Specification | Yes | The inference time is measured on a single NVIDIA RTX 3090 GPU... We use four NVIDIA A800 GPUs for training... |
| Software Dependencies | No | The paper mentions several software components, such as the 'Adam optimizer', 'Noam learning rate scheduler', 'ScatterMoE library', 'MegaBlocks framework', and 'faiss-gpu library', but does not provide specific version numbers for any of them. |
| Experiment Setup | Yes | The training epoch is set to 10, which contains about 200,000 steps, and the warmup steps are set to 2,000. The batch size is set to 512 and the learning rate is set to 1 for BERT and T5, and 0.5 for GPT. The mask rate of MLM is set to 0.15. For MoE-Dropout and SSD, we set the number of experts to 32 and the number of selected experts to 6. For SSD, we set the threshold τ to 0.9 and monitor the activation pattern every 3,000 steps. |
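For quick reference, the reported setup can be collected into a single configuration sketch. This is a minimal illustration assembled from the values quoted above; the field names (e.g. `learning_rate`, `ssd_threshold_tau`) are our own labels, not identifiers from the authors' released code.

```python
# Sketch of the reported pre-training hyperparameters (values from the paper;
# keys are illustrative and not taken from the authors' implementation).
pretrain_config = {
    "epochs": 10,                       # about 200,000 training steps in total
    "warmup_steps": 2_000,
    "batch_size": 512,
    "learning_rate": {"bert": 1.0, "t5": 1.0, "gpt": 0.5},  # with the Noam scheduler
    "mlm_mask_rate": 0.15,              # masked-language-modeling mask rate
    "optimizer": "Adam",
    "lr_scheduler": "Noam",
    # MoE-Dropout / SSD settings
    "num_experts": 32,
    "num_selected_experts": 6,
    "ssd_threshold_tau": 0.9,
    "ssd_monitor_interval_steps": 3_000,
    # Checkpoints are saved every 4,000 steps and activation sparsity is
    # evaluated on the validation corpus at each checkpoint.
    "checkpoint_interval_steps": 4_000,
}
```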