Exploring the Benefit of Activation Sparsity in Pre-training

Authors: Zhengyan Zhang, Chaojun Xiao, Qiujieli Qin, Yankai Lin, Zhiyuan Zeng, Xu Han, Zhiyuan Liu, Ruobing Xie, Maosong Sun, Jie Zhou

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct experiments on three representative text models, including GPT (Radford et al., 2019), BERT (Devlin et al., 2019), and T5 (Raffel et al., 2020), with different architectures and pre-training objectives.
Researcher Affiliation | Collaboration | (1) NLP Group, DCST, IAI, BNRIST, Tsinghua University; (2) Gaoling School of Artificial Intelligence, Renmin University of China; (3) Tencent; (4) Jiangsu Collaborative Innovation Center for Language Ability, Xuzhou, China.
Pseudocode | No | No pseudocode or algorithm blocks are present in the paper.
Open Source Code | Yes | Codes are available at https://github.com/thunlp/moefication.
Open Datasets | Yes | Pre-training corpus: We use the Pile dataset (Gao et al., 2021a) as the pre-training corpus.
Dataset Splits | Yes | We save the model checkpoints every 4,000 steps and calculate the activation sparsity of each checkpoint on the validation corpus.
Hardware Specification | Yes | The inference time is measured on a single NVIDIA RTX 3090 GPU... We use four NVIDIA A800 GPUs for training...
Software Dependencies | No | The paper mentions several software components, including the Adam optimizer, the Noam learning rate scheduler, the ScatterMoE library, the MegaBlocks framework, and the faiss-gpu library, but does not provide specific version numbers for any of them.
Experiment Setup | Yes | The training epoch is set to 10, which contains about 200,000 steps, and the warmup steps are set to 2,000. The batch size is set to 512 and the learning rate is set to 1 for BERT and T5 and 0.5 for GPT. The mask rate of MLM is set to 0.15. For MoE-Dropout and SSD, we set the number of experts to 32 and the number of selected experts to 6. For SSD, we set the threshold τ to 0.9 and monitor the activation pattern every 3,000 steps.
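
The Dataset Splits row above reports that activation sparsity is computed for each checkpoint on a validation corpus. Below is a minimal sketch of how such a measurement could be taken for a ReLU-activated model; the hook-based approach, and the reading of sparsity as the fraction of exactly-zero post-ReLU activations, are our assumptions, not details confirmed by the paper.

```python
import torch

@torch.no_grad()
def activation_sparsity(model, dataloader, device="cuda"):
    """Fraction of exactly-zero activations after each ReLU, averaged
    over a validation corpus. The hook-based measurement is an
    assumption; the paper does not specify its implementation."""
    captured = []

    def hook(_module, _inputs, output):
        captured.append(output)

    # Attach a hook to every ReLU so its output is recorded per batch.
    handles = [m.register_forward_hook(hook)
               for m in model.modules() if isinstance(m, torch.nn.ReLU)]
    zeros = total = 0
    model.eval()
    for batch in dataloader:  # batch assumed to be a token-id tensor
        captured.clear()
        model(batch.to(device))
        for act in captured:
            zeros += (act == 0).sum().item()
            total += act.numel()
    for h in handles:
        h.remove()
    return zeros / total
```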
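The Experiment Setup row fixes 2,000 warmup steps and peak learning rate factors of 1 (BERT, T5) and 0.5 (GPT), and the Software Dependencies row names a Noam scheduler. A sketch of that schedule, assuming the standard Noam formula and an illustrative hidden size of 768 (the hidden size is not stated in this excerpt):

```python
def noam_lr(step: int, d_model: int = 768, warmup: int = 2000,
            factor: float = 1.0) -> float:
    """Noam schedule: linear warmup followed by inverse-sqrt decay.

    factor=1.0 matches the reported BERT/T5 setting and factor=0.5
    the GPT setting; d_model=768 is an assumed hidden size."""
    step = max(step, 1)  # avoid 0 ** -0.5 at the first step
    return factor * d_model ** -0.5 * min(step ** -0.5,
                                          step * warmup ** -1.5)
```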
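The same row specifies the MoE-Dropout/SSD configuration as 32 experts with 6 selected. A sketch of top-k expert selection under those numbers; the softmax-then-top-k router with renormalized gates is a common design we assume here, not a detail confirmed by the excerpt.

```python
import torch

def select_experts(router_logits: torch.Tensor, k: int = 6):
    """Pick the top-k of 32 experts per token and renormalize their
    gate weights. router_logits has shape (num_tokens, 32)."""
    gates = torch.softmax(router_logits, dim=-1)
    topk_gates, topk_idx = gates.topk(k, dim=-1)
    topk_gates = topk_gates / topk_gates.sum(dim=-1, keepdim=True)
    return topk_idx, topk_gates
```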