Exploring the Benefit of Activation Sparsity in Pre-training
Authors: Zhengyan Zhang, Chaojun Xiao, Qiujieli Qin, Yankai Lin, Zhiyuan Zeng, Xu Han, Zhiyuan Liu, Ruobing Xie, Maosong Sun, Jie Zhou
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments on three representative text models, including GPT (Radford et al., 2019), BERT (Devlin et al., 2019), and T5 (Raffel et al., 2020), with different architectures and pre-training objectives. |
| Researcher Affiliation | Collaboration | 1) NLP Group, DCST, IAI, BNRIST, Tsinghua University; 2) Gaoling School of Artificial Intelligence, Renmin University of China; 3) Tencent; 4) Jiangsu Collaborative Innovation Center for Language Ability, Xuzhou, China. |
| Pseudocode | No | No pseudocode or algorithm blocks are present in the paper. |
| Open Source Code | Yes | Codes are available at https://github.com/thunlp/moefication. |
| Open Datasets | Yes | Pre-training corpus: We use the Pile dataset (Gao et al., 2021a) as the pre-training corpus. |
| Dataset Splits | Yes | We save the model checkpoints every 4,000 steps and calculate the activation sparsity of each checkpoint on the validation corpus. |
| Hardware Specification | Yes | The inference time is measured on a single NVIDIA RTX 3090 GPU... We use four NVIDIA A800 GPUs for training... |
| Software Dependencies | No | The paper mentions several software components, such as the 'Adam optimizer', 'Noam learning rate scheduler', 'ScatterMoE library', 'MegaBlocks framework', and 'faiss-gpu library', but does not provide specific version numbers for any of them. |
| Experiment Setup | Yes | The training epoch is set to 10, which contains about 200,000 steps, and the warmup steps are set to 2,000. The batch size is set to 512 and the learning rate is set to 1 for BERT and T5, and 0.5 for GPT. The mask rate of MLM is set to 0.15. For MoE-Dropout and SSD, we set the number of experts to 32 and the number of selected experts to 6. For SSD, we set the threshold τ to 0.9 and monitor the activation pattern every 3,000 steps. |
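For quick reference, the reported setup can be collected into a single configuration sketch. This is a minimal illustration assembled from the values quoted above; the field names (e.g. `learning_rate`, `ssd_threshold_tau`) are our own labels, not identifiers from the authors' released code.

```python
# Sketch of the reported pre-training hyperparameters (values from the paper;
# keys are illustrative and not taken from the authors' implementation).
pretrain_config = {
    "epochs": 10,                       # about 200,000 training steps in total
    "warmup_steps": 2_000,
    "batch_size": 512,
    "learning_rate": {"bert": 1.0, "t5": 1.0, "gpt": 0.5},  # with the Noam scheduler
    "mlm_mask_rate": 0.15,              # masked-language-modeling mask rate
    "optimizer": "Adam",
    "lr_scheduler": "Noam",
    # MoE-Dropout / SSD settings
    "num_experts": 32,
    "num_selected_experts": 6,
    "ssd_threshold_tau": 0.9,
    "ssd_monitor_interval_steps": 3_000,
    # Checkpoints are saved every 4,000 steps and activation sparsity is
    # evaluated on the validation corpus at each checkpoint.
    "checkpoint_interval_steps": 4_000,
}
```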