Merge, Then Compress: Demystify Efficient SMoE with Hints from Its Routing Policy

Authors: Pingzhi Li, Zhenyu Zhang, Prateek Yadav, Yi-Lin Sung, Yu Cheng, Mohit Bansal, Tianlong Chen

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments across 8 benchmarks validate the effectiveness of our proposals. For instance, our MC-SMoE achieves up to 80% memory and a 20% FLOPs reduction, with virtually no loss in performance.
Researcher Affiliation | Collaboration | Pingzhi Li [1], Zhenyu Zhang [2], Prateek Yadav [1], Yi-Lin Sung [1], Yu Cheng [3], Mohit Bansal [1], Tianlong Chen [1,4,5]; [1] The University of North Carolina at Chapel Hill, [2] The University of Texas at Austin, [3] The Chinese University of Hong Kong, [4] MIT, [5] Harvard University
Pseudocode | Yes | Algorithm 1: The Overall Procedures of MC-SMoE.
Open Source Code | Yes | Our code is provided at https://github.com/UNITES-Lab/MC-SMoE.
Open Datasets | Yes | We use eight popular NLP tasks for supervised fine-tuning and evaluation: SST2 (Socher et al., 2013) for sentiment classification, MRPC (Dolan & Brockett, 2005) for paraphrase identification, MultiRC (Khashabi et al., 2018) for multiple-choice QA, COPA (Gordon et al., 2012) for sentence completion, WinoGrande (Sakaguchi et al., 2019) for coreference resolution, SQuAD v1.1 (Rajpurkar et al., 2016) for extractive QA, WikiQA (Yang et al., 2015) and HotpotQA (Yang et al., 2018) for closed-book QA.
Dataset Splits | Yes | This encompasses batch sizes from {8, 16, 32, 64}, learning rates from {3 × 10⁻⁴, 1 × 10⁻⁴, 3 × 10⁻⁵, 1 × 10⁻⁵}, and epoch counts spanning {3, 5, 10, 20}, to pinpoint the optimal fine-tuned models.
Hardware Specification | Yes | All experiments are conducted with PyTorch and DeepSpeed on NVIDIA A100 and A6000.
Software Dependencies | No | The paper mentions PyTorch and DeepSpeed but does not specify version numbers for these software components.
Experiment Setup | Yes | This encompasses batch sizes from {8, 16, 32, 64}, learning rates from {3 × 10⁻⁴, 1 × 10⁻⁴, 3 × 10⁻⁵, 1 × 10⁻⁵}, and epoch counts spanning {3, 5, 10, 20}, to pinpoint the optimal fine-tuned models. Further fine-tuning hyper-parameters are fixed, as shown in Appendix Table A15. After merging and compression, we proceed to fine-tune the condensed model to restore its performance. Further, we apply knowledge distillation (KD) to compel the M-SMoE and MC-SMoE models to imitate the outputs generated by the full SMoE model on the training dataset. The hyper-parameters in the added KD loss are fixed for all tasks; please refer to Appendix A2 for more details.
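The hyper-parameter search quoted in the Dataset Splits and Experiment Setup rows is a plain grid sweep over batch size, learning rate, and epoch count. Below is a minimal sketch of such a sweep under that reading; `fine_tune_and_eval` is a hypothetical callback standing in for the fine-tuning entry point of the released MC-SMoE repository, which defines its own scripts and arguments.

```python
# Hedged sketch of the grid sweep over the reported hyper-parameter ranges.
# `fine_tune_and_eval` is a hypothetical callback, not part of the MC-SMoE repo.
from itertools import product

BATCH_SIZES = [8, 16, 32, 64]
LEARNING_RATES = [3e-4, 1e-4, 3e-5, 1e-5]
EPOCH_COUNTS = [3, 5, 10, 20]

def grid_search(fine_tune_and_eval):
    """Return the configuration with the best validation score."""
    best_score, best_config = float("-inf"), None
    for bs, lr, epochs in product(BATCH_SIZES, LEARNING_RATES, EPOCH_COUNTS):
        score = fine_tune_and_eval(batch_size=bs, learning_rate=lr, num_epochs=epochs)
        if score > best_score:
            best_score = score
            best_config = {"batch_size": bs, "learning_rate": lr, "num_epochs": epochs}
    return best_config, best_score
```

The Experiment Setup row also mentions a knowledge-distillation (KD) loss that pushes the compressed M-SMoE/MC-SMoE student to imitate the full SMoE teacher, with the KD hyper-parameters fixed in the paper's Appendix A2 (not quoted here). The snippet below is a generic temperature-scaled distillation objective of that kind, not the paper's exact formulation; `alpha` and `temperature` are illustrative placeholders.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, alpha=0.5, temperature=2.0):
    """Generic KD objective: soft KL term against the teacher plus a hard CE term.
    alpha and temperature are placeholders, not the values fixed in Appendix A2."""
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```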