A Provably Effective Method for Pruning Experts in Fine-tuned Sparse Mixture-of-Experts

Authors: Mohammed Nowaz Rabbani Chowdhury, Meng Wang, Kaoutar El Maghraoui, Naigang Wang, Pin-Yu Chen, Christopher Carothers

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "our expert pruning method is verified on large vision MoE models such as V-MoE and E3-MoE fine-tuned on benchmark datasets such as CIFAR-10, CIFAR-100, and ImageNet." Empirical validation: We provide experimental demonstration of the proposed pruning technique's effectiveness on state-of-the-art vision MoE models. "We evaluate on several vision MoE (V-MoE) (Riquelme et al., 2021) and ensembles of vision MoE (known as the efficient ensemble of experts, E3) (Allingham et al., 2022) models with thousands of millions of parameters, fine-tuned on benchmark datasets such as CIFAR-10, CIFAR-100, and ImageNet."
Researcher Affiliation | Collaboration | 1Department of Electrical, Computer and Systems Engineering, Rensselaer Polytechnic Institute, NY, USA; 2IBM Research, Yorktown Heights, NY, USA; 3Department of Computer Science, Rensselaer Polytechnic Institute, NY, USA.
Pseudocode | Yes | Algorithm 1: The Expert Pruning Algorithm for the Theoretical Analysis
Open Source Code | No | The paper does not provide an explicit statement about releasing the source code for the methodology described, nor does it include a link to a code repository.
Open Datasets | Yes | "verified on large vision MoE models such as V-MoE and E3-MoE fine-tuned on benchmark datasets such as CIFAR-10, CIFAR-100, and ImageNet." Krizhevsky, A. Learning multiple layers of features from tiny images. Technical report, Canadian Institute For Advanced Research, 2009. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115:211–252, 2015.
Dataset Splits | No | The paper mentions fine-tuning on benchmark datasets (CIFAR-10, CIFAR-100, ImageNet) and refers to a "validation set" in the introduction, but it does not provide specific details about the training, validation, and test dataset splits, such as percentages, sample counts, or references to predefined splits.
Hardware Specification | Yes | "We implement the pruned model in parallel into two NVIDIA RTX A5000 GPUs for inference and the post-pruning fine-tuning."
Software Dependencies | No | The paper describes training methods like "SGD with momentum and cosine learning rate decay" but does not specify any software libraries or frameworks (e.g., PyTorch, TensorFlow) along with their version numbers, which would be necessary for reproducible software dependencies.
Experiment Setup | Yes | "For post-pruning fine-tuning, as the model size is large, we divide the batch size in half from the original case for CIFAR-10 and CIFAR-100; however, the number of steps is the same. For the same reason, for ImageNet we divide the original batch size, and likewise the learning rate, by 32, and hence increase the post-pruning fine-tuning steps by 32 times the original. The rest of the hyperparameters are the same as in the original fine-tuning process described by the authors."
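The ImageNet scaling rule quoted above (divide batch size and learning rate by the same factor, multiply steps by that factor, keeping the total number of examples seen constant) can be sketched as simple arithmetic. The base values below are placeholders, not the paper's actual hyperparameters:

```python
def scale_finetune_config(batch_size, lr, steps, divisor):
    """Shrink the batch by `divisor`, reduce the learning rate
    proportionally, and grow the step count proportionally so the
    total number of training examples processed stays the same."""
    return {
        "batch_size": batch_size // divisor,
        "lr": lr / divisor,
        "steps": steps * divisor,
    }

# ImageNet case from the quote: divisor of 32.
# (For CIFAR-10/100 the paper only halves the batch and keeps the
# step count fixed, so this helper does not apply there as-is.)
# batch_size=4096, lr=0.03, steps=10_000 are illustrative values only.
imagenet_cfg = scale_finetune_config(batch_size=4096, lr=0.03, steps=10_000, divisor=32)
```

Note that total examples seen is preserved: `batch_size * steps` is identical before and after scaling, which is why the step count must grow by exactly the batch-size divisor.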