MomentumSMoE: Integrating Momentum into Sparse Mixture of Experts

Authors: Rachel S.Y. Teo, Tan Nguyen

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We theoretically prove and numerically demonstrate that MomentumSMoE is more stable and robust than SMoE. In particular, we verify the advantages of MomentumSMoE over SMoE on a variety of practical tasks including ImageNet-1K object recognition and WikiText-103 language modeling.
Researcher Affiliation | Academia | Rachel S.Y. Teo, Department of Mathematics, National University of Singapore, rachel.tsy@u.nus.edu; Tan M. Nguyen, Department of Mathematics, National University of Singapore, tanmn@nus.edu.sg
Pseudocode | Yes | In this section, we provide the pseudocode, written in Python, for MomentumSMoE, AdamSMoE, and Robust MomentumSMoE for clarification on our implementation. These are found in Figures 11, 12, and 13, respectively. (A hedged sketch of the momentum update is given after this table.)
Open Source Code | Yes | The code is publicly available at https://github.com/rachtsy/MomentumSMoE.
Open Datasets | Yes | "Dataset: The WikiText-103 dataset [43] is derived from Wikipedia articles..." and "Datasets: We use the ImageNet-1K dataset that contains 1.28M training images and 50K validation images."
Dataset Splits | Yes | "The validation and test sets have 218,000 and 246,000 words, respectively, with both sets comprising 60 articles and totaling about 268,000 words." and "ImageNet-1K dataset that contains 1.28M training images and 50K validation images."
Hardware Specification | Yes | All experiments are conducted on a server with 8 A100 GPUs.
Software Dependencies | No | The paper mentions using PyTorch for implementation but does not specify its version or the versions of other software dependencies.
Experiment Setup | Yes | "The small models train for 60 epochs, the medium and large SMoE models train for 80 epochs, and the GLaM models train for 120 epochs without any additional load balancing loss." and "We notice that MomentumSMoE is robust to the choice of µ, and we select µ = 0.7 for the final comparison with the baseline SMoE. On the other hand, when the value of γ is too small, there is an adverse effect on the model. Hence, we select γ = 1.0."
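The paper's actual pseudocode lives in its Figures 11-13 and in the public repository; as a rough illustration of how the reported hyperparameters µ = 0.7 and γ = 1.0 enter the layer update, the following is a minimal sketch, assuming a heavy-ball formulation in which the SMoE output plays the role of a negative gradient on the residual stream. The class name, the `smoe` submodule, and the forward signature are all hypothetical and are not the authors' implementation.

```python
import torch
import torch.nn as nn
from typing import Optional, Tuple


class MomentumSMoEBlock(nn.Module):
    """Illustrative heavy-ball wrapper around a generic SMoE layer (sketch only).

    `smoe` stands in for any sparse mixture-of-experts module that maps token
    representations x to an output of the same shape.
    """

    def __init__(self, smoe: nn.Module, mu: float = 0.7, gamma: float = 1.0):
        super().__init__()
        self.smoe = smoe      # placeholder SMoE layer
        self.mu = mu          # momentum coefficient µ (paper reports 0.7)
        self.gamma = gamma    # step size γ (paper reports 1.0)

    def forward(
        self, x: torch.Tensor, p_prev: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Treat -SMoE(x) as a gradient-like direction, accumulate it with
        # heavy-ball momentum, then take a step on the token representations.
        if p_prev is None:
            p_prev = torch.zeros_like(x)
        p = -self.smoe(x) + self.mu * p_prev   # momentum state carried to the next layer
        x_next = x - self.gamma * p            # residual-style token update
        return x_next, p
```

As a sanity check on this sketch, setting µ = 0 and γ = 1 reduces the update to the standard residual SMoE step x_{l+1} = x_l + SMoE(x_l).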