MomentumSMoE: Integrating Momentum into Sparse Mixture of Experts
Authors: Rachel S.Y. Teo, Tan Nguyen
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We theoretically prove and numerically demonstrate that MomentumSMoE is more stable and robust than SMoE. In particular, we verify the advantages of MomentumSMoE over SMoE on a variety of practical tasks including ImageNet-1K object recognition and WikiText-103 language modeling. |
| Researcher Affiliation | Academia | Rachel S.Y. Teo Department of Mathematics National University of Singapore rachel.tsy@u.nus.edu Tan M. Nguyen Department of Mathematics National University of Singapore tanmn@nus.edu.sg |
| Pseudocode | Yes | In this section, we provide the pseudocode as written in Python for MomentumSMoE, AdamSMoE, and Robust MomentumSMoE for clarification on our implementation. These are found in Figures 11, 12 and 13 respectively. |
| Open Source Code | Yes | The code is publicly available at https://github.com/rachtsy/MomentumSMoE. |
| Open Datasets | Yes | Dataset: The WikiText-103 dataset [43] is derived from Wikipedia articles... and Datasets: We use the ImageNet-1K dataset that contains 1.28M training images and 50K validation images. |
| Dataset Splits | Yes | The validation and test sets have 218,000 and 246,000 words, respectively, with both sets comprising 60 articles and totaling about 268,000 words. and ImageNet-1K dataset that contains 1.28M training images and 50K validation images. |
| Hardware Specification | Yes | All experiments are conducted on a server with 8 A100 GPUs. |
| Software Dependencies | No | The paper mentions using PyTorch for implementation but does not specify its version or versions of other software dependencies. |
| Experiment Setup | Yes | The small models train for 60 epochs, the medium and large SMoE models train for 80 epochs and the GLaM models train for 120 epochs without any additional load balancing loss. and We notice that MomentumSMoE is robust to the choice of µ, and we select µ = 0.7 for the final comparison with the baseline SMoE. On the other hand, when the value of γ is too small, there is an adverse effect on the model. Hence, we select γ = 1.0. |
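
To make the momentum update concrete, below is a minimal PyTorch sketch of a momentum-wrapped SMoE block using the µ = 0.7 and γ = 1.0 values quoted in the table. The `SimpleSMoE` routing module, the `MomentumSMoEBlock` name, and the exact sign convention are illustrative assumptions, not the authors' released code; their repository and the pseudocode in Figures 11-13 are the authoritative reference.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleSMoE(nn.Module):
    """Illustrative top-1 sparse mixture-of-experts block (an assumption, not the authors' router)."""
    def __init__(self, dim, num_experts=4, hidden=256):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        )

    def forward(self, x):
        # x: (tokens, dim); route each token to its top-1 expert.
        scores = F.softmax(self.gate(x), dim=-1)
        weight, idx = scores.max(dim=-1)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = idx == e
            if mask.any():
                out[mask] = weight[mask].unsqueeze(-1) * expert(x[mask])
        return out

class MomentumSMoEBlock(nn.Module):
    """Heavy-ball-style wrapper around an SMoE layer (sketch of the idea only)."""
    def __init__(self, dim, mu=0.7, gamma=1.0, **smoe_kwargs):
        super().__init__()
        self.smoe = SimpleSMoE(dim, **smoe_kwargs)
        self.mu, self.gamma = mu, gamma

    def forward(self, x, p=None):
        if p is None:
            p = torch.zeros_like(x)
        # Accumulate the (negative) SMoE output as a momentum state across blocks,
        # then take a step of size gamma along it.
        p = -self.smoe(x) + self.mu * p
        x = x - self.gamma * p
        return x, p

# Usage: thread the momentum state p through a stack of blocks.
blocks = nn.ModuleList(MomentumSMoEBlock(dim=64) for _ in range(4))
x, p = torch.randn(10, 64), None
for blk in blocks:
    x, p = blk(x, p)
```

Under this sign convention, setting µ = 0 and γ = 1 reduces the block to the standard residual update x ← x + SMoE(x), which is why momentum can be added without changing the layer's parameter count.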