Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
MomentumSMoE: Integrating Momentum into Sparse Mixture of Experts
Authors: Rachel S.Y. Teo, Tan Nguyen
NeurIPS 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We theoretically prove and numerically demonstrate that MomentumSMoE is more stable and robust than SMoE. In particular, we verify the advantages of MomentumSMoE over SMoE on a variety of practical tasks including ImageNet-1K object recognition and WikiText-103 language modeling. |
| Researcher Affiliation | Academia | Rachel S.Y. Teo Department of Mathematics National University of Singapore EMAIL Tan M. Nguyen Department of Mathematics National University of Singapore EMAIL |
| Pseudocode | Yes | In this section, we provide the pseudocode as written in Python for MomentumSMoE, AdamSMoE, and Robust MomentumSMoE for clarification on our implementation. These are found in Figures 11, 12 and 13 respectively. |
| Open Source Code | Yes | The code is publicly available at https://github.com/rachtsy/MomentumSMoE. |
| Open Datasets | Yes | Dataset: The WikiText-103 dataset [43] is derived from Wikipedia articles... and Datasets: We use the ImageNet-1K dataset that contains 1.28M training images and 50K validation images. |
| Dataset Splits | Yes | The validation and test sets have 218,000 and 246,000 words, respectively, with both sets comprising 60 articles and totaling about 268,000 words. and ImageNet-1K dataset that contains 1.28M training images and 50K validation images. |
| Hardware Specification | Yes | All experiments are conducted on a server with 8 A100 GPUs. |
| Software Dependencies | No | The paper mentions using PyTorch for implementation but does not specify its version or versions of other software dependencies. |
| Experiment Setup | Yes | The small models train for 60 epochs, the medium and large SMoE models train for 80 epochs and the GLaM models train for 120 epochs without any additional load balancing loss. and We notice that MomentumSMoE is robust to the choice of µ, and we select µ = 0.7 for the final comparison with the baseline SMoE. On the other hand, when the value of γ is too small, there is an adverse effect on the model. Hence, we select γ = 1.0. |
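To make the quoted hyperparameters concrete, the sketch below shows a generic heavy-ball momentum update applied to a layer's residual stream, using the µ = 0.7 and γ = 1.0 values the report quotes. This is a minimal illustration of the momentum mechanism, not the authors' implementation; the toy `smoe_out` stand-in (a simple scaling of the input) and the exact update form are assumptions for demonstration, and the real MomentumSMoE layer operates on tensors with a sparse mixture-of-experts producing `smoe_out`.

```python
def momentum_layer_update(x, p, smoe_out, mu=0.7, gamma=1.0):
    """One heavy-ball momentum step on the residual stream (toy sketch).

    x        : current token representation (scalar here for simplicity)
    p        : momentum buffer carried across layers
    smoe_out : output of the (sparse) expert mixture for x
    mu, gamma: momentum coefficient and step size; 0.7 and 1.0 are the
               values quoted in the Experiment Setup row above.
    """
    p_new = mu * p + smoe_out   # accumulate a decaying sum of expert outputs
    x_new = x + gamma * p_new   # residual update driven by the momentum buffer
    return x_new, p_new

# Toy usage: a stand-in "expert" that just scales its input by 0.1.
x, p = 1.0, 0.0
for _ in range(3):
    x, p = momentum_layer_update(x, p, smoe_out=0.1 * x)
```

With µ = 0, the update reduces to a plain residual connection `x + γ·smoe_out`; the momentum buffer is what smooths the layer-to-layer dynamics that the stability claims in the Research Type row refer to.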