Multi-Head Mixture-of-Experts
Authors: Xun Wu, Shaohan Huang, Wenhui Wang, Shuming Ma, Li Dong, Furu Wei
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experimental results across different parameter scales (300M to 7B) and three pre-training tasks (English-focused language modeling, multi-lingual language modeling, and masked multi-modality modeling), along with multiple downstream validation tasks, demonstrate the effectiveness of MH-MoE. |
| Researcher Affiliation | Industry | Xun Wu, Shaohan Huang, Wenhui Wang, Shuming Ma, Li Dong, Furu Wei Microsoft Research Asia xunwu@microsoft.com, shaohanh@microsoft.com, fuwei@microsoft.com |
| Pseudocode | Yes | As the PyTorch-like style pseudocode of MH-MoE shown in Appendix D (a hedged sketch of the routing idea follows the table). |
| Open Source Code | Yes | We provide the code for model training and testing in the supplementary materials. |
| Open Datasets | Yes | (1) English-focused experiments use the Red Pajama dataset [8], which is an open-source pre-training dataset; (2) multilingual tasks follow XLM [20] and use the multilingual Wikipedia as training data; (3) multimodal tasks use a masked multi-modality modeling task with a large dataset of images, documents, and image-text pairs. Further details are available in Appendix A. |
| Dataset Splits | No | The paper mentions evaluating on a 'validation dataset' and on 'test-dev', 'test-std', 'dev', and 'test-P' splits for different tasks, but does not explicitly provide the specific percentages or sample counts for how the training, validation, and test splits were created for its own experiments. |
| Hardware Specification | Yes | The pre-training procedure takes 14 days on 2 NVIDIA DGX-2 Stations. For Masked Multi-modal Modeling, we build Dense, X-MoE and MH-MoE following the same Transformer encoder architecture as BEiT v3 [37]. The pre-training procedure takes 4 days on 2 NVIDIA DGX-2 Stations. |
| Software Dependencies | No | The paper mentions using the GPT-4 vocabulary and SentencePiece for tokenization, and provides PyTorch-like pseudocode, but does not specify version numbers for Python, PyTorch, CUDA, or other software dependencies. |
| Experiment Setup | Yes | For English-focused Language Modeling and Multi-lingual Language Modeling, we construct Dense, X-MoE and MH-MoE using the Transformer [36] decoder (L = 12, H = 768, A = 12) with the GPT-4 vocabulary as the backbone architecture. For all three pre-training tasks, we set the head number h = 4. More details about architecture and training hyperparameters can be found in Appendix B and C. Appendix C (Table 10) lists specific hyperparameters: batch size 256, Adam optimizer, maximum learning rate 5e-4, 10,000 warmup steps, weight decay 0.01, Transformer dropout 0.1, and a load balancing coefficient of 1e-2 (a hedged configuration sketch follows the table). |
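
For reference, below is a minimal PyTorch sketch of the multi-head routing idea the paper describes: each token is split into sub-tokens along the hidden dimension, every sub-token is routed to an expert independently, and the expert outputs are merged back into full tokens. The class name `MHMoE`, the top-1 router, the expert FFN shapes, and the default `num_experts` value are illustrative assumptions, not the authors' released implementation (their own pseudocode is in Appendix D).

```python
# Minimal MH-MoE layer sketch (assumed top-1 routing; names are illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F


class MHMoE(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4, num_experts: int = 8, ffn_dim: int = 3072):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        # Multi-head projection applied before splitting tokens into sub-tokens.
        self.multi_head_proj = nn.Linear(dim, dim)
        # Merge projection applied after experts process the sub-tokens.
        self.merge_proj = nn.Linear(dim, dim)
        # Router scores each sub-token over the experts (top-1 here for brevity).
        self.router = nn.Linear(self.head_dim, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(self.head_dim, ffn_dim // num_heads),
                nn.GELU(),
                nn.Linear(ffn_dim // num_heads, self.head_dim),
            )
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        b, s, d = x.shape
        x = self.multi_head_proj(x)
        # Split each token into `num_heads` sub-tokens of size head_dim.
        sub = x.reshape(b * s * self.num_heads, self.head_dim)
        # Route every sub-token to its top-1 expert.
        gate = F.softmax(self.router(sub), dim=-1)     # (N, num_experts)
        weight, expert_idx = gate.max(dim=-1)           # (N,), (N,)
        out = torch.zeros_like(sub)
        for e, expert in enumerate(self.experts):
            mask = expert_idx == e
            if mask.any():
                out[mask] = weight[mask].unsqueeze(-1) * expert(sub[mask])
        # Merge sub-tokens back into full tokens and project.
        out = out.reshape(b, s, d)
        return self.merge_proj(out)
```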
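
The hyperparameters quoted from Appendix C (Table 10) can likewise be collected into a short configuration sketch. The linear warmup schedule and any optimizer arguments beyond those quoted in the table are assumptions for illustration only.

```python
# Configuration sketch using the reported hyperparameters; scheduler choice is assumed.
import torch

model = MHMoE(dim=768, num_heads=4)                     # head number h = 4 (sketch above)
optimizer = torch.optim.Adam(model.parameters(),
                             lr=5e-4,                    # maximum learning rate 5e-4
                             weight_decay=0.01)          # weight decay 0.01

warmup_steps = 10_000                                    # warmup steps 10,000
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps),  # assumed linear warmup
)

batch_size = 256                                         # batch size 256
transformer_dropout = 0.1                                # Transformer dropout 0.1
load_balancing_coefficient = 1e-2                        # weight on the auxiliary load-balancing loss
```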