Multi-Head Mixture-of-Experts

Authors: Xun Wu, Shaohan Huang, Wenhui Wang, Shuming Ma, Li Dong, Furu Wei

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experimental results across different parameter scales (300M to 7B) and three pre-training tasks (English-focused language modeling, multi-lingual language modeling, and masked multi-modality modeling), along with multiple downstream validation tasks, demonstrate the effectiveness of MH-MoE.
Researcher Affiliation | Industry | Xun Wu, Shaohan Huang, Wenhui Wang, Shuming Ma, Li Dong, Furu Wei; Microsoft Research Asia; xunwu@microsoft.com, shaohanh@microsoft.com, fuwei@microsoft.com
Pseudocode | Yes | PyTorch-like pseudocode for MH-MoE is given in Appendix D (a minimal illustrative sketch is included after this table).
Open Source Code | Yes | We provide the code for model training and testing in the supplementary materials.
Open Datasets | Yes | (1) English-focused experiments use the RedPajama dataset [8], an open-source pre-training dataset; (2) multilingual tasks follow XLM [20] and use multilingual Wikipedia as training data; (3) multimodal tasks use a masked multi-modality modeling objective over a large corpus of images, documents, and image-text pairs. Further details are available in Appendix A.
Dataset Splits | No | The paper mentions evaluating on a 'validation dataset' and on 'test-dev', 'test-std', 'dev', and 'test-P' splits for different tasks, but does not explicitly state the percentages or sample counts used to create the training, validation, and test splits for its own experiments.
Hardware Specification | Yes | The pre-training procedure takes 14 days on 2 NVIDIA DGX-2 Stations. For masked multi-modal modeling, Dense, X-MoE, and MH-MoE are built following the same Transformer encoder architecture as BEiT v3 [37]; that pre-training procedure takes 4 days on 2 NVIDIA DGX-2 Stations.
Software Dependencies | No | The paper mentions using the GPT-4 vocabulary and SentencePiece for tokenization, and provides PyTorch-like pseudocode, but does not specify version numbers for Python, PyTorch, CUDA, or other software dependencies.
Experiment Setup | Yes | For English-focused language modeling and multi-lingual language modeling, Dense, X-MoE, and MH-MoE are constructed using the Transformer [36] decoder (L = 12, H = 768, A = 12) with the GPT-4 vocabulary as the backbone architecture. For all three pre-training tasks, the head number is set to h = 4. Further architecture and training details are given in Appendix B and C; Appendix C (Table 10) lists batch size 256, Adam optimizer, maximum learning rate 5e-4, 10,000 warmup steps, weight decay 0.01, Transformer dropout 0.1, and a load-balancing coefficient of 1e-2 (a hedged configuration sketch also follows the table).
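
The pseudocode row above points to the authors' PyTorch-like listing in Appendix D, which is not reproduced here. The following is a minimal, illustrative PyTorch sketch of the MH-MoE idea it describes: project each token, split it into h sub-tokens, route each sub-token to an expert independently, then merge the sub-tokens back. The class name MHMoE, the expert FFN sizes, the top-1 routing, and the simple dispatch loop are assumptions made for readability, not the authors' implementation.

# Minimal, illustrative sketch of a Multi-Head MoE (MH-MoE) layer.
# NOT the authors' Appendix D code; shapes, names (MHMoE, n_experts),
# top-1 routing, and the per-expert loop are assumptions for clarity.
import torch
import torch.nn as nn


class MHMoE(nn.Module):
    def __init__(self, d_model=768, n_heads=4, n_experts=8, d_ffn=3072):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_sub = d_model // n_heads              # sub-token dimension
        self.multi_head = nn.Linear(d_model, d_model)  # split projection
        self.merge = nn.Linear(d_model, d_model)       # merge projection
        self.router = nn.Linear(self.d_sub, n_experts, bias=False)
        # Each expert is a small FFN operating on sub-tokens.
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(self.d_sub, d_ffn // n_heads),
                nn.GELU(),
                nn.Linear(d_ffn // n_heads, self.d_sub),
            )
            for _ in range(n_experts)
        )

    def forward(self, x):                            # x: (batch, seq, d_model)
        B, S, D = x.shape
        # 1) Project and split every token into n_heads sub-tokens.
        sub = self.multi_head(x).reshape(B * S * self.n_heads, self.d_sub)
        # 2) Top-1 routing of each sub-token to an expert.
        probs = self.router(sub).softmax(dim=-1)     # (B*S*h, n_experts)
        gate, idx = probs.max(dim=-1)                # gate value and expert id
        # 3) Dispatch sub-tokens to their experts (simple loop for clarity).
        out = torch.zeros_like(sub)
        for e, expert in enumerate(self.experts):
            mask = idx == e
            if mask.any():
                out[mask] = expert(sub[mask])
        out = out * gate.unsqueeze(-1)               # scale by gate value
        # 4) Merge sub-tokens back into full tokens.
        return self.merge(out.reshape(B, S, D))

Because routing happens at the sub-token level, a single token can activate several experts at once, which is the mechanism the paper credits for denser expert activation and finer-grained understanding.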
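
To make the quoted hyperparameters concrete, here is a hedged configuration and optimizer sketch assembled from the values reported in Appendix C (Table 10). The dataclass name, the linear-warmup LambdaLR schedule, and the absence of any post-warmup decay are assumptions; the paper only states the values themselves.

# Hedged sketch of a training configuration using the quoted hyperparameters.
# Names (MHMoETrainConfig, build_optimizer) and the warmup schedule are assumed.
from dataclasses import dataclass

import torch


@dataclass
class MHMoETrainConfig:
    batch_size: int = 256
    max_lr: float = 5e-4
    warmup_steps: int = 10_000
    weight_decay: float = 0.01
    dropout: float = 0.1
    load_balancing_coeff: float = 1e-2   # weight of the auxiliary balance loss
    n_heads: int = 4                     # head number h used in all three tasks


def build_optimizer(model: torch.nn.Module, cfg: MHMoETrainConfig):
    """Adam with linear warmup to the maximum learning rate (schedule assumed)."""
    optimizer = torch.optim.Adam(
        model.parameters(), lr=cfg.max_lr, weight_decay=cfg.weight_decay
    )
    warmup = torch.optim.lr_scheduler.LambdaLR(
        optimizer, lr_lambda=lambda step: min(1.0, (step + 1) / cfg.warmup_steps)
    )
    return optimizer, warmup

In training, the load-balancing coefficient would weight an auxiliary balance loss added to the task loss, i.e. loss = task_loss + cfg.load_balancing_coeff * balance_loss, which is the standard way such a coefficient is applied in MoE training.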