On the Representation Collapse of Sparse Mixture of Experts

Authors: Zewen Chi, Li Dong, Shaohan Huang, Damai Dai, Shuming Ma, Barun Patra, Saksham Singhal, Payal Bajaj, Xia Song, Xian-Ling Mao, Heyan Huang, Furu Wei

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct extensive experiments on cross-lingual language model pre-training and fine-tuning on downstream tasks. Experimental results across seven multilingual benchmarks show that our method achieves consistent gains. We also present a comprehensive analysis on the representation and routing behaviors of our models.
Researcher Affiliation | Collaboration | Zewen Chi, Li Dong, Shaohan Huang, Damai Dai, Shuming Ma, Barun Patra, Saksham Singhal, Payal Bajaj, Xia Song, Xian-Ling Mao, Heyan Huang, Furu Wei. Affiliations: Beijing Institute of Technology, Microsoft Corporation, Peking University.
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. It describes the methods using mathematical equations and text.
Open Source Code | No | The paper does not contain any explicit statement about releasing source code for the described methodology, nor a link to a code repository.
Open Datasets | Yes | Pre-training Data: Following (Chi et al., 2021), we use the combination of CCNet (Wenzek et al., 2019) and Wikipedia dump as pre-training corpora. ... We conduct a downstream evaluation on seven widely-used cross-lingual understanding benchmarks from XTREME (Hu et al., 2020). Specifically, we conduct experiments on Universal Dependencies v2.5 part-of-speech tagging (Zeman et al., 2019), WikiAnn named entity recognition (Pan et al., 2017; Rahimi et al., 2019), natural language inference (XNLI; Conneau et al., 2018), paraphrase adversaries from word scrambling (PAWS-X; Yang et al., 2019), and question answering on MLQA (Lewis et al., 2020), XQuAD (Artetxe et al., 2020), and TyDiQA-GoldP (Clark et al., 2020).
Dataset Splits | Yes | We sample multilingual sentences from mC4 (Xue et al., 2020), and construct an MLM validation dataset that contains 65,536 sequences with lengths around 512. ... Among the benchmarks, we adopt the cross-lingual transfer setting, where the models are fine-tuned with the training data in English and evaluated in all target languages.
Hardware Specification | Yes | The pre-training procedure takes 2 days on 2 Nvidia DGX-2 Stations.
Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., Python 3.8, PyTorch 1.9). It mentions the Adam optimizer but names no framework or library versions.
Experiment Setup | Yes | Model Architecture and Hyperparameters: We construct our X-MoE models using the Transformer (Vaswani et al., 2017) encoder (L = 12, H = 768, A = 12) with the vocabulary provided by Conneau et al. (2020) as the backbone architecture. ... The routing dimension d_e is set as 16. The gating temperature τ_0 is set as 0.3 and 0.07 for the softmax gate and sigmoid gate, respectively. ... X-MoE models are pretrained with the Adam optimizer (β_1 = 0.9, β_2 = 0.98) using a batch size of 2,048 for 125K steps.
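For orientation, the sketch below collects the quoted hyperparameters into a configuration and shows one way a dimension-reduced, temperature-scaled routing score could be computed. Only the numeric values come from the setup quoted above; the names, the cosine-similarity routing function, and the toy usage are illustrative assumptions, not the authors' code (no implementation is released with the paper).

```python
import torch
import torch.nn.functional as F

# Numeric values quoted from the experiment setup above; the structure and
# names of this dictionary are an assumption for illustration only.
X_MOE_CONFIG = {
    "num_layers": 12,             # L
    "hidden_size": 768,           # H
    "num_attention_heads": 12,    # A
    "routing_dim": 16,            # d_e
    "temperature_softmax": 0.3,   # tau_0 for the softmax gate
    "temperature_sigmoid": 0.07,  # tau_0 for the sigmoid gate
    "adam_betas": (0.9, 0.98),
    "batch_size": 2048,
    "train_steps": 125_000,
}


def routing_scores(hidden, proj, expert_emb, temperature=0.3):
    """Hypothetical routing-score computation consistent with the quoted
    hyperparameters: project token representations into a low-dimensional
    routing space (d_e = 16), compare them to expert embeddings by cosine
    similarity, and sharpen the result with the gating temperature before
    the softmax gate.

    hidden:     (num_tokens, hidden_size)
    proj:       (hidden_size, routing_dim)
    expert_emb: (num_experts, routing_dim)
    """
    tokens = F.normalize(hidden @ proj, dim=-1)    # L2-normalized token routing vectors
    experts = F.normalize(expert_emb, dim=-1)      # L2-normalized expert embeddings
    logits = (tokens @ experts.t()) / temperature  # cosine similarity scaled by 1 / tau_0
    return logits.softmax(dim=-1)                  # per-token distribution over experts


if __name__ == "__main__":
    # Toy usage with random tensors, just to show the shapes involved.
    h = torch.randn(4, X_MOE_CONFIG["hidden_size"])
    W = torch.randn(X_MOE_CONFIG["hidden_size"], X_MOE_CONFIG["routing_dim"])
    E = torch.randn(32, X_MOE_CONFIG["routing_dim"])  # 32 experts, arbitrary choice
    probs = routing_scores(h, W, E, X_MOE_CONFIG["temperature_softmax"])
    print(probs.shape)  # torch.Size([4, 32])
```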