On the Representation Collapse of Sparse Mixture of Experts

Authors: Zewen Chi, Li Dong, Shaohan Huang, Damai Dai, Shuming Ma, Barun Patra, Saksham Singhal, Payal Bajaj, Xia Song, Xian-Ling Mao, Heyan Huang, Furu Wei

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct extensive experiments on cross-lingual language model pre-training and fine-tuning on downstream tasks. Experimental results across seven multilingual benchmarks show that our method achieves consistent gains. We also present a comprehensive analysis on the representation and routing behaviors of our models.
Researcher Affiliation | Collaboration | Zewen Chi, Li Dong, Shaohan Huang, Damai Dai, Shuming Ma, Barun Patra, Saksham Singhal, Payal Bajaj, Xia Song, Xian-Ling Mao, Heyan Huang, Furu Wei. Affiliations: Beijing Institute of Technology, Microsoft Corporation, Peking University.
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. It describes the methods using mathematical equations and text.
Open Source Code | No | The paper does not contain any explicit statement about releasing source code for the described methodology, nor a link to a code repository.
Open Datasets | Yes | Pre-training Data: Following (Chi et al., 2021), we use the combination of CCNet (Wenzek et al., 2019) and Wikipedia dump as pre-training corpora. ... We conduct a downstream evaluation on seven widely-used cross-lingual understanding benchmarks from XTREME (Hu et al., 2020). Specifically, we conduct experiments on Universal Dependencies v2.5 part-of-speech tagging (Zeman et al., 2019), WikiAnn named entity recognition (Pan et al., 2017; Rahimi et al., 2019), natural language inference (XNLI; Conneau et al., 2018), paraphrase adversaries from word scrambling (PAWS-X; Yang et al., 2019), and question answering on MLQA (Lewis et al., 2020), XQuAD (Artetxe et al., 2020), and TyDiQA-GoldP (Clark et al., 2020).
Dataset Splits | Yes | We sample multilingual sentences from mC4 (Xue et al., 2020), and construct an MLM validation dataset that contains 65,536 sequences with lengths around 512. ... Among the benchmarks, we adopt the cross-lingual transfer setting, where the models are fine-tuned with the training data in English and evaluated in all target languages.
Hardware Specification | Yes | The pre-training procedure takes 2 days on 2 Nvidia DGX-2 Stations.
Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., Python 3.8, PyTorch 1.9). It mentions the Adam optimizer but names no framework or library versions.
Experiment Setup | Yes | Model Architecture and Hyperparameters: We construct our X-MoE models using the Transformer (Vaswani et al., 2017) encoder (L = 12, H = 768, A = 12) with the vocabulary provided by Conneau et al. (2020) as the backbone architecture. ... The routing dimension d_e is set as 16. The gating temperature τ_0 is set as 0.3 and 0.07 for the softmax gate and sigmoid gate, respectively. ... X-MoE models are pretrained with the Adam optimizer (β_1 = 0.9, β_2 = 0.98) using a batch size of 2,048 for 125K steps.
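For orientation, the sketch below collects the quoted hyperparameters into a configuration and shows one way a dimension-reduced, temperature-scaled routing score could be computed. Only the numeric values come from the setup quoted above; the names, the cosine-similarity routing function, and the toy usage are illustrative assumptions, not the authors' code (no implementation is released with the paper).

```python
import torch
import torch.nn.functional as F

# Numeric values quoted from the experiment setup above; the structure and
# names of this dictionary are an assumption for illustration only.
X_MOE_CONFIG = {
    "num_layers": 12,             # L
    "hidden_size": 768,           # H
    "num_attention_heads": 12,    # A
    "routing_dim": 16,            # d_e
    "temperature_softmax": 0.3,   # tau_0 for the softmax gate
    "temperature_sigmoid": 0.07,  # tau_0 for the sigmoid gate
    "adam_betas": (0.9, 0.98),
    "batch_size": 2048,
    "train_steps": 125_000,
}


def routing_scores(hidden, proj, expert_emb, temperature=0.3):
    """Hypothetical routing-score computation consistent with the quoted
    hyperparameters: project token representations into a low-dimensional
    routing space (d_e = 16), compare them to expert embeddings by cosine
    similarity, and sharpen the result with the gating temperature before
    the softmax gate.

    hidden:     (num_tokens, hidden_size)
    proj:       (hidden_size, routing_dim)
    expert_emb: (num_experts, routing_dim)
    """
    tokens = F.normalize(hidden @ proj, dim=-1)    # L2-normalized token routing vectors
    experts = F.normalize(expert_emb, dim=-1)      # L2-normalized expert embeddings
    logits = (tokens @ experts.t()) / temperature  # cosine similarity scaled by 1 / tau_0
    return logits.softmax(dim=-1)                  # per-token distribution over experts


if __name__ == "__main__":
    # Toy usage with random tensors, just to show the shapes involved.
    h = torch.randn(4, X_MOE_CONFIG["hidden_size"])
    W = torch.randn(X_MOE_CONFIG["hidden_size"], X_MOE_CONFIG["routing_dim"])
    E = torch.randn(32, X_MOE_CONFIG["routing_dim"])  # 32 experts, arbitrary choice
    probs = routing_scores(h, W, E, X_MOE_CONFIG["temperature_softmax"])
    print(probs.shape)  # torch.Size([4, 32])
```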