Taming Sparsely Activated Transformer with Stochastic Experts

Authors: Simiao Zuo, Xiaodong Liu, Jian Jiao, Young Jin Kim, Hany Hassan, Ruofei Zhang, Jianfeng Gao, Tuo Zhao

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We validate the effectiveness of THOR on machine translation tasks. Results show that THOR models are more parameter efficient in that they significantly outperform the Transformer and MoE models across various settings.
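The core THOR idea summarized in this row (replacing a learned MoE gating network with uniform random expert selection, plus a consistency regularizer between sampled experts) can be sketched as follows. The class and function names are illustrative only, not taken from the released code, and the squared-difference penalty stands in for the divergence term used in the paper:

```python
import random

class StochasticExpertLayer:
    """Illustrative sketch of THOR-style stochastic routing.

    Instead of a learned gating network (as in a standard MoE layer),
    each forward pass samples one expert uniformly at random, so no
    router parameters or load-balancing loss are needed.
    """

    def __init__(self, experts, seed=None):
        self.experts = list(experts)   # e.g. feed-forward sublayers
        self.rng = random.Random(seed)

    def forward(self, x):
        # Uniform random routing over the experts.
        expert = self.rng.choice(self.experts)
        return expert(x)

def consistency_penalty(layer, x, alpha=5.0):
    """Toy consistency regularizer for scalar outputs: pushes two
    independently sampled experts toward agreeing on the same input.
    alpha=5.0 matches the regularization strength reported below."""
    y1, y2 = layer.forward(x), layer.forward(x)
    return alpha * (y1 - y2) ** 2
```

Because routing is random rather than learned, all experts see gradient updates over the course of training, which is the source of the parameter-efficiency claim quoted above.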
Researcher Affiliation | Collaboration | Georgia Institute of Technology; Microsoft. {simiaozuo,tourzhao}@gatech.edu, {xiaodl,jian.jiao,youki,hanyh,bzhang,jfgao}@microsoft.com
Pseudocode | No | No section or figure explicitly labeled 'Pseudocode' or 'Algorithm' was found.
Open Source Code | Yes | Our code is publicly available at: https://github.com/microsoft/Stochastic-Mixture-of-Experts
Open Datasets | Yes | We use six language pairs: English to Vietnamese, English to German, and English to French from IWSLT (https://iwslt.org/); English to Romanian, English to Latvian, and English to Czech from Europarl (https://www.statmt.org/europarl). Dataset statistics are summarized in Table 6 (Appendix B).
Dataset Splits | Yes | Dataset statistics are summarized in Table 6 (Appendix B). [...] Table 6: Statistics of low resource translation datasets. [...] Validation: 5,098 / 7,283 / 8,453 / 1,900 / 1,949 / 2,902 (one count per language pair).
Hardware Specification | Yes | All the experiments are conducted on NVIDIA V100 GPUs.
Software Dependencies | No | For low-resource and rich-resource translation, we train all the models using Fairseq (Ott et al., 2019). For multilingual translation, we use DeepSpeed MoE (Kim et al., 2021) to implement the MoE models. (No version numbers are provided for Fairseq or DeepSpeed.)
Experiment Setup | Yes | The regularization strength is chosen to be α = 5.0. We set the batch size to be equivalent to 32k tokens... We use Adam as the optimizer with β1 = 0.9, β2 = 0.98, and we set the learning rate to 0.0015. We train the model for 40k steps, and we test the checkpoint that yields the highest validation BLEU. For validation and testing, we use a beam size of 5 and a length penalty of 1.0.