Taming Sparsely Activated Transformer with Stochastic Experts

Authors: Simiao Zuo, Xiaodong Liu, Jian Jiao, Young Jin Kim, Hany Hassan, Ruofei Zhang, Jianfeng Gao, Tuo Zhao

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We validate the effectiveness of THOR on machine translation tasks. Results show that THOR models are more parameter efficient in that they significantly outperform the Transformer and MoE models across various settings.
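The core THOR idea summarized in this row (replacing a learned MoE gating network with uniform random expert selection, plus a consistency regularizer between sampled experts) can be sketched as follows. The class and function names are illustrative only, not taken from the released code, and the squared-difference penalty stands in for the divergence term used in the paper:

```python
import random

class StochasticExpertLayer:
    """Illustrative sketch of THOR-style stochastic routing.

    Instead of a learned gating network (as in a standard MoE layer),
    each forward pass samples one expert uniformly at random, so no
    router parameters or load-balancing loss are needed.
    """

    def __init__(self, experts, seed=None):
        self.experts = list(experts)   # e.g. feed-forward sublayers
        self.rng = random.Random(seed)

    def forward(self, x):
        # Uniform random routing over the experts.
        expert = self.rng.choice(self.experts)
        return expert(x)

def consistency_penalty(layer, x, alpha=5.0):
    """Toy consistency regularizer for scalar outputs: pushes two
    independently sampled experts toward agreeing on the same input.
    alpha=5.0 matches the regularization strength reported below."""
    y1, y2 = layer.forward(x), layer.forward(x)
    return alpha * (y1 - y2) ** 2
```

Because routing is random rather than learned, all experts see gradient updates over the course of training, which is the source of the parameter-efficiency claim quoted above.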
Researcher Affiliation | Collaboration | Georgia Institute of Technology; Microsoft. {simiaozuo,tourzhao}@gatech.edu, {xiaodl,jian.jiao,youki,hanyh,bzhang,jfgao}@microsoft.com
Pseudocode | No | No section or figure explicitly labeled 'Pseudocode' or 'Algorithm' was found.
Open Source Code | Yes | Our code is publicly available at: https://github.com/microsoft/Stochastic-Mixture-of-Experts
Open Datasets | Yes | We use six language pairs: English to Vietnamese, English to German, and English to French from IWSLT (https://iwslt.org/); English to Romanian, English to Latvian, and English to Czech from Europarl (https://www.statmt.org/europarl). Dataset statistics are summarized in Table 6 (Appendix B).
Dataset Splits | Yes | Dataset statistics are summarized in Table 6 (Appendix B). [...] Table 6: Statistics of low resource translation datasets. [...] Validation: 5,098 / 7,283 / 8,453 / 1,900 / 1,949 / 2,902 (one count per language pair).
Hardware Specification | Yes | All the experiments are conducted on NVIDIA V100 GPUs.
Software Dependencies | No | For low-resource and rich-resource translation, we train all the models using Fairseq (Ott et al., 2019). For multilingual translation, we use DeepSpeed MoE (Kim et al., 2021) to implement the MoE models. (No version numbers are provided for Fairseq or DeepSpeed.)
Experiment Setup | Yes | The regularization strength is chosen to be α = 5.0. We set the batch size to be equivalent to 32k tokens... We use Adam as the optimizer with β1 = 0.9, β2 = 0.98, and we set the learning rate to 0.0015. We train the model for 40k steps, and we test the checkpoint that yields the highest validation BLEU. For validation and testing, we use a beam size of 5 and a length penalty of 1.0.