Taming Sparsely Activated Transformer with Stochastic Experts
Authors: Simiao Zuo, Xiaodong Liu, Jian Jiao, Young Jin Kim, Hany Hassan, Ruofei Zhang, Jianfeng Gao, Tuo Zhao
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate the effectiveness of THOR on machine translation tasks. Results show that THOR models are more parameter efficient in that they significantly outperform the Transformer and MoE models across various settings. |
| Researcher Affiliation | Collaboration | Georgia Institute of Technology Microsoft {simiaozuo,tourzhao}@gatech.edu, {xiaodl,jian.jiao,youki,hanyh,bzhang,jfgao}@microsoft.com |
| Pseudocode | No | No section or figure explicitly labeled 'Pseudocode' or 'Algorithm' was found. |
| Open Source Code | Yes | Our code is publicly available at: https://github.com/microsoft/Stochastic-Mixture-of-Experts. |
| Open Datasets | Yes | We use six language pairs: English to Vietnamese, English to German, and English to French from IWSLT; English to Romanian, English to Latvian, and English to Czech from Europarl. Dataset statistics are summarized in Table 6 (Appendix B). [...] https://www.statmt.org/europarl [...] https://iwslt.org/ |
| Dataset Splits | Yes | Dataset statistics are summarized in Table 6 (Appendix B). [...] Table 6: Statistics of low resource translation datasets. [...] Validation set sizes across the six language pairs: 5,098 / 7,283 / 8,453 / 1,900 / 1,949 / 2,902 |
| Hardware Specification | Yes | All the experiments are conducted on NVIDIA V100 GPUs. |
| Software Dependencies | No | For low-resource and rich-resource translation, we train all the models using Fairseq (Ott et al., 2019). For multilingual translation, we use DeepSpeed MoE (Kim et al., 2021) to implement the MoE models. (No version numbers provided for Fairseq or DeepSpeed). |
| Experiment Setup | Yes | The regularization strength is chosen to be α = 5.0. We set the batch size to be equivalent to 32k tokens... We use Adam as the optimizer with β1 = 0.9, β2 = 0.98, and we set the learning rate to be 0.0015. We train the model for 40k steps, and we test the model that yields the highest validation BLEU. For validation and testing, we use a beam size of 5 and a length penalty of 1.0. |
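
The quoted experiment setup can be expressed compactly in code. The sketch below is not the authors' training script (which uses Fairseq / DeepSpeed MoE); it only illustrates the reported optimization and decoding hyperparameters, with `torch.nn.Transformer()` standing in as a placeholder model.

```python
# Minimal sketch of the reported training configuration, assuming a PyTorch model.
import torch

model = torch.nn.Transformer()  # placeholder for the THOR / Transformer model

# Adam with beta1 = 0.9, beta2 = 0.98 and learning rate 0.0015, as quoted above.
optimizer = torch.optim.Adam(model.parameters(), lr=0.0015, betas=(0.9, 0.98))

# Reported schedule and decoding settings (constants only; loop omitted):
MAX_STEPS = 40_000          # train for 40k steps
TOKENS_PER_BATCH = 32_000   # batch size equivalent to 32k tokens
CR_ALPHA = 5.0              # regularization strength alpha
BEAM_SIZE = 5               # beam search width for validation/testing
LENGTH_PENALTY = 1.0        # length penalty for beam search
```

The checkpoint with the highest validation BLEU is the one evaluated at test time, per the setup quoted in the table.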