Magneto: A Foundation Transformer

Authors: Hongyu Wang, Shuming Ma, Shaohan Huang, Li Dong, Wenhui Wang, Zhiliang Peng, Yu Wu, Payal Bajaj, Saksham Singhal, Alon Benhaim, Barun Patra, Zhun Liu, Vishrav Chaudhary, Xia Song, Furu Wei

ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate its superior performance and better stability than the de facto Transformer variants designed for various applications, including language modeling (i.e., BERT, and GPT), machine translation, vision pretraining (i.e., BEiT), speech recognition, and multimodal pretraining (i.e., BEiT-3).
Researcher Affiliation | Collaboration | 1 University of Chinese Academy of Sciences, Beijing, China; 2 Microsoft.
Pseudocode | Yes | Figure 2. Top left: pseudocode of Sub-LN (see the illustrative sketch after this table).
Open Source Code | No | The paper references third-party open-source codebases (e.g., Fairseq, ESPnet) that were used for implementation, but it does not state that the authors' specific implementation of MAGNETO is open-source or provide a link to its repository.
Open Datasets | Yes | We start with a model that has the same model configuration as GPT-3 Medium (350M), and further scale its depth from 24L to 48L and 72L. The model is trained on an English-language corpus, which is a subset of the data from Liu et al. (2019) and the English portion of CC100 corpus. ... We pretrain all models on ImageNet-1k (Russakovsky et al., 2015) with 300 epochs schedule. ... We implement the proposed MAGNETO based on the open-source ESPnet repository (Watanabe et al., 2018) for speech recognition, and evaluate its performance on the LibriSpeech 960h (Panayotov et al., 2015) benchmark. ... We conduct experiments on multimodal pretraining following BEiT-3 (Wang et al., 2022b) and evaluate the model on downstream vision-language benchmarks, including VQA 2.0 (Goyal et al., 2017) and NLVR2 (Suhr et al., 2019).
Dataset Splits | Yes | We fine-tune the models on the GLUE (Wang et al., 2018) benchmarks. ... During testing, we select the checkpoint based on the performance of the validation set. ... We pretrain all models on ImageNet-1k (Russakovsky et al., 2015) ... We evaluate its performance on the LibriSpeech 960h (Panayotov et al., 2015) benchmark. ... We evaluate the model on downstream vision-language benchmarks, including VQA 2.0 (Goyal et al., 2017) and NLVR2 (Suhr et al., 2019).
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory, or cloud instance types) used for running the experiments.
Software Dependencies | No | The paper mentions using the Fairseq codebase and the open-source ESPnet repository, but does not specify version numbers for these or any other software dependencies.
Experiment Setup | Yes | More details regarding the hyperparameters can be found in the appendix. (Referring to Tables 9-17 in Appendix C, which list numerous hyperparameters such as layers, hidden size, FFN inner hidden size, attention heads, training updates, peak learning rate, batch size, Adam beta, learning rate schedule, warmup updates, gradient clipping, dropout, and weight decay.)
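
For reference, below is a minimal sketch of the Sub-LN idea that the paper's Figure 2 presents in pseudocode: in addition to the usual Pre-LN normalization before a sublayer's input projection, an extra LayerNorm is inserted before the output projection, all inside a residual connection. This is an illustrative PyTorch reconstruction, not the authors' released code; the module and parameter names (SubLNFeedForward, d_model, d_ffn) are assumptions, and the depth-dependent weight initialization that Figure 2 also specifies is omitted here.

```python
import torch
import torch.nn as nn


class SubLNFeedForward(nn.Module):
    """Feed-forward sublayer with Sub-LN: a LayerNorm before the input
    projection and an extra LayerNorm before the output projection."""

    def __init__(self, d_model: int, d_ffn: int):
        super().__init__()
        self.ln_in = nn.LayerNorm(d_model)   # LN before the input projection (as in Pre-LN)
        self.fc1 = nn.Linear(d_model, d_ffn)
        self.act = nn.GELU()
        self.ln_out = nn.LayerNorm(d_ffn)    # extra LN before the output projection
        self.fc2 = nn.Linear(d_ffn, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # residual branch: f_out(LN(phi(f_in(LN(x)))))
        return x + self.fc2(self.ln_out(self.act(self.fc1(self.ln_in(x)))))


# Example usage (hypothetical sizes):
# ffn = SubLNFeedForward(d_model=512, d_ffn=2048)
# y = ffn(torch.randn(2, 16, 512))
```

The attention sublayer follows the same pattern, with the extra LayerNorm placed immediately before the attention output projection.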