Deep Transformers with Latent Depth

Authors: Xian Li, Asa Cooper Stickland, Yuqing Tang, Xiang Kong

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate on WMT English-German machine translation and masked language modeling tasks, where our method outperforms existing approaches for training deeper Transformers. Experiments on multilingual machine translation demonstrate that this approach can effectively leverage increased model capacity and bring universal improvement for both many-to-one and one-to-many translation with diverse language pairs.
Researcher Affiliation | Collaboration | 1 Facebook AI {xianl, yuqtang, xiangk}@fb.com; 2 University of Edinburgh {a.cooper.stickland}@ed.ac.uk
Pseudocode | Yes | Algorithm 1: Training with Latent Layers (see the sketch below the table)
Open Source Code | No | The paper does not contain an explicit statement or a link providing access to the source code for the methodology described.
Open Datasets | Yes | We use the same preprocessed WMT 16 English-German sentence pairs as is used in [30, 31]. In particular we use as training data the Wikipedia text of the 25 languages used in the mBART [17] model. We evaluate the proposed approach on multilingual machine translation using the 58-language TED corpus [23].
Dataset Splits | Yes | Next, we compared the learning curves when training deeper models. As is shown in Figure 3 (evaluated on the multilingual translation task, O2M-Diverse dataset), the baseline model with static depth diverged for a 24-layer decoder, while using the latent layers (LL-D) approach we could train both 24-layer and 100-layer decoders successfully. ...they achieve lower validation loss, with a smaller gap between train and validation losses.
Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments (e.g., GPU models, CPU types, or cloud instance specifications).
Software Dependencies | No | The paper mentions tools like 'sacreBLEU [22]' and 'fairseq [18]' but does not specify their version numbers or other specific software dependencies with versions.
Experiment Setup | Yes | We use beam size 5 and length penalty 1.0 in decoding and report corpus-level BLEU with sacreBLEU [22]. We compared the learning curves when training deeper models. As is shown in Figure 3 (evaluated on the multilingual translation task, O2M-Diverse dataset), the baseline model with static depth diverged for a 24-layer decoder, while using the latent layers (LL-D) approach we could train both 24-layer and 100-layer decoders successfully. By increasing the contribution from the D_KL term to the total loss, layer selections are more evenly spread out across languages, i.e. u_l becomes more uniform. This is also reflected in Table 6, where the "effective depth" E_L increases with β.
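
The quoted Algorithm 1 and the β-weighted D_KL term are not reproduced in this summary, so the following is only a minimal PyTorch sketch of the general idea: a learnable "use this layer" gate per layer, relaxed with Gumbel-softmax, plus a KL penalty toward a uniform prior whose weight β controls how evenly layers are selected. The class name, the single shared gate per layer (the paper learns selections per language pair), and the stand-in layers are assumptions made here for illustration, not the authors' fairseq implementation.

```python
# Minimal sketch, assuming a relaxed (Gumbel-softmax) binary gate per layer and a
# beta-weighted KL term toward a uniform prior; NOT the authors' fairseq code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentDepthStack(nn.Module):
    def __init__(self, layers, tau=1.0):
        super().__init__()
        self.layers = nn.ModuleList(layers)                       # stand-ins for Transformer layers
        self.logits = nn.Parameter(torch.zeros(len(layers), 2))   # [skip, use] logits per layer
        self.tau = tau                                            # Gumbel-softmax temperature

    def forward(self, x, hard=False):
        # Sample a relaxed selection z_l in [0, 1] for every layer, then mix the
        # layer output with the identity path (i.e. skipping the layer).
        z = F.gumbel_softmax(self.logits, tau=self.tau, hard=hard)[:, 1]
        for l, layer in enumerate(self.layers):
            x = z[l] * layer(x) + (1.0 - z[l]) * x
        return x

    def kl_to_uniform(self):
        # D_KL(q(selection) || uniform prior over {skip, use}); scaled by beta in the total loss.
        q = F.softmax(self.logits, dim=-1)
        return (q * (F.log_softmax(self.logits, dim=-1) - torch.log(torch.tensor(0.5)))).sum()

    def effective_depth(self):
        # Expected number of selected layers (the "effective depth" mentioned above).
        return F.softmax(self.logits, dim=-1)[:, 1].sum()

# Usage sketch: 24 stand-in layers; total loss = task loss + beta * KL.
layers = [nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True) for _ in range(24)]
stack = LatentDepthStack(layers)
x = torch.randn(2, 10, 512)                                   # (batch, sequence, model dim)
out = stack(x)
beta = 0.1                                                    # hypothetical value, not from the paper
loss = out.pow(2).mean() + beta * stack.kl_to_uniform()       # dummy task loss for illustration
loss.backward()
```

A larger β makes the KL term dominate, pushing every layer's selection probability toward 0.5 and hence the usage more uniform, which is consistent with the reported increase in effective depth with β.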
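
For the decoding and scoring setup quoted in the last row (beam size 5, length penalty 1.0, corpus-level BLEU with sacreBLEU), a minimal scoring sketch could look as follows; the file names and the fairseq command in the comments are placeholders assumed here, not taken from the paper.

```python
# Minimal sketch of corpus-level BLEU scoring with sacreBLEU; hypothesis/reference
# file names are placeholders, and the fairseq command below is only an assumed way
# to obtain hypotheses with beam 5 and length penalty 1.0.
import sacrebleu

def corpus_bleu_from_files(hyp_path: str, ref_path: str) -> float:
    with open(hyp_path, encoding="utf-8") as f:
        hyps = [line.strip() for line in f]
    with open(ref_path, encoding="utf-8") as f:
        refs = [line.strip() for line in f]
    # sacreBLEU takes the system outputs and a list of reference streams.
    return sacrebleu.corpus_bleu(hyps, [refs]).score

# Hypotheses could be produced with something along the lines of:
#   fairseq-generate data-bin --path checkpoint.pt --beam 5 --lenpen 1.0
print(corpus_bleu_from_files("hyps.txt", "refs.txt"))
```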