Depth-Adaptive Transformer

Authors: Maha Elbayad, Jiatao Gu, Edouard Grave, Michael Auli

ICLR 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on IWSLT14 German-English translation as well as WMT 14 English-French translation show that we can match the performance of well tuned baseline models at up to 76% less computation (§4).
Researcher Affiliation | Collaboration | Maha Elbayad (Univ. Grenoble Alpes); Jiatao Gu, Edouard Grave, Michael Auli (Facebook AI Research)
Pseudocode | Yes | Algorithm 2: "Adaptive decoding with Tok-geometric-like" (see the decoding sketch after this table)
Open Source Code | No | The paper states that models are implemented in fairseq, a third-party toolkit, but it does not provide an explicit statement about, or link to, the authors' own implementation code.
Open Datasets | Yes | IWSLT 14 German to English (De-En). We use the setup of Edunov et al. (2018) and train on 160K sentence pairs. [...] WMT 14 English to French (En-Fr). We also experiment on the much larger WMT 14 English-French task comprising 35.5M training sentence pairs.
Dataset Splits | Yes | IWSLT 14 German to English (De-En). We use the setup of Edunov et al. (2018) and train on 160K sentence pairs. [...] WMT 14 English to French (En-Fr). We develop on 26k held out pairs and test on newstest14.
Hardware Specification | No | The paper mentions training on '128 GPUs' and '2 GPUs' but does not specify the model or type of GPUs, or any other specific hardware components such as CPUs or memory.
Software Dependencies | No | The paper mentions that 'Models are implemented in fairseq (Ott et al., 2019) and are trained with Adam (Kingma & Ba, 2015)' but does not provide version numbers for fairseq or any other software dependencies such as Python or PyTorch.
Experiment Setup | Yes | We use N = 6 blocks, a feed-forward network (ffn) of intermediate dimension 1024, 4 heads, dropout 0.3, embedding dimension denc = 512 for the encoder and ddec = 256 for the decoder. Embeddings are untied with 6 different output classifiers. We evaluate with a single checkpoint and a beam of width 5. [...] We train for 50k updates on 128 GPUs with a batch size of 460k tokens for WMT 14 En-Fr and on 2 GPUs with 8k tokens per batch for IWSLT 14 De-En. To stabilize training, we re-normalize the gradients if the norm exceeds gclip = 3. (see the optimization sketch after this table)
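
For the pseudocode row above, here is a minimal, hedged sketch of token-level adaptive-depth decoding in the spirit of Algorithm 2 ("Adaptive decoding with Tok-geometric-like"): a halting classifier after each decoder block emits an exit probability, and a token exits at the first block whose probability clears a threshold, otherwise at the last block. This is not the authors' fairseq implementation; it covers only the decode-time exit rule (not the training of the halting distribution), the class and attribute names (AdaptiveDepthDecoder, halt, exit_heads, tau) are illustrative, and the default dimensions simply mirror the decoder settings quoted in the experiment-setup row.

```python
# Illustrative sketch only: per-token early exit over a stack of decoder blocks.
import torch
import torch.nn as nn


class AdaptiveDepthDecoder(nn.Module):
    def __init__(self, d_model=256, n_heads=4, ffn_dim=1024,
                 n_blocks=6, vocab_size=10000, tau=0.5):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.TransformerDecoderLayer(d_model, n_heads, ffn_dim,
                                       dropout=0.3, batch_first=True)
            for _ in range(n_blocks)
        )
        # one halting classifier and one (untied) output classifier per block
        self.halt = nn.ModuleList(nn.Linear(d_model, 1) for _ in range(n_blocks))
        self.exit_heads = nn.ModuleList(nn.Linear(d_model, vocab_size)
                                        for _ in range(n_blocks))
        self.tau = tau  # exit threshold (an assumed hyperparameter here)

    @torch.no_grad()
    def decode_step(self, tgt_states, memory):
        """Return next-token logits for the last position and the depth used.

        Assumes batch size 1 (single-sentence greedy decoding) so the exit
        decision is a scalar test; batching would require masking tokens
        that have already exited.
        """
        h = tgt_states
        for n, block in enumerate(self.blocks):
            h = block(h, memory)
            exit_prob = torch.sigmoid(self.halt[n](h[:, -1]))
            if n == len(self.blocks) - 1 or exit_prob.item() >= self.tau:
                return self.exit_heads[n](h[:, -1]), n + 1


# Toy usage with random tensors (encoder dim kept equal to the decoder dim
# for simplicity; the paper uses d_enc = 512 and d_dec = 256).
decoder = AdaptiveDepthDecoder()
memory = torch.randn(1, 12, 256)   # encoder states: (batch, src_len, d_model)
prefix = torch.randn(1, 3, 256)    # embedded target prefix
logits, depth = decoder.decode_step(prefix, memory)
print(logits.shape, depth)         # torch.Size([1, 10000]) and a depth in 1..6
```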
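For the experiment-setup row, a minimal sketch of the quoted optimization details: Adam together with re-normalizing gradients whose norm exceeds gclip = 3, which corresponds to PyTorch's clip_grad_norm_. The tiny stand-in model, random batch, loss, and learning rate below are placeholders rather than values from the paper.

```python
# Illustrative sketch only: Adam plus gradient re-normalization at g_clip = 3.
import torch
import torch.nn as nn

model = nn.Linear(256, 256)                                 # stand-in for the translation model
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)   # learning rate is an assumption
g_clip = 3.0

for step in range(10):                                      # toy loop; the paper trains for 50k updates
    x = torch.randn(32, 256)
    loss = model(x).pow(2).mean()                           # placeholder loss
    optimizer.zero_grad()
    loss.backward()
    # Rescales gradients so their total norm is at most g_clip, matching
    # "re-normalize the gradients if the norm exceeds gclip = 3".
    torch.nn.utils.clip_grad_norm_(model.parameters(), g_clip)
    optimizer.step()
```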