Depth-Adaptive Transformer
Authors: Maha Elbayad, Jiatao Gu, Edouard Grave, Michael Auli
Venue: ICLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on IWSLT14 German-English translation as well as WMT 14 English-French translation show that we can match the performance of well tuned baseline models at up to 76% less computation (§4). |
| Researcher Affiliation | Collaboration | Maha Elbayad (Univ. Grenoble Alpes); Jiatao Gu, Edouard Grave, Michael Auli (Facebook AI Research) |
| Pseudocode | Yes | Algorithm 2: Adaptive decoding with Tok-geometric-like (an illustrative sketch of this exit rule follows the table) |
| Open Source Code | No | The paper states that models are implemented in fairseq, a general-purpose sequence modeling toolkit, but it does not provide an explicit statement or link for a released implementation of the depth-adaptive models themselves. |
| Open Datasets | Yes | IWSLT 14 German to English (De-En). We use the setup of Edunov et al. (2018) and train on 160K sentence pairs. [...] WMT 14 English to French (En-Fr). We also experiment on the much larger WMT 14 English French task comprising 35.5m training sentence pairs. |
| Dataset Splits | Yes | IWSLT 14 German to English (De-En). We use the setup of Edunov et al. (2018) and train on 160K sentence pairs. [...] WMT 14 English to French (En-Fr). We develop on 26k held out pairs and test on newstest14. |
| Hardware Specification | No | The paper mentions training on '128 GPUs' and '2 GPUs' but does not specify the model or type of GPUs, or any other specific hardware components like CPUs or memory. |
| Software Dependencies | No | The paper mentions that 'Models are implemented in fairseq (Ott et al., 2019) and are trained with Adam (Kingma & Ba, 2015)' but does not provide version numbers for fairseq or for other software dependencies such as Python or PyTorch. |
| Experiment Setup | Yes | We use N = 6 blocks, a feed-forward network (ffn) of intermediate dimension 1024, 4 heads, dropout 0.3, embedding dimension d_enc = 512 for the encoder and d_dec = 256 for the decoder. Embeddings are untied with 6 different output classifiers. We evaluate with a single checkpoint and a beam of width 5. [...] We train for 50k updates on 128 GPUs with a batch size of 460k tokens for WMT 14 En-Fr and on 2 GPUs with 8k tokens per batch for IWSLT 14 De-En. To stabilize training, we re-normalize the gradients if the norm exceeds g_clip = 3. (See the training-step sketch below.) |
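To make the reported setup concrete, here is a minimal sketch that collects the hyperparameters quoted in the Experiment Setup row and shows the gradient re-normalization step (g_clip = 3) in PyTorch. The `config` keys and the `train_step` helper are illustrative names, not the authors' fairseq code.

```python
import torch

# Hyperparameters as reported in the paper's experiment setup.
config = {
    "decoder_blocks": 6,        # N = 6
    "ffn_dim": 1024,            # feed-forward intermediate dimension
    "attention_heads": 4,
    "dropout": 0.3,
    "encoder_embed_dim": 512,   # d_enc
    "decoder_embed_dim": 256,   # d_dec
    "clip_norm": 3.0,           # g_clip
    "beam_size": 5,
}

def train_step(model, loss, optimizer, clip_norm=config["clip_norm"]):
    """One optimization step with gradient re-normalization (illustrative)."""
    optimizer.zero_grad()
    loss.backward()
    # Re-normalize gradients if their global norm exceeds g_clip,
    # matching the stabilization trick reported in the paper.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=clip_norm)
    optimizer.step()
```

In fairseq, this re-normalization corresponds to the `--clip-norm` option of `fairseq-train`.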
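The Algorithm 2 referenced in the Pseudocode row describes token-level adaptive decoding with the geometric-like halting classifier: each decoder block emits a halting probability, and the model exits (and predicts the token) at the first block whose probability clears a per-block threshold, falling back to the last block otherwise. The sketch below illustrates that exit rule under assumed names (`blocks`, `halt_units`, `classifiers`, `tau`); it omits encoder attention and incremental decoding state and is not the authors' fairseq implementation.

```python
import torch

def adaptive_decode_position(x, blocks, halt_units, classifiers, tau):
    """Early-exit rule for one target position (Tok-geometric-like style).

    x           : hidden state for the current position, shape (1, d)
    blocks      : decoder blocks (simplified here to callables on x alone)
    halt_units  : per-block linear layers producing a scalar halting score
    classifiers : per-block output projections (the paper unties 6 of them)
    tau         : per-block exit thresholds
    Returns the token logits and the depth at which the model exited.
    """
    n_blocks = len(blocks)
    for n in range(n_blocks):
        x = blocks[n](x)                                 # hidden state after block n+1
        p_halt = torch.sigmoid(halt_units[n](x)).item()  # halting probability
        # Exit at the first block whose halting probability exceeds its
        # threshold; otherwise fall through to the final block.
        if p_halt > tau[n] or n == n_blocks - 1:
            return classifiers[n](x), n + 1
```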