Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Depth-Adaptive Transformer
Authors: Maha Elbayad, Jiatao Gu, Edouard Grave, Michael Auli
ICLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on IWSLT14 German-English translation as well as WMT 14 English-French translation show that we can match the performance of well tuned baseline models at up to 76% less computation (§4). |
| Researcher Affiliation | Collaboration | Maha Elbayad Univ. Grenoble Alpes Jiatao Gu, Edouard Grave, Michael Auli Facebook AI Research |
| Pseudocode | Yes | Algorithm 2 Adaptive decoding with Tok-geometric-like |
| Open Source Code | No | The paper states models are implemented in fairseq, which is a third-party toolkit, but it does not provide an explicit statement or link for the authors' own implementation code. |
| Open Datasets | Yes | IWSLT 14 German to English (De-En). We use the setup of Edunov et al. (2018) and train on 160K sentence pairs. [...] WMT 14 English to French (En-Fr). We also experiment on the much larger WMT 14 English French task comprising 35.5m training sentence pairs. |
| Dataset Splits | Yes | IWSLT 14 German to English (De-En). We use the setup of Edunov et al. (2018) and train on 160K sentence pairs. [...] WMT 14 English to French (En-Fr). We develop on 26k held out pairs and test on newstest14. |
| Hardware Specification | No | The paper mentions training on '128 GPUs' and '2 GPUs' but does not specify the model or type of GPUs, or any other specific hardware components like CPUs or memory. |
| Software Dependencies | No | The paper mentions that 'Models are implemented in fairseq (Ott et al., 2019) and are trained with Adam (Kingma & Ba, 2015)' but does not provide version numbers for fairseq, Adam, or any other software dependencies like Python or PyTorch. |
| Experiment Setup | Yes | We use N = 6 blocks, a feed-forward network (ffn) of intermediate-dimension 1024, 4 heads, dropout 0.3, embedding dimension $d_{\text{enc}} = 512$ for the encoder and $d_{\text{dec}} = 256$ for the decoder. Embeddings are untied with 6 different output classifiers. We evaluate with a single checkpoint and a beam of width 5. [...] We train for 50k updates on 128 GPUs with a batch size of 460k tokens for WMT 14 En-Fr and on 2 GPUs with 8k tokens per batch for IWSLT 14 De-En. To stabilize training, we re-normalize the gradients if the norm exceeds $g_{\text{clip}} = 3$. |
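
For readers attempting a reproduction, the values quoted in the Experiment Setup row can be collected into a single configuration object, and the gradient re-normalization step can be illustrated in a few lines. The following is a minimal sketch in plain Python, not the authors' code and not fairseq's API: the class name `ReportedSetup`, its field names, and the helper `clip_by_global_norm` are all hypothetical. The quoted excerpt is elided in places, so which dataset each architecture value belongs to should be checked against the paper.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportedSetup:
    """Values quoted in the Experiment Setup row; field names are hypothetical."""
    num_blocks: int = 6            # N = 6 decoder blocks
    ffn_dim: int = 1024            # feed-forward intermediate dimension
    num_heads: int = 4
    dropout: float = 0.3
    encoder_embed_dim: int = 512   # d_enc
    decoder_embed_dim: int = 256   # d_dec
    beam_width: int = 5
    max_updates: int = 50_000
    grad_clip_norm: float = 3.0    # g_clip


def clip_by_global_norm(grads, max_norm):
    """Rescale a flat list of gradient values so its L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads), norm
    scale = max_norm / norm
    return [g * scale for g in grads], norm


setup = ReportedSetup()
# A toy gradient with L2 norm 5, which exceeds g_clip = 3 and is rescaled.
clipped, original_norm = clip_by_global_norm([3.0, 4.0], setup.grad_clip_norm)
```

In an actual training loop the same rescaling would be applied to the concatenated gradients of all model parameters; deep learning frameworks typically ship an equivalent utility.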