Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Meta Back-Translation
Authors: Hieu Pham, Xinyi Wang, Yiming Yang, Graham Neubig
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our evaluations in both the standard datasets WMT En-De 14 and WMT En-Fr 14, as well as a multilingual translation setting, our method leads to significant improvements over strong baselines. |
| Researcher Affiliation | Academia | Anonymous authors Paper under double-blind review |
| Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. Figure 1 is an illustrative example, not pseudocode. |
| Open Source Code | No | The paper states it uses existing architectures and frameworks ('Transformer-Base architecture (Vaswani et al., 2017)' and 'fairseq (Ott et al., 2019)') but does not provide a link or explicit statement for its own source code for Meta BT. |
| Open Datasets | Yes | For the standard setting, we consider two large datasets: WMT En-De 2014 and WMT En-Fr 2014, tokenized with SentencePiece (Kudo & Richardson, 2018) using a joint vocabulary size of 32K for each dataset. ... The multilingual setting uses the multilingual TED talk dataset (Qi et al., 2018). |
| Dataset Splits | Yes | For the standard setting, we consider two large datasets: WMT En-De 2014 and WMT En-Fr 2014 [footnote 1: http://www.statmt.org/wmt14/]. ... we also have a separate validation set for hyper-parameter tuning and model selection. |
| Hardware Specification | Yes | All experiments are conducted on 8 NVIDIA V100 GPUs. |
| Software Dependencies | No | The paper mentions 'fairseq (Ott et al., 2019)' and 'Adam (Kingma & Ba, 2015)' but does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | Optimizer: Adam (Kingma & Ba, 2015) with β1 = 0.9 and β2 = 0.98. The initial learning rate is 5e-4 and is warmed up for 4000 steps, then decayed using inverse square root. Label smoothing: 0.1. Dropout: 0.3. Min-max batching for parallel data, with 4096 tokens per batch. For monolingual data, the batch size is 64 sentences for WMT En-De, 16 for WMT En-Fr, and 8 for the multilingual data. |
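
The learning-rate schedule quoted in the Experiment Setup row can be sketched as follows. This is a minimal illustration and not the authors' code. It assumes a linear warmup from zero to the stated 5e-4 over the first 4000 steps, followed by decay proportional to the inverse square root of the step count. The function name `inverse_sqrt_lr` and the constants `PEAK_LR` and `WARMUP_STEPS` are hypothetical names introduced here, and the exact warmup shape used in the paper may differ.

```python
import math

# Values taken from the Experiment Setup row; the names are illustrative.
PEAK_LR = 5e-4
WARMUP_STEPS = 4000


def inverse_sqrt_lr(step, peak_lr=PEAK_LR, warmup_steps=WARMUP_STEPS):
    """Return the learning rate at a 1-indexed training step.

    Assumption: linear warmup from 0 to peak_lr, then decay
    proportional to 1 / sqrt(step).
    """
    if step < 1:
        raise ValueError("step must be >= 1")
    if step <= warmup_steps:
        # Linear ramp that reaches peak_lr exactly at warmup_steps.
        return peak_lr * step / warmup_steps
    # Inverse square root decay, continuous with the warmup at warmup_steps.
    return peak_lr * math.sqrt(warmup_steps / step)


if __name__ == "__main__":
    for s in (1, 2000, 4000, 16000, 64000):
        print(s, inverse_sqrt_lr(s))
```

Under these assumptions the rate peaks at step 4000 and halves each time the step count quadruples after that.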