Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Neural Machine Translation: A Review
Authors: Felix Stahlberg
JAIR 2020 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The field of machine translation (MT), the automatic translation of written text from one natural language into another, has experienced a major paradigm shift in recent years. Statistical MT, which mainly relies on various count-based models and which used to dominate MT research for decades, has largely been superseded by neural machine translation (NMT), which tackles translation with a single neural network. In this work we will trace back the origins of modern NMT architectures to word and sentence embeddings and earlier examples of the encoder-decoder network family. We will conclude with a short survey of more recent trends in the field. 8.1 Sentence Length Increasing the beam size exposes one of the most noticeable model errors in NMT. The red curve in Fig. 16 plots the BLEU score (Papineni et al., 2002) of a recent Transformer NMT model against the beam size. A beam size of 10 is optimal on this test set. Wider beams lead to a steady drop in translation performance because the generated translations are becoming too short (green curve). However, as expected, the log-probabilities of the found translations (blue curve) are increasing as we increase the beam size. NMT seems to assign too much probability mass to short hypotheses which are only found with more exhaustive search. |
| Researcher Affiliation | Academia | Felix Stahlberg EMAIL University of Cambridge, Engineering Department, Trumpington Street Cambridge CB2 1PZ, United Kingdom |
| Pseudocode | Yes | Algorithm 1 OneStepRNNsearch(s_prev, y_prev, h): 1: α ← (1/Z)[exp(a(s_prev, h_i))]_{i∈[1,I]} (Eq. 16) {Attention weights (α ∈ R^I, Z as in Eq. 16)} 2: c ← Σ_{i=1}^{I} α_i h_i (Eq. 15) {Context vector update (c ∈ R^m)} 3: s ← f(s_prev, y_prev, c) (Eq. 17) {RNN state update (s ∈ R^n)} 4: p ← g(y_prev, s, c) (Eq. 5) {p ∈ R^(\|Σ_trg\|) is the distribution over the next target token P(y_j\|·)} 5: return s, p — Algorithm 2 GreedyRNNsearch(s_init, h): 1: y ← ε 2: s ← s_init 3: y ← <s> 4: while y ≠ </s> do 5: s, p ← OneStepRNNsearch(s, y, h) 6: y ← argmax_{w∈Σ_trg} π_w(p) 7: y.append(y) 8: end while 9: return y — Algorithm 3 BeamRNNsearch(s_init, h, n ∈ N+): 1: H_cur ← {(ε, 0.0, s_init)} {Initialize with empty translation prefix and zero score} 2: repeat 3: H_next ← ∅ 4: for all (y, p_acc, s) ∈ H_cur do 5: if y_(\|y\|) = </s> then 6: H_next ← H_next ∪ {(y, p_acc, s)} {Hypotheses ending with </s> are not extended} 7: else 8: s, p ← OneStepRNNsearch(s, y_(\|y\|), h) 9: H_next ← H_next ∪ ⋃_{w∈Σ_trg} {(y·w, p_acc·π_w(p), s)} {Add all possible continuations} 10: end if 11: end for 12: H_cur ← {(y, p_acc, s) ∈ H_next : \|{(y′, p′_acc, s′) ∈ H_next : p′_acc > p_acc}\| < n} {Select n-best} 13: (ŷ, p̂_acc, ŝ) ← argmax_{(y, p_acc, s) ∈ H_cur} p_acc 14: until ŷ_(\|ŷ\|) = </s> 15: return ŷ |
| Open Source Code | No | The paper is a review article and primarily discusses existing NMT toolkits and their characteristics in Table 1 (e.g., "Tensor2Tensor Vaswani et al. (2018) TensorFlow", "OpenNMT-py Klein et al. (2017) Lua, (Py)Torch, TF") and does not claim to release new code for any novel methodology developed by the author of this paper. The URLs provided in Table 1 point to these third-party tools, not to code specific to this review. |
| Open Datasets | Yes | 8.1 Sentence Length [...] The red curve in Fig. 16 plots the BLEU score (Papineni et al., 2002) of a recent Transformer NMT model against the beam size. A beam size of 10 is optimal on this test set. Fig. 18: Distribution of words in the English portion of the English-German WMT18 training set (5.9M sentences, 140M words). |
| Dataset Splits | Yes | 8.1 Sentence Length [...] The red curve in Fig. 16 plots the BLEU score (Papineni et al., 2002) of a recent Transformer NMT model against the beam size. A beam size of 10 is optimal on this test set. Wider beams lead to a steady drop in translation performance because the generated translations are becoming too short (green curve). Fig. 18: Distribution of words in the English portion of the English-German WMT18 training set (5.9M sentences, 140M words). |
| Hardware Specification | No | NMT decoding is very fast on GPU hardware and can reach up to 5000 words per second (footnote 10: https://marian-nmt.github.io/features/). However, GPUs are very expensive, and speeding up CPU decoding to the level of SMT remains more challenging. The paper mentions "GPU hardware" and "CPU decoding" in general terms but does not provide specific models (e.g., NVIDIA A100, Intel Xeon) or detailed specifications of the hardware used for any experiments or analyses it presents. |
| Software Dependencies | No | Table 1 lists various NMT toolkits and their underlying frameworks, such as "Tensor2Tensor ... TensorFlow", "Fairseq ... PyTorch", "OpenNMT-py ... Lua, (Py)Torch, TF", "Sockeye ... MXNet", "Nematus ... TensorFlow, Theano", "Marian ... C++", "SGNMT ... TensorFlow, Theano, Cython". While these are software names, the paper does not specify version numbers for any of these frameworks or tools, which is necessary for reproducible software dependencies. |
| Experiment Setup | No | The paper is a review of Neural Machine Translation, discussing existing architectures, concepts, and findings from other research. It does not present new experimental results or methodologies developed by the author that would require detailing a specific experimental setup, hyperparameters, or training configurations for replication. |
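The decoding procedures quoted in the Pseudocode row (Algorithms 2 and 3) can be sketched as runnable Python. This is a minimal illustration, not the paper's implementation: `TOY_MODEL` and `one_step` are hypothetical stand-ins for OneStepRNNsearch and a trained NMT model, and beam scores are accumulated as log-probabilities, a common implementation choice.

```python
import math

# Hypothetical stand-in for the model of Algorithm 1 (OneStepRNNsearch):
# a fixed table mapping the previous token to a distribution over next tokens.
TOY_MODEL = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a":   {"b": 0.7, "</s>": 0.3},
    "b":   {"</s>": 0.9, "a": 0.1},
}

def one_step(token):
    """Return P(next token | previous token) under the toy model."""
    return TOY_MODEL[token]

def greedy_decode():
    """Algorithm 2: repeatedly pick the most probable next token until </s>."""
    y, tok = [], "<s>"
    while tok != "</s>":
        dist = one_step(tok)
        tok = max(dist, key=dist.get)  # argmax over the vocabulary
        y.append(tok)
    return y

def beam_decode(beam_size):
    """Algorithm 3: keep the beam_size best partial hypotheses by
    accumulated log-probability; stop when the best one ends in </s>."""
    hyps = [(["<s>"], 0.0)]  # (token sequence, accumulated log-prob)
    while True:
        candidates = []
        for toks, logp in hyps:
            if toks[-1] == "</s>":
                # Hypotheses ending with </s> are kept but not extended.
                candidates.append((toks, logp))
            else:
                # Add all possible single-token continuations.
                for w, p in one_step(toks[-1]).items():
                    candidates.append((toks + [w], logp + math.log(p)))
        # Select the n-best hypotheses by accumulated score.
        hyps = sorted(candidates, key=lambda h: h[1], reverse=True)[:beam_size]
        if hyps[0][0][-1] == "</s>":  # best hypothesis is complete
            return hyps[0][0][1:]     # drop the <s> start symbol
```

On this toy model, greedy and beam decoding agree; with a real NMT model, wider beams can surface shorter, higher-probability hypotheses, which is exactly the length bias described in the Section 8.1 excerpt quoted above.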