Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Agreement on Target-Bidirectional Recurrent Neural Networks for Sequence-to-Sequence Learning
Authors: Lemao Liu, Andrew Finch, Masao Utiyama, Eiichiro Sumita
JAIR 2020 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments were performed on three standard sequence-to-sequence transduction tasks: machine transliteration, grapheme-to-phoneme conversion, and machine translation. The results show that the proposed approach achieves consistent and substantial improvements compared to many state-of-the-art systems. |
| Researcher Affiliation | Academia | Lemao Liu EMAIL Andrew Finch EMAIL Masao Utiyama EMAIL Eiichiro Sumita EMAIL National Institute of Information & Communications Technology 3-5 Hikari-dai, Seika-cho, Soraku-gun, Kyoto, Japan |
| Pseudocode | Yes | Algorithm 1: Beam Search Algorithm; Algorithm 2: Variant Beam Search Algorithm |
| Open Source Code | Yes | Our toolkit is publicly available on https://github.com/lemaoliu/Agtarbidir. |
| Open Datasets | Yes | For the machine transliteration task, we conducted both Japanese-to-English (Jp-En) and English-to-Japanese (En-Jp) directional subtasks. The transliteration training, development and test sets were taken from Wikipedia inter-language link titles from Fukunishi, Finch, Yamamoto, and Sumita (2013). For grapheme-to-phoneme (Gm-Pm) conversion, the standard CMUdict data sets were used: the original training set was randomly split into our training set (about 110000 sequence pairs) and development set (2000 pairs); the original test set consisting of about 12000 pairs was used for testing. For the Jp-En task, we used the data from NTCIR-9 (Goto, Lu, Chow, Sumita, & Tsou, 2011): the training data consisted of 2.0M sentence pairs; the development and test sets each contained 2K sentences with a single reference. For the Ch-En task, we used the data from the NIST2008 Open Machine Translation Campaign: the training data consisted of 1.8M sentence pairs, the development set was nist02 (878 sentences), and the test sets were nist05 (1082 sentences), nist06 (1664 sentences) and nist08 (1357 sentences). |
| Dataset Splits | Yes | The training data consisted of 59000 sequence pairs composed of 313378 Japanese katakana characters and 445254 English characters; the development and test data were manually cleaned and each of them consisted of 1000 sequence pairs. For grapheme-to-phoneme (Gm-Pm) conversion, the standard CMUdict data sets were used: the original training set was randomly split into our training set (about 110000 sequence pairs) and development set (2000 pairs); the original test set consisting of about 12000 pairs was used for testing. For the Jp-En task, we used the data from NTCIR-9 (Goto, Lu, Chow, Sumita, & Tsou, 2011): the training data consisted of 2.0M sentence pairs; the development and test sets each contained 2K sentences with a single reference. For the Ch-En task, we used the data from the NIST2008 Open Machine Translation Campaign: the training data consisted of 1.8M sentence pairs, the development set was nist02 (878 sentences), and the test sets were nist05 (1082 sentences), nist06 (1664 sentences) and nist08 (1357 sentences). |
| Hardware Specification | Yes | Training was conducted on a single Tesla K80 GPU, and it took about 6 days to train a single Ab RNN system on our large-scale data. |
| Software Dependencies | No | Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy). We used AdaDelta for training the RNN-based systems: the decay rate ρ and constant ϵ were set to 0.95 and 10^-6 as suggested by Zeiler (2012). Moses: a phrase-based machine translation model (Koehn et al., 2007) used with default settings. GIZA++ (Och & Ney, 2000) with grow-diag-final-and was used to build the translation model. We trained 5-gram target language models with SRILM (Stolcke et al., 2002) using the training set for Jp-En and the Gigaword corpus for Ch-En. |
| Experiment Setup | Yes | For all of the re-implemented models based on Af RNN, the number of word embedding units and hidden units was set to 500. We used AdaDelta for training the RNN-based systems: the decay rate ρ and constant ϵ were set to 0.95 and 10^-6 as suggested by Zeiler (2012), and minibatch sizes were 16. In our experiments, we found that a one-layer RNN works well for Af RNN, thanks to the limited vocabulary in this task. Therefore, we employed a one-layer RNN for all Af RNN-based models, including both unidirectional and bidirectional models. For all of the RNN-based models, we used the same configuration and hyperparameters as in the machine transliteration task, except that the minibatch size was 64 for the Gm-Pm task. We used the following settings for Ab RNN-based systems: the dimension of word embedding was 620, the dimension of hidden units was 1000, the batch size was 80, the source- and target-side vocabulary sizes were 30000, the maximum sequence length was set to 80, and the beam size for decoding was 12. |
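The AdaDelta hyperparameters quoted above (decay rate ρ = 0.95, constant ϵ = 10^-6) can be made concrete with a minimal sketch of the AdaDelta update rule from Zeiler (2012). This is an illustration only, not the paper's training code; the scalar toy objective and function names below are assumptions for the sketch.

```python
import math

def adadelta_step(x, grad, acc_g, acc_dx, rho=0.95, eps=1e-6):
    """One AdaDelta update for a scalar parameter.

    rho and eps match the settings quoted in the report:
    decay rate 0.95 and constant 1e-6.
    """
    # Accumulate a decaying average of squared gradients: E[g^2].
    acc_g = rho * acc_g + (1 - rho) * grad ** 2
    # Parameter delta scaled by the RMS of past deltas over the RMS of gradients.
    dx = -math.sqrt(acc_dx + eps) / math.sqrt(acc_g + eps) * grad
    # Accumulate a decaying average of squared deltas: E[dx^2].
    acc_dx = rho * acc_dx + (1 - rho) * dx ** 2
    return x + dx, acc_g, acc_dx

# Toy usage (hypothetical): minimize f(x) = (x - 3)^2 starting from x = 0.
x, acc_g, acc_dx = 0.0, 0.0, 0.0
for _ in range(200):
    grad = 2 * (x - 3)
    x, acc_g, acc_dx = adadelta_step(x, grad, acc_g, acc_dx)
```

Note that AdaDelta needs no learning rate: the ratio of the two running RMS accumulators sets the effective step size, which is why the paper only reports ρ and ϵ.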