Acquiring Knowledge from Pre-Trained Model to Neural Machine Translation
Authors: Rongxiang Weng, Heng Yu, Shujian Huang, Shanbo Cheng, Weihua Luo (pp. 9266-9273)
AAAI 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on WMT English to German, German to English and Chinese to English machine translation tasks show that our model outperforms strong baselines and the fine-tuning counterparts. |
| Researcher Affiliation | Collaboration | 1National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China 2Machine Intelligence Technology Lab, Alibaba Group, Hangzhou, China |
| Pseudocode | No | The paper includes diagrams but no structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at: https://github.com/wengrx/APT-NMT |
| Open Datasets | Yes | We conduct experiments on the WMT datasets, including WMT17 Chinese to English (ZH→EN), WMT14 English to German (EN→DE) and German to English (DE→EN), and the corresponding monolingual data. |
| Dataset Splits | Yes | On the ZH→EN task, we use WMT17 as the training set, which consists of about 7.5 million sentence pairs (only the CWMT part). We use newsdev2017 as the validation set, which has 2002 sentence pairs, and newstest2017 as the test set, which has 2001 sentence pairs. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models used for the experiments. |
| Software Dependencies | No | The paper mentions implementing their approach with an 'in-house implementation of Transformer derived from the tensor2tensor' but does not specify version numbers for Python, TensorFlow, or other key libraries. |
| Experiment Setup | Yes | For Transformer, we set the dimension of the input and output of all layers as 512, and that of the feed-forward layer to 2048. We employ 8 parallel attention heads. The number of layers for the encoder and decoder is 6. ... Each batch has 50 sentences and the maximum length of a sentence is limited to 100. We use label smoothing with value 0.1 and dropout with a rate of 0.1. We use Adam (Kingma and Ba 2014) to update the parameters, and the learning rate is varied under a warm-up strategy with 4000 steps. |
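
The reported setup matches a standard Transformer-base configuration. Below is a minimal sketch of those hyperparameters and of a warm-up learning-rate schedule. The paper only states "a warm-up strategy with 4000 steps"; the formula used here is the Noam schedule from Vaswani et al. (2017), which tensor2tensor applies by default, so treating it as the paper's exact schedule is an assumption.

```python
# Sketch of the reported Transformer-base hyperparameters and a warm-up
# learning-rate schedule. The Noam formula below is assumed from the
# original Transformer paper; the AAAI paper only reports "4000 warm-up steps".
from dataclasses import dataclass


@dataclass
class TransformerConfig:
    d_model: int = 512         # dimension of all layer inputs/outputs
    d_ff: int = 2048           # feed-forward inner dimension
    num_heads: int = 8         # parallel attention heads
    num_layers: int = 6        # encoder and decoder depth
    dropout: float = 0.1
    label_smoothing: float = 0.1
    batch_size: int = 50       # sentences per batch
    max_length: int = 100      # maximum sentence length
    warmup_steps: int = 4000


def noam_learning_rate(step: int, d_model: int = 512, warmup_steps: int = 4000) -> float:
    """Learning rate at a given step under the Noam warm-up schedule:
    lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)."""
    step = max(step, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


if __name__ == "__main__":
    cfg = TransformerConfig()
    # Learning rate rises linearly during warm-up, then decays as step^-0.5.
    for s in (100, 4000, 100000):
        lr = noam_learning_rate(s, cfg.d_model, cfg.warmup_steps)
        print(f"step {s:>6d}: lr = {lr:.6f}")
```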