Learning Light-Weight Translation Models from Deep Transformer
Authors: Bei Li, Ziyang Wang, Hui Liu, Quan Du, Tong Xiao, Chunliang Zhang, Jingbo Zhu
AAAI 2021, pp. 13217-13225
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The experimental results on several benchmarks validate the effectiveness of our method. |
| Researcher Affiliation | Collaboration | NLP Lab, School of Computer Science and Engineering, Northeastern University, Shenyang, China; NiuTrans Research, Shenyang, China |
| Pseudocode | No | The paper includes diagrams and descriptions of algorithms, such as Figure 1 illustrating the GPKD method, but does not provide clearly labeled "Pseudocode" or "Algorithm" blocks. |
| Open Source Code | Yes | The code is publicly available at https://github.com/libeineu/GPKD. |
| Open Datasets | Yes | We ran experiments on the WMT16 English-German, NIST Open MT12 Chinese-English and WMT19 Chinese-English translation tasks. We used the same datasets as in (Vaswani et al. 2017; Wu et al. 2019a; Wang et al. 2019). We randomly extracted nearly 1.9M bilingual sentence pairs from NIST Open MT12. |
| Dataset Splits | Yes | For En-De, newstest2016 and newstest2014 were the validation and test data, respectively. For NIST Zh-En, MT06 was the validation set and the concatenation of MT04 and MT08 was the test set. For WMT Zh-En, we selected newstest2017 as the validation data and reported the BLEU scores on newstest2018 and newstest2019. We adopted the compound split strategy for En-De. (The splits are summarized in a sketch after this table.) |
| Hardware Specification | No | The paper mentions limiting "input/output tokens per batch to 4, 096/GPU" but does not specify any particular GPU model (e.g., NVIDIA A100, Tesla V100), CPU, or other specific hardware components used for running experiments. |
| Software Dependencies | No | The paper mentions using "Adam optimizer (Kingma and Ba 2015)" and "multi-bleu.perl" but does not provide specific version numbers for these or any other software libraries, frameworks, or programming languages used. |
| Experiment Setup | Yes | For training, we used the Adam optimizer (Kingma and Ba 2015) and followed the hyper-parameters used in Wang et al. (2019). We batched sentence pairs by approximate length, limited input/output tokens per batch to 4,096/GPU, and updated the parameters every two steps. The hidden size of the Base and Deep models was 512, and 1024 for the Big counterparts. The Base/Big/Deep models were updated for 50k/150k/50k steps on the En-De task, 25k/50k/25k steps on the NIST Zh-En task and 100k/200k/100k steps on the WMT Zh-En task. The beam size and length penalty were set to 4/0.6 and 6/1.3 for the En-De and Zh-En tasks, respectively. |
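
To make the reported splits easier to scan, here is a minimal summary as a Python mapping. The structure and the `EVAL_SPLITS` name are illustrative and not taken from the paper or the GPKD repository; only the split names themselves come from the quoted text.

```python
# Hypothetical summary of the validation/test splits reported in the paper.
# The dictionary name and layout are illustrative, not part of the GPKD code.
EVAL_SPLITS = {
    "wmt16_en_de": {
        "valid": "newstest2016",
        "test": ["newstest2014"],          # compound split applied at evaluation
    },
    "nist_zh_en": {
        "valid": "MT06",
        "test": ["MT04", "MT08"],          # concatenated into a single test set
    },
    "wmt19_zh_en": {
        "valid": "newstest2017",
        "test": ["newstest2018", "newstest2019"],
    },
}
```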
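
The training recipe in the Experiment Setup row maps naturally onto fairseq-style options, since the released code at https://github.com/libeineu/GPKD builds on fairseq. The sketch below assembles one plausible command pair for the Base model on En-De; only the values quoted above (4,096 tokens per GPU, update frequency 2, 50k updates, beam 4 and length penalty 0.6) come from the paper, while the data-bin path, architecture name, learning rate, warmup, and Adam betas are assumed defaults in the spirit of Vaswani et al. (2017) and Wang et al. (2019), not settings confirmed by the authors.

```python
# A minimal sketch of how the reported hyper-parameters could be passed to
# fairseq. Flag values marked "assumed" are not stated in the paper.
import shlex

train_cmd = [
    "fairseq-train", "data-bin/wmt16_en_de",      # assumed data path
    "--arch", "transformer_wmt_en_de",            # assumed: Base config, hidden size 512
    "--optimizer", "adam", "--adam-betas", "(0.9, 0.997)",  # betas assumed
    "--lr", "0.001",                              # assumed
    "--lr-scheduler", "inverse_sqrt", "--warmup-updates", "8000",  # assumed
    "--criterion", "label_smoothed_cross_entropy", "--label-smoothing", "0.1",  # assumed
    "--max-tokens", "4096",                       # 4,096 input/output tokens per GPU
    "--update-freq", "2",                         # parameters updated every two steps
    "--max-update", "50000",                      # 50k updates for the Base model on En-De
    "--share-all-embeddings",
]

generate_cmd = [
    "fairseq-generate", "data-bin/wmt16_en_de",
    "--path", "checkpoints/checkpoint_best.pt",
    "--beam", "4", "--lenpen", "0.6",             # En-De decoding settings from the paper
    "--remove-bpe",
]

print(" ".join(shlex.quote(tok) for tok in train_cmd))
print(" ".join(shlex.quote(tok) for tok in generate_cmd))
```

Reproducing the Big or Deep configurations would swap the architecture choice and the `--max-update` value (150k/50k on En-De) accordingly, again under the same assumptions.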