Learning Light-Weight Translation Models from Deep Transformer
Authors: Bei Li, Ziyang Wang, Hui Liu, Quan Du, Tong Xiao, Chunliang Zhang, Jingbo Zhu
AAAI 2021, pp. 13217-13225
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The experimental results on several benchmarks validate the effectiveness of our method. |
| Researcher Affiliation | Collaboration | NLP Lab, School of Computer Science and Engineering, Northeastern University, Shenyang, China; NiuTrans Research, Shenyang, China |
| Pseudocode | No | The paper includes diagrams and descriptions of algorithms, such as Figure 1 illustrating the GPKD method, but does not provide clearly labeled "Pseudocode" or "Algorithm" blocks. |
| Open Source Code | Yes | The code is publicly available at https://github.com/libeineu/GPKD. |
| Open Datasets | Yes | We ran experiments on the WMT16 English-German, NIST Open MT12 Chinese-English and WMT19 Chinese-English translation tasks. We used the same datasets as in (Vaswani et al. 2017; Wu et al. 2019a; Wang et al. 2019). We randomly extracted nearly 1.9M bilingual sentence pairs from NIST Open MT12. |
| Dataset Splits | Yes | For En-De, newstest2016 and newstest2014 were the validation and test data, respectively. For NIST Zh-En, MT06 was the validation set and the concatenation of MT04 and MT08 was the test set. For WMT Zh-En, we selected newstest2017 as the validation data and reported the BLEU scores on newstest2018 and newstest2019. We adopted the compound split strategy for En-De. (The splits are summarized in a sketch after this table.) |
| Hardware Specification | No | The paper mentions limiting "input/output tokens per batch to 4, 096/GPU" but does not specify any particular GPU model (e.g., NVIDIA A100, Tesla V100), CPU, or other specific hardware components used for running experiments. |
| Software Dependencies | No | The paper mentions using "Adam optimizer (Kingma and Ba 2015)" and "multi-bleu.perl" but does not provide specific version numbers for these or any other software libraries, frameworks, or programming languages used. |
| Experiment Setup | Yes | For training, we used the Adam optimizer (Kingma and Ba 2015) and followed the hyper-parameters used in Wang et al. (2019). We batched sentence pairs by approximate length, limited input/output tokens per batch to 4,096/GPU, and updated the parameters every two steps. The hidden size of the Base and Deep models was 512, and 1024 for the Big counterparts. The Base/Big/Deep models were updated for 50k/150k/50k steps on the En-De task, 25k/50k/25k steps on the NIST Zh-En task and 100k/200k/100k steps on the WMT Zh-En task. The beam size and length penalty were set to 4/0.6 and 6/1.3 for the En-De and Zh-En tasks, respectively. |
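
To make the reported splits easier to scan, here is a minimal summary as a Python mapping. The structure and the `EVAL_SPLITS` name are illustrative and not taken from the paper or the GPKD repository; only the split names themselves come from the quoted text.

```python
# Hypothetical summary of the validation/test splits reported in the paper.
# The dictionary name and layout are illustrative, not part of the GPKD code.
EVAL_SPLITS = {
    "wmt16_en_de": {
        "valid": "newstest2016",
        "test": ["newstest2014"],          # compound split applied at evaluation
    },
    "nist_zh_en": {
        "valid": "MT06",
        "test": ["MT04", "MT08"],          # concatenated into a single test set
    },
    "wmt19_zh_en": {
        "valid": "newstest2017",
        "test": ["newstest2018", "newstest2019"],
    },
}
```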
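
The training recipe in the Experiment Setup row maps naturally onto fairseq-style options, since the released code at https://github.com/libeineu/GPKD builds on fairseq. The sketch below assembles one plausible command pair for the Base model on En-De; only the values quoted above (4,096 tokens per GPU, update frequency 2, 50k updates, beam 4 and length penalty 0.6) come from the paper, while the data-bin path, architecture name, learning rate, warmup, and Adam betas are assumed defaults in the spirit of Vaswani et al. (2017) and Wang et al. (2019), not settings confirmed by the authors.

```python
# A minimal sketch of how the reported hyper-parameters could be passed to
# fairseq. Flag values marked "assumed" are not stated in the paper.
import shlex

train_cmd = [
    "fairseq-train", "data-bin/wmt16_en_de",      # assumed data path
    "--arch", "transformer_wmt_en_de",            # assumed: Base config, hidden size 512
    "--optimizer", "adam", "--adam-betas", "(0.9, 0.997)",  # betas assumed
    "--lr", "0.001",                              # assumed
    "--lr-scheduler", "inverse_sqrt", "--warmup-updates", "8000",  # assumed
    "--criterion", "label_smoothed_cross_entropy", "--label-smoothing", "0.1",  # assumed
    "--max-tokens", "4096",                       # 4,096 input/output tokens per GPU
    "--update-freq", "2",                         # parameters updated every two steps
    "--max-update", "50000",                      # 50k updates for the Base model on En-De
    "--share-all-embeddings",
]

generate_cmd = [
    "fairseq-generate", "data-bin/wmt16_en_de",
    "--path", "checkpoints/checkpoint_best.pt",
    "--beam", "4", "--lenpen", "0.6",             # En-De decoding settings from the paper
    "--remove-bpe",
]

print(" ".join(shlex.quote(tok) for tok in train_cmd))
print(" ".join(shlex.quote(tok) for tok in generate_cmd))
```

Reproducing the Big or Deep configurations would swap the architecture choice and the `--max-update` value (150k/50k on En-De) accordingly, again under the same assumptions.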