Acquiring Knowledge from Pre-Trained Model to Neural Machine Translation
Authors: Rongxiang Weng, Heng Yu, Shujian Huang, Shanbo Cheng, Weihua Luo (pp. 9266-9273)
AAAI 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on WMT English to German, German to English and Chinese to English machine translation tasks show that our model outperforms strong baselines and the fine-tuning counterparts. |
| Researcher Affiliation | Collaboration | 1National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China 2Machine Intelligence Technology Lab, Alibaba Group, Hangzhou, China |
| Pseudocode | No | The paper includes diagrams but no structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at: https://github.com/wengrx/APT-NMT |
| Open Datasets | Yes | We conduct experiments on the WMT datasets, including WMT17 Chinese to English (ZH→EN), WMT14 English to German (EN→DE) and German to English (DE→EN), and the corresponding monolingual data. |
| Dataset Splits | Yes | On the ZH→EN task, we use WMT17 as the training set, which consists of about 7.5 million sentence pairs (only the CWMT part). We use newsdev2017 as the validation set, which has 2002 sentence pairs, and newstest2017 as the test set, which has 2001 sentence pairs. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models used for the experiments. |
| Software Dependencies | No | The paper mentions implementing their approach with an 'in-house implementation of Transformer derived from the tensor2tensor' but does not specify version numbers for Python, TensorFlow, or other key libraries. |
| Experiment Setup | Yes | For Transformer, we set the dimension of the input and output of all layers as 512, and that of the feed-forward layer to 2048. We employ 8 parallel attention heads. The number of layers for the encoder and decoder is 6. ... Each batch has 50 sentences and the maximum length of a sentence is limited to 100. We use label smoothing with value 0.1 and dropout with a rate of 0.1. We use Adam (Kingma and Ba 2014) to update the parameters, and the learning rate is varied under a warm-up strategy with 4000 steps. |
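
The reported setup matches a standard Transformer-base configuration. Below is a minimal sketch of those hyperparameters and of a warm-up learning-rate schedule. The paper only states "a warm-up strategy with 4000 steps"; the formula used here is the Noam schedule from Vaswani et al. (2017), which tensor2tensor applies by default, so treating it as the paper's exact schedule is an assumption.

```python
# Sketch of the reported Transformer-base hyperparameters and a warm-up
# learning-rate schedule. The Noam formula below is assumed from the
# original Transformer paper; the AAAI paper only reports "4000 warm-up steps".
from dataclasses import dataclass


@dataclass
class TransformerConfig:
    d_model: int = 512         # dimension of all layer inputs/outputs
    d_ff: int = 2048           # feed-forward inner dimension
    num_heads: int = 8         # parallel attention heads
    num_layers: int = 6        # encoder and decoder depth
    dropout: float = 0.1
    label_smoothing: float = 0.1
    batch_size: int = 50       # sentences per batch
    max_length: int = 100      # maximum sentence length
    warmup_steps: int = 4000


def noam_learning_rate(step: int, d_model: int = 512, warmup_steps: int = 4000) -> float:
    """Learning rate at a given step under the Noam warm-up schedule:
    lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)."""
    step = max(step, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


if __name__ == "__main__":
    cfg = TransformerConfig()
    # Learning rate rises linearly during warm-up, then decays as step^-0.5.
    for s in (100, 4000, 100000):
        lr = noam_learning_rate(s, cfg.d_model, cfg.warmup_steps)
        print(f"step {s:>6d}: lr = {lr:.6f}")
```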