Reinforced Curriculum Learning on Pre-Trained Neural Machine Translation Models
Authors: Mingjun Zhao, Haijiang Wu, Di Niu, Xiaoli Wang | pp. 9652-9659
AAAI 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on several translation datasets show that our method can further improve the performance of NMT when original batch training reaches its ceiling, without using additional new training data, and significantly outperforms several strong baseline methods. |
| Researcher Affiliation | Collaboration | (1) University of Alberta, Edmonton, AB, Canada; (2) Platform and Content Group, Tencent, Shenzhen, China |
| Pseudocode | Yes | Algorithm 1 summarizes the overall learning process of the proposed framework. Algorithm 1: The Proposed Method |
| Open Source Code | No | The paper states 'build our system based on (Shangtong 2018)', which refers to an external library, but does not explicitly state that the authors' own source code for the described methodology is publicly available. |
| Open Datasets | Yes | CASICTB, CASIA2015, and NEU are three independent datasets with 1M, 2M, and 2M examples, drawn from different data sources in WMT18, a public news-domain translation dataset with more than 20M samples. |
| Dataset Splits | Yes | MTD is an internal news translation dataset with 1M samples in the training set and 1,892 samples in each of the validation and test sets. The other three datasets (CASICTB, CASIA2015, NEU) share the same validation set (newsdev2017) and test set (newstest2017), each composed of 2k samples. |
| Hardware Specification | Yes | We implement our models in PyTorch 1.1.0 (Paszke et al. 2017) and train the model with a single Tesla P40. |
| Software Dependencies | Yes | We implement our models in PyTorch 1.1.0 (Paszke et al. 2017) and train the model with a single Tesla P40. |
| Experiment Setup | Yes | The model consists of a 6-layer encoder and decoder, with 8 attention heads and 2,048 units in the feed-forward layers. The multi-head attention model dimension and the word embedding size are both set to 512. During training, we use the Adam optimizer (Kingma and Ba 2015) with a learning-rate factor of 2.0, decayed by a Noam scheduler with 8,000 warm-up steps. Each training batch contains 4,096 tokens and is selected with bucketing (Kocmi and Bojar 2017). During inference, we employ beam search with a beam size of 5. (An illustrative configuration sketch follows the table.) |
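
The Experiment Setup row describes a standard Transformer-base configuration trained with a Noam learning-rate schedule. The sketch below is illustrative and is not the authors' code: the hyperparameters (6 layers, 8 heads, 2,048 feed-forward units, model dimension 512, 8,000 warm-up steps, learning-rate factor 2.0) come from the quoted setup, while `VOCAB_SIZE`, the Adam betas/eps values, and the `noam_lr` wrapper are assumptions added to make the example self-contained.

```python
# Illustrative sketch (not the authors' code) of the quoted Transformer setup
# and Noam learning-rate schedule.
import torch
import torch.nn as nn

D_MODEL = 512        # multi-head attention model dimension / embedding size
N_HEADS = 8          # attention heads
N_LAYERS = 6         # encoder and decoder layers
FFN_DIM = 2048       # feed-forward layer units
WARMUP_STEPS = 8000  # warm-up steps for the Noam scheduler
LR_FACTOR = 2.0      # learning-rate factor quoted in the paper
VOCAB_SIZE = 32000   # assumption: vocabulary size is not reported in the quote

model = nn.Transformer(
    d_model=D_MODEL,
    nhead=N_HEADS,
    num_encoder_layers=N_LAYERS,
    num_decoder_layers=N_LAYERS,
    dim_feedforward=FFN_DIM,
)
embedding = nn.Embedding(VOCAB_SIZE, D_MODEL)

# Adam with the betas/eps commonly used for Transformer training
# (assumption: the quote only states "Adam optimizer").
optimizer = torch.optim.Adam(
    list(model.parameters()) + list(embedding.parameters()),
    lr=0.0, betas=(0.9, 0.98), eps=1e-9,
)

def noam_lr(step: int) -> float:
    """lr = factor * d_model^-0.5 * min(step^-0.5, step * warmup^-1.5)."""
    step = max(step, 1)
    return LR_FACTOR * D_MODEL ** -0.5 * min(step ** -0.5, step * WARMUP_STEPS ** -1.5)

# Set the scheduled learning rate before each optimizer step.
for step in range(1, 10):
    for group in optimizer.param_groups:
        group["lr"] = noam_lr(step)
    # ... forward pass on a 4,096-token batch, loss.backward(), optimizer.step() ...
```

With these constants, the learning rate rises linearly during the first 8,000 steps and then decays proportionally to the inverse square root of the step count, which matches the "learning rate of 2.0 decaying with a Noam scheduler" phrasing in the quoted setup.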