Mixed Cross Entropy Loss for Neural Machine Translation
Authors: Haoran Li, Wei Lu
ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the superiority of mixed CE over CE on several machine translation datasets, WMT 16 Ro-En, WMT 16 Ru-En, and WMT 14 En-De in both teacher forcing and scheduled sampling setups. In this section, we conducted experiments to verify the effectiveness of mixed CE in teacher forcing and scheduled sampling on several benchmark datasets with different sizes. |
| Researcher Affiliation | Academia | StatNLP Research Group, Singapore University of Technology and Design, Singapore. Correspondence to: Wei Lu <luwei@sutd.edu.sg>. |
| Pseudocode | No | The paper provides mathematical formulations (e.g., Eq. 2, 3, 4, 5, 6, 7, 8, 9, 10) and describes procedures in text and figures, but it does not include formal pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is available at https://github.com/haorannlp/mix. |
| Open Datasets | Yes | WMT 16 Romanian-English (Ro-En, 610K pairs), WMT 16 Russian-English (Ru-En, 2.1M pairs), and WMT 14 English-German (En-De, 4.5M pairs). We used the preprocessed WMT 16 Ro-En dataset from Lee et al. (2018). For WMT 14 En-De, we used the script from Fairseq (Ott et al., 2019) for preprocessing. |
| Dataset Splits | Yes | We used newstest2013 as the validation set instead, following Zhang et al. (2019). We saved a checkpoint after training the model for each epoch and selected the best checkpoint based on performance on the validation set. |
| Hardware Specification | No | The paper describes training models and using a 'Standard base Transformer', but it does not specify any hardware details such as GPU models, CPU types, or memory used for the experiments. |
| Software Dependencies | No | The paper mentions tools like 'Fairseq' and 'Adam' optimizer, but it does not provide specific version numbers for any software dependencies, such as Python, PyTorch, or other libraries. |
| Experiment Setup | Yes | We trained the model for a total of 8,000/45,000/80,000 iterations on the Ro-En/Ru-En/En-De datasets, with each batch containing 12,288 × 4 / 12,288 × 4 / 12,288 × 8 tokens. We used the Adam (Kingma & Ba, 2015) optimizer with β1 = 0.9, β2 = 0.98. The learning rate is 0.0007 and will be reduced by half... Unless otherwise specified, we also used label smoothing (γ = 0.1) in our experiments. The decay strategy for scheduled sampling...: ε_i = d^(i / total_iter), with 0 < d < 1 and 1 ≤ i ≤ total_iter (Eq. 5); see the sketch below the table. |
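To make the quoted decay schedule concrete, here is a minimal Python sketch. It assumes the reconstruction ε_i = d^(i / total_iter) (the superscript appears to have been lost in PDF extraction) and a hypothetical decay base d = 0.5; the paper's actual value of d is not quoted in this section, and the function name `sampling_epsilon` is ours, not the authors'.

```python
# Minimal sketch of the scheduled-sampling decay quoted in the Experiment
# Setup row, assuming eps_i = d ** (i / total_iter) with 0 < d < 1.
# Both `sampling_epsilon` and the example value d = 0.5 are illustrative
# assumptions, not taken from the paper.

def sampling_epsilon(i: int, total_iter: int, d: float) -> float:
    """Probability of feeding the ground-truth token at iteration i.

    Under the assumed form, it decays smoothly from roughly 1
    (early training, i << total_iter) down to d at the final iteration.
    """
    assert 0.0 < d < 1.0, "decay base must lie in (0, 1)"
    assert 1 <= i <= total_iter, "iteration index out of range"
    return d ** (i / total_iter)


if __name__ == "__main__":
    # Ro-En setup from the table: 8,000 total training iterations.
    total_iter, d = 8_000, 0.5
    for i in (1, 2_000, 4_000, 8_000):
        print(f"iter {i:>5}: epsilon = {sampling_epsilon(i, total_iter, d):.3f}")
```

Under these assumptions, ε starts near 1 (mostly teacher forcing) and anneals to d by the last iteration, which matches the "decay strategy" wording; the quoted bounds alone would also admit other monotone schedules, so this is a plausible reading rather than a definitive reconstruction.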