Mixed Cross Entropy Loss for Neural Machine Translation

Authors: Haoran Li, Wei Lu

ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the superiority of mixed CE over CE on several machine translation datasets, WMT'16 Ro-En, WMT'16 Ru-En, and WMT'14 En-De, in both teacher forcing and scheduled sampling setups. In this section, we conducted experiments to verify the effectiveness of mixed CE in teacher forcing and scheduled sampling on several benchmark datasets of different sizes.
Researcher Affiliation | Academia | StatNLP Research Group, Singapore University of Technology and Design, Singapore. Correspondence to: Wei Lu <luwei@sutd.edu.sg>.
Pseudocode | No | The paper provides mathematical formulations (Eqs. 2–10) and describes procedures in text and figures, but it does not include formal pseudocode or algorithm blocks; a hedged sketch of the mixed CE objective is given after the table.
Open Source Code | Yes | Our code is available at https://github.com/haorannlp/mix.
Open Datasets | Yes | WMT'16 Romanian-English (Ro-En, 610K pairs), WMT'16 Russian-English (Ru-En, 2.1M pairs), and WMT'14 English-German (En-De, 4.5M pairs). We used the preprocessed WMT'16 Ro-En dataset from Lee et al. (2018). For WMT'14 En-De, we used the preprocessing script from Fairseq (Ott et al., 2019).
Dataset Splits | Yes | We used newstest2013 as the validation set instead, following Zhang et al. (2019). We saved a checkpoint after each training epoch and selected the best checkpoint based on validation-set performance.
Hardware Specification | No | The paper describes training a 'standard base Transformer', but it does not specify any hardware details such as GPU models, CPU types, or memory used for the experiments.
Software Dependencies | No | The paper mentions tools such as Fairseq and the Adam optimizer, but it does not provide version numbers for any software dependencies, such as Python, PyTorch, or other libraries.
Experiment Setup | Yes | We trained the model for a total of 8,000/45,000/80,000 iterations on the Ro-En/Ru-En/En-De datasets, with each batch containing 12,288×4 / 12,288×4 / 12,288×8 tokens. We used the Adam optimizer (Kingma & Ba, 2015) with β1 = 0.9, β2 = 0.98. The learning rate is 0.0007 and is reduced by half... Unless otherwise specified, we also used label smoothing (γ = 0.1) in our experiments. The decay strategy for scheduled sampling is ϵ_i = d^(i/total_iter), 0 < d < 1, 1 ≤ i ≤ total_iter (Eq. 5); a small configuration sketch follows the table.
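
Because the paper presents mixed CE only as equations, the following is a minimal PyTorch-style sketch of the teacher-forcing variant, assuming the second cross-entropy term targets the model's own greedy prediction computed on the gold prefix and that the mixing weight m is supplied by the caller. The function name mixed_ce_loss and the padding handling are illustrative, not taken from the official code.

    import torch.nn.functional as F

    def mixed_ce_loss(logits, gold, m, pad_id=1):
        # logits: (batch, tgt_len, vocab) decoder outputs under teacher forcing (gold prefix)
        # gold:   (batch, tgt_len) reference target tokens; m: mixing weight in [0.5, 1]
        log_probs = F.log_softmax(logits, dim=-1)
        greedy = log_probs.argmax(dim=-1)                                   # model's own prediction at each step
        nll_gold = -log_probs.gather(-1, gold.unsqueeze(-1)).squeeze(-1)    # -log p(y_t | y_<t, x)
        nll_pred = -log_probs.gather(-1, greedy.unsqueeze(-1)).squeeze(-1)  # -log p(y_hat_t | y_<t, x)
        mask = gold.ne(pad_id).float()                                      # ignore padding positions
        loss = (m * nll_gold + (1.0 - m) * nll_pred) * mask
        return loss.sum() / mask.sum()

Label smoothing, which the paper combines with this loss, is omitted for brevity; annealing m from 1 toward 0.5 over training, e.g. m = max(0.5, 1 - step / total_steps), is one plausible schedule rather than necessarily the paper's.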
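
The reported optimizer and regularization settings, together with the scheduled-sampling decay as reconstructed from Eq. 5, can be expressed as the PyTorch sketch below; the model placeholder, the value of d, and the learning-rate halving trigger (elided in the excerpt) are assumptions.

    import torch
    import torch.nn as nn

    model = nn.Transformer()  # placeholder for the standard base Transformer used in the paper

    # Adam with the reported hyperparameters (beta1 = 0.9, beta2 = 0.98, lr = 0.0007)
    optimizer = torch.optim.Adam(model.parameters(), lr=7e-4, betas=(0.9, 0.98))

    # Label-smoothed cross entropy with gamma = 0.1 (requires PyTorch >= 1.10)
    criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

    def teacher_forcing_prob(i, total_iter, d=0.5):
        # Eq. 5 as reconstructed: eps_i = d ** (i / total_iter); d = 0.5 is an assumed value
        return d ** (i / total_iter)

The per-batch token budgets (12,288 tokens with 4 or 8 accumulation steps) would most naturally be realized through Fairseq's max-tokens and update-freq (gradient accumulation) options, since the authors report using Fairseq.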