Mixed Cross Entropy Loss for Neural Machine Translation

Authors: Haoran Li, Wei Lu

ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the superiority of mixed CE over CE on several machine translation datasets, WMT'16 Ro-En, WMT'16 Ru-En, and WMT'14 En-De, in both teacher forcing and scheduled sampling setups. In this section, we conducted experiments to verify the effectiveness of mixed CE in teacher forcing and scheduled sampling on several benchmark datasets of different sizes.
Researcher Affiliation | Academia | StatNLP Research Group, Singapore University of Technology and Design, Singapore. Correspondence to: Wei Lu <luwei@sutd.edu.sg>.
Pseudocode | No | The paper provides mathematical formulations (Eqs. 2–10) and describes procedures in text and figures, but it does not include formal pseudocode or algorithm blocks; a hedged sketch of the mixed CE objective is given after the table.
Open Source Code | Yes | Our code is available at https://github.com/haorannlp/mix.
Open Datasets | Yes | WMT'16 Romanian-English (Ro-En, 610K pairs), WMT'16 Russian-English (Ru-En, 2.1M pairs), and WMT'14 English-German (En-De, 4.5M pairs). We used the preprocessed WMT'16 Ro-En dataset from Lee et al. (2018). For WMT'14 En-De, we used the preprocessing script from Fairseq (Ott et al., 2019).
Dataset Splits | Yes | We used newstest2013 as the validation set instead, following Zhang et al. (2019). We saved a checkpoint after each training epoch and selected the best checkpoint based on validation-set performance.
Hardware Specification | No | The paper describes training a 'standard base Transformer', but it does not specify any hardware details such as GPU models, CPU types, or memory used for the experiments.
Software Dependencies | No | The paper mentions tools such as Fairseq and the Adam optimizer, but it does not provide version numbers for any software dependencies, such as Python, PyTorch, or other libraries.
Experiment Setup | Yes | We trained the model for a total of 8,000/45,000/80,000 iterations on the Ro-En/Ru-En/En-De datasets, with each batch containing 12,288×4 / 12,288×4 / 12,288×8 tokens. We used the Adam optimizer (Kingma & Ba, 2015) with β1 = 0.9, β2 = 0.98. The learning rate is 0.0007 and is reduced by half... Unless otherwise specified, we also used label smoothing (γ = 0.1) in our experiments. The decay strategy for scheduled sampling is ϵ_i = d^(i/total_iter), 0 < d < 1, 1 ≤ i ≤ total_iter (Eq. 5); a small configuration sketch follows the table.
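
Because the paper presents mixed CE only as equations, the following is a minimal PyTorch-style sketch of the teacher-forcing variant, assuming the second cross-entropy term targets the model's own greedy prediction computed on the gold prefix and that the mixing weight m is supplied by the caller. The function name mixed_ce_loss and the padding handling are illustrative, not taken from the official code.

    import torch.nn.functional as F

    def mixed_ce_loss(logits, gold, m, pad_id=1):
        # logits: (batch, tgt_len, vocab) decoder outputs under teacher forcing (gold prefix)
        # gold:   (batch, tgt_len) reference target tokens; m: mixing weight in [0.5, 1]
        log_probs = F.log_softmax(logits, dim=-1)
        greedy = log_probs.argmax(dim=-1)                                   # model's own prediction at each step
        nll_gold = -log_probs.gather(-1, gold.unsqueeze(-1)).squeeze(-1)    # -log p(y_t | y_<t, x)
        nll_pred = -log_probs.gather(-1, greedy.unsqueeze(-1)).squeeze(-1)  # -log p(y_hat_t | y_<t, x)
        mask = gold.ne(pad_id).float()                                      # ignore padding positions
        loss = (m * nll_gold + (1.0 - m) * nll_pred) * mask
        return loss.sum() / mask.sum()

Label smoothing, which the paper combines with this loss, is omitted for brevity; annealing m from 1 toward 0.5 over training, e.g. m = max(0.5, 1 - step / total_steps), is one plausible schedule rather than necessarily the paper's.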
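
The reported optimizer and regularization settings, together with the scheduled-sampling decay as reconstructed from Eq. 5, can be expressed as the PyTorch sketch below; the model placeholder, the value of d, and the learning-rate halving trigger (elided in the excerpt) are assumptions.

    import torch
    import torch.nn as nn

    model = nn.Transformer()  # placeholder for the standard base Transformer used in the paper

    # Adam with the reported hyperparameters (beta1 = 0.9, beta2 = 0.98, lr = 0.0007)
    optimizer = torch.optim.Adam(model.parameters(), lr=7e-4, betas=(0.9, 0.98))

    # Label-smoothed cross entropy with gamma = 0.1 (requires PyTorch >= 1.10)
    criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

    def teacher_forcing_prob(i, total_iter, d=0.5):
        # Eq. 5 as reconstructed: eps_i = d ** (i / total_iter); d = 0.5 is an assumed value
        return d ** (i / total_iter)

The per-batch token budgets (12,288 tokens with 4 or 8 accumulation steps) would most naturally be realized through Fairseq's max-tokens and update-freq (gradient accumulation) options, since the authors report using Fairseq.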