On the Pareto Front of Multilingual Neural Machine Translation

Authors: Liang Chen, Shuming Ma, Dongdong Zhang, Furu Wei, Baobao Chang

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | By training over 200 multilingual models with various model sizes, data sizes, and language directions, we find it interesting that the performance of a given translation direction does not always improve as its weight in the multi-task optimization objective increases. In our experiments, it achieves better performance than temperature-searching and gradient-manipulation methods with only 1/5 to 1/2 of the total training budget.
Researcher Affiliation | Collaboration | National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University; Microsoft Research
Pseudocode | No | No explicit pseudocode or algorithm blocks are present in the paper.
Open Source Code | Yes | We release the code at https://github.com/pkunlp-icler/ParetoMNMT for reproduction.
Open Datasets | Yes | We use datasets provided in WMT10 (Wang et al., 2020b) and WMT19 (Barrault et al., 2019) to conduct the MNMT experiments. The datasets are described in Appendix A. Table 5: Dataset descriptions for the main experiments. We randomly choose a subset of the full training set of a direction to form a smaller one.
Dataset Splits | No | The paper mentions using validation loss for checkpoint selection ("Evaluation is done every 5k steps, and we choose the best checkpoint with the lowest average validation loss"), implying the existence of a validation set. However, it does not provide explicit train/validation/test split details (e.g., percentages or sample counts per split) needed to reproduce the partitioning.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, memory specifications, or cloud computing instance types used to run the experiments.
Software Dependencies | No | The paper mentions using fairseq (Ott et al., 2019) as the training framework and the scipy.optimize.curve_fit function from the scipy library (footnote 1), but does not specify version numbers for these software components.
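For context, the fitting call the paper references can be invoked as below. This is a minimal sketch assuming a power-law loss form; the functional form, the synthetic data points, and all parameter names are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch of scipy.optimize.curve_fit usage (assumed power-law form;
# the paper names the function but does not publish its fitting code).
import numpy as np
from scipy.optimize import curve_fit

def power_law(d, c, alpha, l_inf):
    # Hypothetical scaling form: loss as a function of data size d.
    return c * d ** (-alpha) + l_inf

# Synthetic example data: training-set sizes vs. observed validation losses.
sizes = np.array([1e5, 5e5, 1e6, 5e6, 1e7])
losses = np.array([4.2, 3.1, 2.8, 2.3, 2.1])

params, _ = curve_fit(power_law, sizes, losses, p0=[10.0, 0.3, 1.5], maxfev=10000)
c, alpha, l_inf = params
print(f"c={c:.3f}, alpha={alpha:.3f}, L_inf={l_inf:.3f}")
```

Pinning the fairseq and scipy versions in a requirements file would close the dependency gap noted in this row.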
Experiment Setup | Yes | Table 6: Overview of model sizes and optimization hyper-parameters. All models are trained with 4k warmup steps, with the learning rate increasing linearly from 0 to 3e-4 and then decaying under the inverse_sqrt learning rate scheduler. The label smoothing term is set to 0.1, following NMT literature convention. Evaluation is done every 5k steps.
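To make the named scheduler concrete, here is a minimal sketch of the warmup-then-inverse-square-root rule using the hyper-parameters quoted above (4k warmup steps, peak learning rate 3e-4). The formulation follows the common fairseq-style inverse_sqrt schedule and should be treated as an assumption, since the paper only names the scheduler.

```python
import math

PEAK_LR = 3e-4        # peak learning rate from Table 6
WARMUP_STEPS = 4000   # 4k warmup steps from Table 6

def inverse_sqrt_lr(step: int) -> float:
    """Linear warmup from 0 to PEAK_LR, then inverse-square-root decay."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    # After warmup, lr at step t is PEAK_LR * sqrt(WARMUP_STEPS / t),
    # so the rate halves every time the step count quadruples.
    return PEAK_LR * math.sqrt(WARMUP_STEPS / step)

# Example: the learning rate at a few training steps.
for s in (1000, 4000, 16000, 64000):
    print(s, f"{inverse_sqrt_lr(s):.2e}")
```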