On the Pareto Front of Multilingual Neural Machine Translation
Authors: Liang Chen, Shuming Ma, Dongdong Zhang, Furu Wei, Baobao Chang
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | By training over 200 multilingual models with various model sizes, data sizes, and language directions, we find it interesting that the performance of a certain translation direction does not always improve with the increase of its weight in the multi-task optimization objective. In our experiments, the proposed method achieves better performance than temperature searching and gradient manipulation methods with only 1/5 to 1/2 of the total training budget. |
| Researcher Affiliation | Collaboration | 1National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University 2Microsoft Research |
| Pseudocode | No | No explicit pseudocode or algorithm blocks are present in the paper. |
| Open Source Code | Yes | We release the code at https://github.com/pkunlp-icler/ParetoMNMT for reproduction. |
| Open Datasets | Yes | We use datasets provided in WMT10 (Wang et al., 2020b) and WMT19 (Barrault et al., 2019) to conduct the MNMT experiment. The description of the datasets is listed in Appendix A. Table 5: The datasets description for the main experiments. We randomly choose a subset of the full training set of a direction to form a smaller one. |
| Dataset Splits | No | The paper mentions using validation loss for checkpoint selection ("Evaluation is done every 5k steps, and we choose the best checkpoint with lowest average validation loss"), implying the existence of a validation set. However, it does not provide explicit train/validation/test split details (e.g., percentages or sample counts per split) needed to reproduce the partitioning. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, memory specifications, or cloud computing instance types used for running the experiments. |
| Software Dependencies | No | The paper mentions using "fairseq (Ott et al., 2019) as the training framework" and the "scipy.optimize.curve_fit function from the scipy library" (footnote 1), but does not specify version numbers for these software components. A hedged sketch of such a curve fit appears after this table. |
| Experiment Setup | Yes | Table 6: Overview of model sizes and optimization hyper-parameters. All models are trained with 4k warmup steps, with the learning rate increasing linearly from 0 to 3e-4 and then decaying with the inverse_sqrt learning rate scheduler. The label smoothing term is set to 0.1, following NMT literature convention. Evaluation is done every 5k steps. A sketch of this schedule appears after the table. |
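The paper fits its scaling curves with `scipy.optimize.curve_fit` (footnote 1). Below is a minimal sketch of such a fit; the two-term power-law functional form, the toy data, and the initial guesses are illustrative assumptions, not the paper's exact Double Power Law parameterization.

```python
# Minimal sketch: fitting a two-term power-law curve with scipy.optimize.curve_fit,
# in the spirit of the paper's Double Power Law fits (footnote 1).
# NOTE: this functional form is a HYPOTHETICAL placeholder, not the paper's exact one.
import numpy as np
from scipy.optimize import curve_fit

def double_power_law(w, a, alpha, b, beta, c):
    # w: sampling weight of one translation direction, 0 < w < 1
    return a * np.power(w, -alpha) + b * np.power(1.0 - w, -beta) + c

# Toy data: validation losses observed at several sampling weights (illustrative only)
weights = np.array([0.1, 0.2, 0.3, 0.5, 0.7, 0.9])
losses = np.array([3.10, 2.75, 2.60, 2.48, 2.52, 2.90])

# Fit the five parameters from data; p0 is an arbitrary starting point
params, _ = curve_fit(double_power_law, weights, losses,
                      p0=[1.0, 0.5, 1.0, 0.5, 2.0], maxfev=10000)
print(dict(zip(["a", "alpha", "b", "beta", "c"], params)))
```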
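For the schedule described in the setup row, the following is a minimal sketch of linear warmup followed by inverse-square-root decay, assuming fairseq's standard `inverse_sqrt` behavior (peak LR scaled by `sqrt(warmup_steps / step)` after warmup); fairseq's own scheduler implementation is the authoritative reference.

```python
# Minimal sketch of the described schedule: linear warmup from 0 to 3e-4 over
# 4k steps, then inverse-square-root decay (fairseq-style inverse_sqrt).
def learning_rate(step: int, peak_lr: float = 3e-4, warmup_steps: int = 4000) -> float:
    if step < warmup_steps:
        return peak_lr * step / warmup_steps           # linear warmup
    return peak_lr * (warmup_steps / step) ** 0.5      # inverse_sqrt decay

# Example: learning rate at a few training steps
for s in (1000, 4000, 16000, 100000):
    print(s, f"{learning_rate(s):.2e}")
```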