Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
On the Pareto Front of Multilingual Neural Machine Translation
Authors: Liang Chen, Shuming Ma, Dongdong Zhang, Furu Wei, Baobao Chang
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | By training over 200 multilingual models with various model sizes, data sizes, and language directions, we find it interesting that the performance of a certain translation direction does not always improve with the increase of its weight in the multi-task optimization objective. In our experiments, it achieves better performance than temperature searching and gradient manipulation methods with only 1/5 to 1/2 of the total training budget. |
| Researcher Affiliation | Collaboration | (1) National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University; (2) Microsoft Research |
| Pseudocode | No | No explicit pseudocode or algorithm blocks are present in the paper. |
| Open Source Code | Yes | We release the code at https://github.com/pkunlp-icler/ParetoMNMT for reproduction. |
| Open Datasets | Yes | We use datasets provided in WMT10 (Wang et al., 2020b) and WMT19 (Barrault et al., 2019) to conduct the MNMT experiment. The description of the datasets is listed in Appendix A. Table 5: The dataset descriptions for the main experiments. We randomly choose a subset of the full training set of a direction to form a smaller one. |
| Dataset Splits | No | The paper mentions using "validation loss" for checkpoint selection ("Evaluation is done every 5k steps, and we choose the best checkpoint with lowest average validation loss"), implying the existence of a validation set. However, it does not provide explicit details about the train/validation/test splits (e.g., percentages or sample counts per split) needed to reproduce the partitioning. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, memory specifications, or cloud computing instance types used for running the experiments. |
| Software Dependencies | No | The paper mentions using "fairseq (Ott et al., 2019) as the training framework" and the "scipy.optimize.curve_fit function from the scipy library" (footnote 1) but does not specify version numbers for these software components. A hedged usage sketch of curve_fit follows the table. |
| Experiment Setup | Yes | Table 6: Overview of model sizes and optimization hyper-parameters. All models are trained with 4k warmup steps, with the learning rate linearly increasing from 0 to 3e-4 and then decreasing with the inverse_sqrt learning rate scheduler (sketched after the table). The label smoothing term is set to 0.1 following the NMT literature convention. Evaluation is done every 5k steps. |
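
Since the Software Dependencies row cites `scipy.optimize.curve_fit` (footnote 1) without further detail, here is a minimal sketch of how that function is typically invoked. The power-law functional form, the synthetic data, and the parameter names below are illustrative assumptions, not the paper's actual fitting setup.

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(x, a, b, c):
    # Hypothetical three-parameter form: y = a * x**(-b) + c.
    return a * np.power(x, -b) + c

# Synthetic (sampling weight, validation loss) pairs, for illustration only.
weights = np.array([0.1, 0.2, 0.4, 0.6, 0.8, 1.0])
losses = np.array([3.20, 2.70, 2.30, 2.10, 2.00, 1.95])

# curve_fit returns the fitted parameters and their covariance matrix.
params, covariance = curve_fit(power_law, weights, losses, p0=(1.0, 0.5, 1.0))
a, b, c = params
print(f"fitted: a={a:.3f}, b={b:.3f}, c={c:.3f}")
```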
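The Experiment Setup row describes a linear warmup over 4k steps to a 3e-4 peak followed by inverse_sqrt decay. The sketch below reimplements that schedule under the assumption that it matches fairseq's standard `inverse_sqrt` behavior (decay proportional to 1/sqrt(step) after warmup); it is an illustration, not fairseq's own code.

```python
def inverse_sqrt_lr(step: int, peak_lr: float = 3e-4, warmup_steps: int = 4000) -> float:
    """Learning rate at a given (1-indexed) optimizer step."""
    if step < warmup_steps:
        # Linear warmup from 0 toward the peak rate.
        return peak_lr * step / warmup_steps
    # After warmup, decay proportionally to 1/sqrt(step); the factor
    # sqrt(warmup_steps) makes the schedule continuous at the peak.
    return peak_lr * (warmup_steps ** 0.5) * (step ** -0.5)

# The rate peaks at step 4000 (3e-4) and decays afterwards,
# e.g. halving by step 16000.
for s in (1, 2000, 4000, 16000, 100000):
    print(s, f"{inverse_sqrt_lr(s):.2e}")
```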