Parameter Differentiation Based Multilingual Neural Machine Translation

Authors: Qian Wang, Jiajun Zhang (pp. 11440-11448)

AAAI 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on multilingual datasets have demonstrated that our method significantly outperforms various strong baselines with different parameter sharing configurations.
Researcher Affiliation | Academia | Qian Wang (1,2), Jiajun Zhang (1,2)*; (1) National Laboratory of Pattern Recognition, Institute of Automation, CAS; (2) School of Artificial Intelligence, University of Chinese Academy of Sciences; {qian.wang, jjzhang}@nlpr.ia.cn
Pseudocode | Yes | Algorithm 1: Parameter Differentiation (an illustrative sketch of the gradient-similarity test behind it follows this table)
Open Source Code | No | The paper does not provide an explicit statement or link indicating that the code for its method has been open-sourced.
Open Datasets | Yes | We use the public OPUS and WMT multilingual datasets to evaluate our method on many-to-one (M2O) and one-to-many (O2M) translation scenarios, and the IWSLT datasets for the many-to-many (M2M) translation scenario. The OPUS dataset consists of English to 12 languages selected from the original OPUS-100 dataset (Zhang et al. 2020). The WMT dataset, with an unbalanced data distribution, is collected from the WMT 14, WMT 16, and WMT 18 benchmarks.
Dataset Splits | Yes | The held-out multi-way aligned validation data for measuring gradient similarities contains 4,000 sentences for each language, randomly selected and excluded from the training set (a minimal sampling sketch follows this table).
Hardware Specification | Yes | All the models are trained and tested on a single Nvidia V100 GPU.
Software Dependencies | No | The paper mentions using the Transformer architecture, the Adam optimizer, byte-pair encoding (BPE), and SacreBLEU, but does not provide specific version numbers for any software dependencies or libraries.
Experiment Setup | Yes | We conduct our experiments with the Transformer architecture and adopt the transformer base setting, which includes 6 encoder and decoder layers, 512/2048 hidden dimensions, and 8 attention heads. Dropout (p = 0.1) and label smoothing (ε_ls = 0.1) are applied during training but disabled during validation and inference. Each mini-batch contains roughly 8,192 tokens. We accumulate gradients and update the model every 4 steps for OPUS and 8 steps for WMT to simulate multi-GPU training. In inference, we use beam search with a beam size of 4 and a length penalty of 0.6. The total training step count Q is set to 400k for all experiments, and differentiation happens every N = 8,000 training steps. We set the expected model size to O = 2·O_0, i.e., twice the size of the original model (these settings are collected into a configuration sketch below).
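
The paper's Algorithm 1 is not reproduced in this report, so the following is only a minimal sketch of the kind of gradient-similarity test that could drive parameter differentiation: per-language gradients of a shared module are compared by cosine similarity on held-out validation data, and conflicting languages are split into groups that would each receive their own parameter copy. Function names, the threshold value, and the greedy two-way split heuristic are illustrative assumptions, not the authors' exact procedure.

```python
# Illustrative sketch of a gradient-similarity criterion for parameter
# differentiation. All names and the split heuristic are assumptions.
import torch


def cosine_similarity_matrix(task_grads: torch.Tensor) -> torch.Tensor:
    """task_grads: (num_tasks, num_params) flattened per-language gradients
    of one shared module, estimated on held-out validation data."""
    normed = torch.nn.functional.normalize(task_grads, dim=1)
    return normed @ normed.T


def should_differentiate(task_grads: torch.Tensor, threshold: float = 0.0) -> bool:
    """Flag a shared module whose per-language gradients conflict,
    i.e. whose minimum pairwise cosine similarity falls below `threshold`."""
    sim = cosine_similarity_matrix(task_grads)
    num_tasks = sim.size(0)
    off_diag = sim[~torch.eye(num_tasks, dtype=torch.bool)]
    return off_diag.min().item() < threshold


def split_tasks(task_grads: torch.Tensor) -> tuple[list[int], list[int]]:
    """Greedy two-way split: seed the groups with the most conflicting
    pair of languages, then assign every other language to the seed
    whose gradient it is more similar to."""
    sim = cosine_similarity_matrix(task_grads)
    num_tasks = sim.size(0)
    masked = sim.clone()
    masked.fill_diagonal_(float("inf"))       # ignore self-similarity
    a, b = divmod(masked.argmin().item(), num_tasks)
    group_a, group_b = [a], [b]
    for t in range(num_tasks):
        if t not in (a, b):
            (group_a if sim[t, a] >= sim[t, b] else group_b).append(t)
    return group_a, group_b


if __name__ == "__main__":
    # Toy example: 4 languages with 8-dimensional "gradients";
    # tasks 0-1 roughly agree while tasks 2-3 point the opposite way.
    g = torch.randn(1, 8)
    grads = torch.cat([g, 0.9 * g, -g, -1.1 * g]) + 0.05 * torch.randn(4, 8)
    if should_differentiate(grads):
        print(split_tasks(grads))  # roughly ([0, 1], [2, 3]), up to seed order
```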
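For the dataset split, here is a minimal sketch of how 4,000 held-out validation sentences per language could be sampled and excluded from training, as the report describes. The function name, arguments, and the fixed random seed are assumptions, not the authors' preprocessing code.

```python
# Minimal held-out split sketch: sample 4,000 sentences for validation
# and keep the rest for training. Names and the seed are assumptions.
import random


def hold_out_split(sentences, held_out_size=4000, seed=1):
    """Return (train, held_out), where `held_out` contains `held_out_size`
    sentences drawn uniformly at random and excluded from training."""
    rng = random.Random(seed)
    indices = set(rng.sample(range(len(sentences)), held_out_size))
    held_out = [s for i, s in enumerate(sentences) if i in indices]
    train = [s for i, s in enumerate(sentences) if i not in indices]
    return train, held_out
```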
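Finally, the experiment-setup values reported above, collected into a plain Python dict for quick reference. The key names and structure are ours; every value is taken from the setup description in the table.

```python
# Reported training/inference settings consolidated into one place.
# Key names are ours; values come from the experiment-setup row above.
CONFIG = {
    "architecture": "transformer_base",        # 6 encoder / 6 decoder layers
    "hidden_dim": 512,
    "ffn_dim": 2048,
    "attention_heads": 8,
    "dropout": 0.1,
    "label_smoothing": 0.1,
    "tokens_per_batch": 8192,
    "gradient_accumulation": {"OPUS": 4, "WMT": 8},
    "total_training_steps": 400_000,           # Q
    "differentiation_interval": 8_000,         # N
    "model_size_budget": "2x original",        # O = 2 * O_0
    "beam_size": 4,
    "length_penalty": 0.6,
}
```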