Parameter Differentiation Based Multilingual Neural Machine Translation

Authors: Qian Wang, Jiajun Zhang (pp. 11440-11448)

AAAI 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on multilingual datasets have demonstrated that our method significantly outperforms various strong baselines with different parameter sharing configurations.
Researcher Affiliation | Academia | Qian Wang (1,2), Jiajun Zhang (1,2)*; (1) National Laboratory of Pattern Recognition, Institute of Automation, CAS; (2) School of Artificial Intelligence, University of Chinese Academy of Sciences; {qian.wang, jjzhang}@nlpr.ia.cn
Pseudocode | Yes | Algorithm 1: Parameter Differentiation (an illustrative sketch of the gradient-similarity test behind it follows this table)
Open Source Code | No | The paper does not provide an explicit statement or link indicating that the code for its method has been open-sourced.
Open Datasets | Yes | We use the public OPUS and WMT multilingual datasets to evaluate our method on many-to-one (M2O) and one-to-many (O2M) translation scenarios, and the IWSLT datasets for the many-to-many (M2M) translation scenario. The OPUS dataset consists of English to 12 languages selected from the original OPUS-100 dataset (Zhang et al. 2020). The WMT dataset, with an unbalanced data distribution, is collected from the WMT 14, WMT 16, and WMT 18 benchmarks.
Dataset Splits | Yes | The held-out multi-way aligned validation data for measuring gradient similarities contains 4,000 sentences for each language, randomly selected and excluded from the training set (a minimal sampling sketch follows this table).
Hardware Specification | Yes | All the models are trained and tested on a single Nvidia V100 GPU.
Software Dependencies | No | The paper mentions using the Transformer architecture, the Adam optimizer, byte-pair encoding (BPE), and SacreBLEU, but does not provide specific version numbers for any software dependencies or libraries.
Experiment Setup | Yes | We conduct our experiments with the Transformer architecture and adopt the transformer base setting, which includes 6 encoder and decoder layers, 512/2048 hidden dimensions, and 8 attention heads. Dropout (p = 0.1) and label smoothing (ε_ls = 0.1) are applied during training but disabled during validation and inference. Each mini-batch contains roughly 8,192 tokens. We accumulate gradients and update the model every 4 steps for OPUS and 8 steps for WMT to simulate multi-GPU training. In inference, we use beam search with a beam size of 4 and a length penalty of 0.6. The total training step count Q is set to 400k for all experiments, and differentiation happens every N = 8,000 training steps. We set the expected model size to O = 2·O_0, i.e., twice the size of the original model (these settings are collected into a configuration sketch below).
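
The paper's Algorithm 1 is not reproduced in this report, so the following is only a minimal sketch of the kind of gradient-similarity test that could drive parameter differentiation: per-language gradients of a shared module are compared by cosine similarity on held-out validation data, and conflicting languages are split into groups that would each receive their own parameter copy. Function names, the threshold value, and the greedy two-way split heuristic are illustrative assumptions, not the authors' exact procedure.

```python
# Illustrative sketch of a gradient-similarity criterion for parameter
# differentiation. All names and the split heuristic are assumptions.
import torch


def cosine_similarity_matrix(task_grads: torch.Tensor) -> torch.Tensor:
    """task_grads: (num_tasks, num_params) flattened per-language gradients
    of one shared module, estimated on held-out validation data."""
    normed = torch.nn.functional.normalize(task_grads, dim=1)
    return normed @ normed.T


def should_differentiate(task_grads: torch.Tensor, threshold: float = 0.0) -> bool:
    """Flag a shared module whose per-language gradients conflict,
    i.e. whose minimum pairwise cosine similarity falls below `threshold`."""
    sim = cosine_similarity_matrix(task_grads)
    num_tasks = sim.size(0)
    off_diag = sim[~torch.eye(num_tasks, dtype=torch.bool)]
    return off_diag.min().item() < threshold


def split_tasks(task_grads: torch.Tensor) -> tuple[list[int], list[int]]:
    """Greedy two-way split: seed the groups with the most conflicting
    pair of languages, then assign every other language to the seed
    whose gradient it is more similar to."""
    sim = cosine_similarity_matrix(task_grads)
    num_tasks = sim.size(0)
    masked = sim.clone()
    masked.fill_diagonal_(float("inf"))       # ignore self-similarity
    a, b = divmod(masked.argmin().item(), num_tasks)
    group_a, group_b = [a], [b]
    for t in range(num_tasks):
        if t not in (a, b):
            (group_a if sim[t, a] >= sim[t, b] else group_b).append(t)
    return group_a, group_b


if __name__ == "__main__":
    # Toy example: 4 languages with 8-dimensional "gradients";
    # tasks 0-1 roughly agree while tasks 2-3 point the opposite way.
    g = torch.randn(1, 8)
    grads = torch.cat([g, 0.9 * g, -g, -1.1 * g]) + 0.05 * torch.randn(4, 8)
    if should_differentiate(grads):
        print(split_tasks(grads))  # roughly ([0, 1], [2, 3]), up to seed order
```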
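For the dataset split, here is a minimal sketch of how 4,000 held-out validation sentences per language could be sampled and excluded from training, as the report describes. The function name, arguments, and the fixed random seed are assumptions, not the authors' preprocessing code.

```python
# Minimal held-out split sketch: sample 4,000 sentences for validation
# and keep the rest for training. Names and the seed are assumptions.
import random


def hold_out_split(sentences, held_out_size=4000, seed=1):
    """Return (train, held_out), where `held_out` contains `held_out_size`
    sentences drawn uniformly at random and excluded from training."""
    rng = random.Random(seed)
    indices = set(rng.sample(range(len(sentences)), held_out_size))
    held_out = [s for i, s in enumerate(sentences) if i in indices]
    train = [s for i, s in enumerate(sentences) if i not in indices]
    return train, held_out
```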
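Finally, the experiment-setup values reported above, collected into a plain Python dict for quick reference. The key names and structure are ours; every value is taken from the setup description in the table.

```python
# Reported training/inference settings consolidated into one place.
# Key names are ours; values come from the experiment-setup row above.
CONFIG = {
    "architecture": "transformer_base",        # 6 encoder / 6 decoder layers
    "hidden_dim": 512,
    "ffn_dim": 2048,
    "attention_heads": 8,
    "dropout": 0.1,
    "label_smoothing": 0.1,
    "tokens_per_batch": 8192,
    "gradient_accumulation": {"OPUS": 4, "WMT": 8},
    "total_training_steps": 400_000,           # Q
    "differentiation_interval": 8_000,         # N
    "model_size_budget": "2x original",        # O = 2 * O_0
    "beam_size": 4,
    "length_penalty": 0.6,
}
```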