Accelerating Neural Machine Translation with Partial Word Embedding Compression
Authors: Fan Zhang, Mei Tu, Jinyao Yan (pp. 14356–14364)
AAAI 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on the Chinese-to-English translation task show that our method can reduce 74.35% of parameters of the word embedding and 74.42% of the FLOPs of the softmax layer. Meanwhile, the average BLEU score on the WMT test sets only drops 0.04. |
| Researcher Affiliation | Collaboration | 1 State Key Laboratory of Media Convergence and Communication, Communication University of China 2 Samsung Research China Beijing (SRC-B) |
| Pseudocode | Yes | Algorithm 1: Curriculum Learning for Compression |
| Open Source Code | No | The paper mentions implementing the Transformer in OpenNMT-tf but does not provide a link or statement about releasing the source code for their proposed method (P-VQ). |
| Open Datasets | Yes | We use the Chinese-to-English WMT corpora as our training corpus. In the evaluation step, we use WMT17, WMT18, and WMT19 test sets to evaluate the translation quality for each method. |
| Dataset Splits | No | The paper mentions using WMT corpora for training and test sets for evaluation, but does not explicitly specify train/validation/test dataset splits (e.g., percentages or exact counts for a validation set). |
| Hardware Specification | Yes | With a batch size of 128 at the sentence level on 4 P40 GPUs. The speed test environment for the GPU is a Tesla P40 with CUDA 10.2, while that for the CPU is GNU/Linux x86_64. |
| Software Dependencies | Yes | The speed test environment for the GPU is a Tesla P40 with CUDA 10.2, while that for the CPU is GNU/Linux x86_64. |
| Experiment Setup | Yes | We use the standard Transformer (Vaswani et al. 2017) as implemented in OpenNMT-tf (Klein et al. 2017; Abadi et al. 2016) with 6 layers, 512 embedding dimensions and 8 attention heads as our baseline model. In any training phase, we use the lazy Adam optimizer (Kingma and Ba 2014) with β1 = 0.9, β2 = 0.98 and ϵ = 10^-9. We use Noam decay as the learning rate scheduler with 4000 warmup steps and a learning rate factor of 2.0. With a batch size of 128 at the sentence level on 4 P40 GPUs, the training step count for the baseline and FEP is at least 300k, and the fine-tuning step count is at least 200k. In the prediction phase, we set the beam size to 4 and select the best BLEU score of the models in the last 10k steps to represent the model quality. The batch size is set to 1 when testing the real translation speed. |
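To make the quoted training schedule concrete, below is a minimal Python sketch of the Noam learning-rate decay with the reported settings (4000 warmup steps, factor 2.0, 512 embedding dimensions) alongside the reported Adam hyperparameters. The function name `noam_lr` and the standalone script layout are illustrative assumptions, not code from the paper or from OpenNMT-tf.

```python
# Minimal sketch of the reported schedule: Noam decay with 4000 warmup steps,
# a factor of 2.0, d_model = 512, and Adam with beta1 = 0.9, beta2 = 0.98,
# epsilon = 1e-9. Names and structure are illustrative, not the authors' code.

def noam_lr(step: int, d_model: int = 512, warmup: int = 4000, factor: float = 2.0) -> float:
    """Learning rate at a given training step under the Noam schedule."""
    step = max(step, 1)  # guard against step 0
    return factor * d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

# Optimizer hyperparameters as quoted in the experiment setup.
adam_kwargs = {"beta_1": 0.9, "beta_2": 0.98, "epsilon": 1e-9}

if __name__ == "__main__":
    for s in (1, 1_000, 4_000, 10_000, 300_000):
        print(f"step {s:>7}: lr = {noam_lr(s):.6f}")
```

At the reported peak (step 4000) this gives roughly 2.0 × 512^-0.5 × 4000^-0.5 ≈ 1.4e-3, after which the rate decays with the inverse square root of the step count.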