Accelerating Neural Machine Translation with Partial Word Embedding Compression
Authors: Fan Zhang, Mei Tu, Jinyao Yan (pp. 14356–14364)
AAAI 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on the Chinese-to-English translation task show that our method can reduce 74.35% of parameters of the word embedding and 74.42% of the FLOPs of the softmax layer. Meanwhile, the average BLEU score on the WMT test sets only drops 0.04. |
| Researcher Affiliation | Collaboration | 1 State Key Laboratory of Media Convergence and Communication, Communication University of China 2 Samsung Research China Beijing (SRC-B) |
| Pseudocode | Yes | Algorithm 1: Curriculum Learning for Compression |
| Open Source Code | No | The paper mentions implementing the Transformer in OpenNMT-tf but does not provide a link or statement about releasing the source code for their proposed method (P-VQ). |
| Open Datasets | Yes | We use the Chinese-to-English WMT corpora as our training corpus. In the evaluation step, we use WMT17, WMT18, and WMT19 test sets to evaluate the translation quality for each method. |
| Dataset Splits | No | The paper mentions using WMT corpora for training and test sets for evaluation, but does not explicitly specify train/validation/test dataset splits (e.g., percentages or exact counts for a validation set). |
| Hardware Specification | Yes | With a batch size of 128 at the sentence level on 4 P40 GPUs. The speed test environment for the GPU is a Tesla P40 with CUDA 10.2, while that for the CPU is GNU/Linux x86_64. |
| Software Dependencies | Yes | The speed test environment for the GPU is a Tesla P40 with CUDA 10.2, while that for the CPU is GNU/Linux x86_64. |
| Experiment Setup | Yes | We use the standard Transformer (Vaswani et al. 2017) as implemented in OpenNMT-tf (Klein et al. 2017; Abadi et al. 2016) with 6 layers, 512 embedding dimensions and 8 attention heads as our baseline model. In any training phase, we use the lazy Adam optimizer (Kingma and Ba 2014) with β1 = 0.9, β2 = 0.98 and ϵ = 10^-9. We use Noam decay as the learning rate scheduler with 4000 warmup steps and a learning rate factor of 2.0. With a batch size of 128 at the sentence level on 4 P40 GPUs, the training step count for the baseline and FEP is at least 300k, and the fine-tuning step count is at least 200k. In the prediction phase, we set the beam size to 4 and select the best BLEU score of the models in the last 10k steps to represent the model quality. The batch size is set to 1 when testing the real translation speed. |
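To make the quoted training schedule concrete, below is a minimal Python sketch of the Noam learning-rate decay with the reported settings (4000 warmup steps, factor 2.0, 512 embedding dimensions) alongside the reported Adam hyperparameters. The function name `noam_lr` and the standalone script layout are illustrative assumptions, not code from the paper or from OpenNMT-tf.

```python
# Minimal sketch of the reported schedule: Noam decay with 4000 warmup steps,
# a factor of 2.0, d_model = 512, and Adam with beta1 = 0.9, beta2 = 0.98,
# epsilon = 1e-9. Names and structure are illustrative, not the authors' code.

def noam_lr(step: int, d_model: int = 512, warmup: int = 4000, factor: float = 2.0) -> float:
    """Learning rate at a given training step under the Noam schedule."""
    step = max(step, 1)  # guard against step 0
    return factor * d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

# Optimizer hyperparameters as quoted in the experiment setup.
adam_kwargs = {"beta_1": 0.9, "beta_2": 0.98, "epsilon": 1e-9}

if __name__ == "__main__":
    for s in (1, 1_000, 4_000, 10_000, 300_000):
        print(f"step {s:>7}: lr = {noam_lr(s):.6f}")
```

At the reported peak (step 4000) this gives roughly 2.0 × 512^-0.5 × 4000^-0.5 ≈ 1.4e-3, after which the rate decays with the inverse square root of the step count.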