Fast Parallel Training of Neural Language Models
Authors: Tong Xiao, Jingbo Zhu, Tongran Liu, Chunliang Zhang
IJCAI 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper we present a sampling-based approach to reducing data transmission for better scaling of NLMs. As a bonus, the resulting model also improves the training speed on a single device. Our approach yields significant speed improvements on a recurrent neural network-based language model. On four NVIDIA GTX1080 GPUs, it achieves a speedup of 2.1+ times over the standard asynchronous stochastic gradient descent baseline, yet with no increase in perplexity. This is even 4.2 times faster than the naive single-GPU counterpart. We experimented with the proposed approach in a recurrent neural network-based language model. (A generic, hedged illustration of this sampling idea appears after the table.) |
| Researcher Affiliation | Academia | NiuTrans Lab., Northeastern University, Shenyang 110819, China; Institute of Psychology (CAS), Beijing 100101, China; {xiaotong,zhujingbo,zhangcl}@mail.neu.edu.cn, liutr@psych.ac.cn |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. Figure 1 illustrates the model architecture and parallel training, but it is a diagram, not pseudocode. |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology described. There is no mention of a code repository link, an explicit code release statement, or code being available in supplementary materials. |
| Open Datasets | Yes | Three different-sized tasks were chosen for training and evaluation. Penn Treebank (PTB). It is the standard data set used in evaluating LMs. We used sections 00-20 as the training data, sections 21-22 as the validation data, and sections 23-24 as the test data. FBIS. We used the Chinese side of the FBIS corpus (LDC2003E14) for a medium sized task. Xinhua. For large-scale training, we generated a 4.5 million sentence set from the Xinhua portion of the English Gigaword (LDC2011T07). |
| Dataset Splits | Yes | Penn Treebank (PTB). We used sections 00-20 as the training data, sections 21-22 as the validation data, and sections 23-24 as the test data. FBIS. We extracted a 4,000 sentence data set for validation and a 4,027 sentence data set for test. Xinhua. The validation and test sets (5,050 and 5,100 sentences) were from the same source but with no overlap with the training data. Table 1 (Data and model settings) also lists validation (words) and test (words) set sizes. |
| Hardware Specification | Yes | On four NVIDIA GTX1080 GPUs... We ran all experiments on a machine with four NVIDIA GTX1080 GPUs. |
| Software Dependencies | No | The paper mentions using long short-term memory (LSTM) and asynchronous SGD but does not provide specific version numbers for any software libraries, frameworks, or dependencies used in the experiments. |
| Experiment Setup | Yes | Table 1 (Data and model settings): embedding size 512 (PTB) / 512 (FBIS) / 1024 (Xinhua); hidden size 512 / 512 / 1024; minibatch size 64 / 64 / 64. The weights in all the networks were initialized with a uniform distribution in [-0.1, 0.1]. The gradients were clipped so that their norm was bounded by 3.0. For all experiments, training was iterated for 20 epochs. We started with a learning rate of 0.7 and then halved the learning rate if the perplexity increased on the validation set. In our sampling-based approach, we set µ = 5% by default. For softmax layers, we set p = 10% and q = 10%. For hidden layers, we set q = 90%. (These values are gathered into a hedged configuration sketch after the table.) |
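
The Experiment Setup row packs the reported hyperparameters into a single cell, so a short sketch may make them easier to scan. The snippet below is a minimal sketch that collects the stated values into one configuration object together with the perplexity-based learning-rate rule; the class name, field names, and helper function are illustrative assumptions, not code released by the authors (the paper releases none).

```python
# Hypothetical configuration object collecting the hyperparameters quoted above;
# the names are illustrative, only the numeric values come from the paper.
from dataclasses import dataclass


@dataclass
class LMTrainingConfig:
    # Model sizes per task from Table 1 (PTB / FBIS / Xinhua).
    embedding_size: int = 512      # 1024 for the Xinhua task
    hidden_size: int = 512         # 1024 for the Xinhua task
    minibatch_size: int = 64       # same for all three tasks
    # Optimization settings reported in the Experiment Setup row.
    init_range: float = 0.1        # uniform init in [-0.1, 0.1]
    grad_clip_norm: float = 3.0    # gradient norm bounded by 3.0
    epochs: int = 20
    initial_lr: float = 0.7
    # Sampling rates for the proposed approach.
    mu: float = 0.05               # default sampling rate, mu = 5%
    softmax_p: float = 0.10        # p = 10% for softmax layers
    softmax_q: float = 0.10        # q = 10% for softmax layers
    hidden_q: float = 0.90         # q = 90% for hidden layers


def next_learning_rate(lr: float, val_ppl: float, prev_val_ppl: float) -> float:
    """Halve the learning rate when validation perplexity increases,
    otherwise keep it unchanged, as described in the setup quote."""
    return lr / 2.0 if val_ppl > prev_val_ppl else lr
```

Starting from `initial_lr = 0.7`, the rate drops to 0.35 the first time validation perplexity rises, then to 0.175, and so on over the 20 training epochs.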
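The quoted abstract describes a sampling-based approach that reduces the data transmitted between GPUs during asynchronous SGD, but this summary carries no algorithmic detail beyond the sampling rates. The sketch below is only a generic illustration of the underlying idea, namely sending a sampled subset of gradient rows rather than the full gradient matrix; the row-sampling scheme, function names, and NumPy-based update are assumptions and should not be read as the authors' exact method.

```python
# Generic illustration (not the paper's algorithm): transmit only a sampled
# fraction of gradient rows to cut worker-to-server traffic in async SGD.
import numpy as np


def sample_gradient_rows(grad: np.ndarray, rate: float,
                         rng: np.random.Generator):
    """Return (row_indices, sub_gradient) covering roughly `rate` of the rows.

    grad : full gradient matrix for one layer, shape (rows, cols)
    rate : fraction of rows to keep, e.g. 0.10 for a softmax layer
    """
    n_rows = grad.shape[0]
    n_keep = max(1, int(round(rate * n_rows)))
    rows = rng.choice(n_rows, size=n_keep, replace=False)
    return rows, grad[rows]


def apply_sampled_update(params: np.ndarray, rows: np.ndarray,
                         sub_grad: np.ndarray, lr: float) -> None:
    """Server side: update only the rows that were actually transmitted."""
    params[rows] -= lr * sub_grad


# Example: send 10% of a (vocab x hidden) softmax-layer gradient.
rng = np.random.default_rng(0)
grad = rng.standard_normal((10000, 512)).astype(np.float32)
rows, sub = sample_gradient_rows(grad, rate=0.10, rng=rng)
params = np.zeros((10000, 512), dtype=np.float32)
apply_sampled_update(params, rows, sub, lr=0.7)
```

The sketch targets the softmax layer because its weight matrix scales with the vocabulary and therefore dominates the traffic, which would be consistent with the much smaller rates the paper reports for softmax layers (p = q = 10%) than for hidden layers (q = 90%), if those rates denote kept fractions.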