Learning Intrinsic Sparse Structures within Long Short-Term Memory
Authors: Wei Wen, Yuxiong He, Samyam Rajbhandari, Minjia Zhang, Wenhan Wang, Fang Liu, Bin Hu, Yiran Chen, Hai Li
ICLR 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our method achieves 10.59× speedup without losing any perplexity of a language modeling of the Penn Treebank dataset. It is also successfully evaluated through a compact model with only 2.69M weights for machine Question Answering of the SQuAD dataset. (Section 4: Experiments) |
| Researcher Affiliation | Collaboration | Wei Wen, Yiran Chen & Hai Li, Electrical and Computer Engineering, Duke University {wei.wen,yiran.chen,hai.li}@duke.edu; Yuxiong He, Samyam Rajbhandari, Minjia Zhang, Wenhan Wang, Fang Liu & Bin Hu, Business AI and Bing, Microsoft {yuxhe,samyamr,minjiaz,wenhanw,fangliu,binhu}@microsoft.com |
| Pseudocode | No | No pseudocode or algorithm blocks were found. |
| Open Source Code | Yes | Our source code is available: https://github.com/wenwei202/iss-rnns |
| Open Datasets | Yes | We evaluated our method by LSTMs and RHNs in language modeling of Penn Treebank dataset (Marcus et al. (1993)) and machine Question Answering of SQuAD dataset (Rajpurkar et al. (2016)). |
| Dataset Splits | Yes | Table 1: Learning ISS sparsity from scratch in stacked LSTMs. Columns: Method; Dropout keep ratio; Perplexity (validate, test) |
| Hardware Specification | Yes | To measure the inference speed, the experiments were run on a dual-socket Intel Xeon CPU E5-2673 v3 @ 2.40GHz processor with a total of 24 cores (12 per socket) and 128GB of memory. |
| Software Dependencies | Yes | Intel MKL library 2017 update 2 was used for matrix-multiplication operations. OpenMP runtime was utilized for parallelism. We used Intel C++ Compiler 17.0 to generate executables that were run on Windows Server 2016. |
| Experiment Setup | Yes | The same training scheme as the baseline is adopted to learn ISS sparsity, except a larger dropout keep ratio of 0.6 versus 0.35 of the baseline, because group Lasso regularization can also avoid over-fitting. All models are trained from scratch for 55 epochs. [...] For a specific application, we preset τ by cross validation. The maximum τ which sparsifies the dense model (baseline) without deteriorating its performance is selected. The validation of τ is performed only once and no training effort is needed. τ is 1.0e-4 for the stacked LSTMs in Penn Treebank, and it is 4.0e-4 for the RHN and the BiDAF model. |
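
The experiment-setup evidence describes the core mechanics: a group Lasso penalty over the paper's Intrinsic Sparse Structure (ISS) groups is added to the training loss, and after training, groups whose magnitude falls below a preset threshold τ (e.g. 1.0e-4 for the Penn Treebank stacked LSTMs) are zeroed out. Below is a minimal NumPy sketch of that idea, not the authors' implementation: the grouping here ties only the per-hidden-unit rows of a concatenated gate weight matrix (a simplification of the full ISS definition), and the names `iss_group_norms`, `sparsify`, and the regularization strength `lam` are illustrative assumptions.

```python
import numpy as np

# Hypothetical setup: an LSTM layer with hidden size H keeps its four gate
# weight blocks concatenated, giving a weight matrix of shape (H, 4H).
H = 8
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(H, 4 * H))

def iss_group_norms(W):
    """L2 norm of each (simplified) ISS group: one row per hidden unit
    spanning all four gate blocks."""
    return np.linalg.norm(W, axis=1)

def group_lasso_penalty(W, lam):
    """Group Lasso term lam * sum_g ||w_g||_2, added to the training loss."""
    return lam * iss_group_norms(W).sum()

def sparsify(W, tau):
    """Zero out ISS groups whose L2 norm is below the threshold tau."""
    mask = (iss_group_norms(W) >= tau).astype(W.dtype)  # 1 keeps a group, 0 prunes it
    return W * mask[:, None], int((mask == 0).sum())

penalty = group_lasso_penalty(W, lam=1e-3)   # lam is a hypothetical strength
W_sparse, pruned = sparsify(W, tau=1e-4)     # tau = 1.0e-4, matching the Penn Treebank setting
print(f"group Lasso penalty: {penalty:.4f}, pruned ISS groups: {pruned}/{H}")
```

In a real training run the penalty would be added to the task loss at every step so that whole groups are driven toward zero, and the τ thresholding would then remove those groups to shrink the hidden dimension, which is what yields the reported speedup.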