Learning Intrinsic Sparse Structures within Long Short-Term Memory

Authors: Wei Wen, Yuxiong He, Samyam Rajbhandari, Minjia Zhang, Wenhan Wang, Fang Liu, Bin Hu, Yiran Chen, Hai Li

ICLR 2018

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Our method achieves 10.59× speedup without losing any perplexity of a language modeling of Penn Treebank dataset. It is also successfully evaluated through a compact model with only 2.69M weights for machine Question Answering of SQuAD dataset. (Section 4, Experiments) |
| Researcher Affiliation | Collaboration | Wei Wen, Yiran Chen & Hai Li, Electrical and Computer Engineering, Duke University {wei.wen,yiran.chen,hai.li}@duke.edu; Yuxiong He, Samyam Rajbhandari, Minjia Zhang, Wenhan Wang, Fang Liu & Bin Hu, Business AI and Bing, Microsoft {yuxhe,samyamr,minjiaz,wenhanw,fangliu,binhu}@microsoft.com |
| Pseudocode | No | No pseudocode or algorithm blocks were found. |
| Open Source Code | Yes | Our source code is available: https://github.com/wenwei202/iss-rnns |
| Open Datasets | Yes | We evaluated our method by LSTMs and RHNs in language modeling of Penn Treebank dataset (Marcus et al. (1993)) and machine Question Answering of SQuAD dataset (Rajpurkar et al. (2016)). |
| Dataset Splits | Yes | Table 1: Learning ISS sparsity from scratch in stacked LSTMs (column headers: Method, Dropout keep ratio, Perplexity (validate, test)). |
| Hardware Specification | Yes | To measure the inference speed, the experiments were run on a dual-socket Intel Xeon CPU E5-2673 v3 @ 2.40GHz processor with a total of 24 cores (12 per socket) and 128GB of memory. |
| Software Dependencies | Yes | Intel MKL library 2017 update 2 was used for matrix-multiplication operations. OpenMP runtime was utilized for parallelism. We used Intel C++ Compiler 17.0 to generate executables that were run on Windows Server 2016. |
| Experiment Setup | Yes | The same training scheme as the baseline is adopted to learn ISS sparsity, except a larger dropout keep ratio of 0.6 versus 0.35 of the baseline, because group Lasso regularization can also avoid over-fitting. All models are trained from scratch for 55 epochs. [...] For a specific application, we preset τ by cross validation. The maximum τ which sparsifies the dense model (baseline) without deteriorating its performance is selected. The validation of τ is performed only once and no training effort is needed. τ is 1.0e-4 for the stacked LSTMs in Penn Treebank, and it is 4.0e-4 for the RHN and the BiDAF model. (See the sketch after this table.) |
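
The Experiment Setup row refers to group Lasso regularization and a sparsification threshold τ (1.0e-4 or 4.0e-4 depending on the model). Below is a minimal NumPy sketch of those two ingredients: a group Lasso penalty over ISS-style weight groups and a τ-thresholding step that zeroes small weights. The group layout, the regularization strength `lam`, and the toy matrix size are illustrative assumptions, not the authors' implementation; their TensorFlow code is in the linked repository (https://github.com/wenwei202/iss-rnns).

```python
# Minimal sketch of the quoted setup; group layout and lam are assumptions,
# not the authors' implementation.
import numpy as np

def group_lasso_penalty(weight, groups, lam):
    """lam * sum of L2 norms over weight groups.

    `groups` is a list of flat index arrays; each array is assumed to pick
    out one ISS-style group (e.g. all weights tied to one hidden unit).
    """
    flat = weight.ravel()
    return lam * sum(np.linalg.norm(flat[g]) for g in groups)

def zero_small_weights(weight, tau):
    """Zero out weights whose absolute value falls below the threshold tau
    (tau = 1.0e-4 for the stacked LSTMs, 4.0e-4 for the RHN/BiDAF models
    in the quoted setup)."""
    pruned = weight.copy()
    pruned[np.abs(pruned) < tau] = 0.0
    return pruned

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.normal(scale=1e-3, size=(8, 8))                      # toy weight matrix
    groups = [np.arange(i * 8, (i + 1) * 8) for i in range(8)]   # hypothetical: one group per row
    print("group Lasso penalty:", group_lasso_penalty(W, groups, lam=1e-4))
    W_sparse = zero_small_weights(W, tau=1e-4)
    print("fraction of zeroed weights:", np.mean(W_sparse == 0.0))
```

In the full method, the penalty is added to the training loss so the optimizer drives whole ISS groups toward zero, after which the thresholding removes them; the sketch only shows the two steps in isolation on a random matrix.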