Learning Intrinsic Sparse Structures within Long Short-Term Memory

Authors: Wei Wen, Yuxiong He, Samyam Rajbhandari, Minjia Zhang, Wenhan Wang, Fang Liu, Bin Hu, Yiran Chen, Hai Li

ICLR 2018

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Our method achieves 10.59× speedup without losing any perplexity of a language modeling of Penn Treebank dataset. It is also successfully evaluated through a compact model with only 2.69M weights for machine Question Answering of SQuAD dataset. (Section 4, Experiments) |
| Researcher Affiliation | Collaboration | Wei Wen, Yiran Chen & Hai Li, Electrical and Computer Engineering, Duke University {wei.wen,yiran.chen,hai.li}@duke.edu; Yuxiong He, Samyam Rajbhandari, Minjia Zhang, Wenhan Wang, Fang Liu & Bin Hu, Business AI and Bing, Microsoft {yuxhe,samyamr,minjiaz,wenhanw,fangliu,binhu}@microsoft.com |
| Pseudocode | No | No pseudocode or algorithm blocks were found. |
| Open Source Code | Yes | Our source code is available: https://github.com/wenwei202/iss-rnns |
| Open Datasets | Yes | We evaluated our method by LSTMs and RHNs in language modeling of Penn Treebank dataset (Marcus et al. (1993)) and machine Question Answering of SQuAD dataset (Rajpurkar et al. (2016)). |
| Dataset Splits | Yes | Table 1: Learning ISS sparsity from scratch in stacked LSTMs (column headers: Method, Dropout keep ratio, Perplexity (validate, test)). |
| Hardware Specification | Yes | To measure the inference speed, the experiments were run on a dual-socket Intel Xeon CPU E5-2673 v3 @ 2.40GHz processor with a total of 24 cores (12 per socket) and 128GB of memory. |
| Software Dependencies | Yes | Intel MKL library 2017 update 2 was used for matrix-multiplication operations. OpenMP runtime was utilized for parallelism. We used Intel C++ Compiler 17.0 to generate executables that were run on Windows Server 2016. |
| Experiment Setup | Yes | The same training scheme as the baseline is adopted to learn ISS sparsity, except a larger dropout keep ratio of 0.6 versus 0.35 of the baseline, because group Lasso regularization can also avoid over-fitting. All models are trained from scratch for 55 epochs. [...] For a specific application, we preset τ by cross validation. The maximum τ which sparsifies the dense model (baseline) without deteriorating its performance is selected. The validation of τ is performed only once and no training effort is needed. τ is 1.0e-4 for the stacked LSTMs in Penn Treebank, and it is 4.0e-4 for the RHN and the BiDAF model. (See the sketch after this table.) |
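
The Experiment Setup row refers to group Lasso regularization and a sparsification threshold τ (1.0e-4 or 4.0e-4 depending on the model). Below is a minimal NumPy sketch of those two ingredients: a group Lasso penalty over ISS-style weight groups and a τ-thresholding step that zeroes small weights. The group layout, the regularization strength `lam`, and the toy matrix size are illustrative assumptions, not the authors' implementation; their TensorFlow code is in the linked repository (https://github.com/wenwei202/iss-rnns).

```python
# Minimal sketch of the quoted setup; group layout and lam are assumptions,
# not the authors' implementation.
import numpy as np

def group_lasso_penalty(weight, groups, lam):
    """lam * sum of L2 norms over weight groups.

    `groups` is a list of flat index arrays; each array is assumed to pick
    out one ISS-style group (e.g. all weights tied to one hidden unit).
    """
    flat = weight.ravel()
    return lam * sum(np.linalg.norm(flat[g]) for g in groups)

def zero_small_weights(weight, tau):
    """Zero out weights whose absolute value falls below the threshold tau
    (tau = 1.0e-4 for the stacked LSTMs, 4.0e-4 for the RHN/BiDAF models
    in the quoted setup)."""
    pruned = weight.copy()
    pruned[np.abs(pruned) < tau] = 0.0
    return pruned

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.normal(scale=1e-3, size=(8, 8))                      # toy weight matrix
    groups = [np.arange(i * 8, (i + 1) * 8) for i in range(8)]   # hypothetical: one group per row
    print("group Lasso penalty:", group_lasso_penalty(W, groups, lam=1e-4))
    W_sparse = zero_small_weights(W, tau=1e-4)
    print("fraction of zeroed weights:", np.mean(W_sparse == 0.0))
```

In the full method, the penalty is added to the training loss so the optimizer drives whole ISS groups toward zero, after which the thresholding removes them; the sketch only shows the two steps in isolation on a random matrix.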