Sparse Persistent RNNs: Squeezing Large Recurrent Networks On-Chip

Authors: Feiwen Zhu, Jeff Pool, Michael Andersch, Jeremy Appleyard, Fung Xie

ICLR 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We achieve speedups of over 6× over the next best algorithm for a hidden layer of size 2304, batch size of 4, and a density of 30%. Further, our technique allows for models of over 5× the size to fit on a GPU for a speedup of 2×, enabling larger networks to help advance the state-of-the-art. We perform case studies on NMT and speech recognition tasks in the appendix, accelerating their recurrent layers by up to 3×.
Researcher Affiliation | Industry | Feiwen Zhu, Jeff Pool, Michael Andersch, Jeremy Appleyard & Fung Xie, NVIDIA, {mzhu,jpool,mandersch,jappleyard,ftse}@nvidia.com
Pseudocode | Yes | APPENDIX A: ALGORITHM FOR BANK-AWARE WEIGHT LAYOUT... Algorithm 1: Optimize a row of nonzero weights to minimize bank conflicts (an illustrative sketch of such a layout pass follows the table)
Open Source Code | No | The paper describes its methods and algorithms but does not include any explicit statement about making its source code publicly available or provide a repository link for the methodology described.
Open Datasets | Yes | We use OpenNMT (Klein et al., 2017) to perform translation from English to German using the WMT15 data set as our training data and the newstest2013 data set for validation.
Dataset Splits | Yes | We use OpenNMT (Klein et al., 2017) to perform translation from English to German using the WMT15 data set as our training data and the newstest2013 data set for validation.
Hardware Specification | Yes | Our sparse persistent code is compiled in CUDA 9.0, and all tests are run on a NVIDIA Tesla V100.
Software Dependencies | Yes | Our sparse persistent code is compiled in CUDA 9.0, and all tests are run on a NVIDIA Tesla V100.
Experiment Setup | Yes | Table 1: A naïve implementation has limited performance; our optimizations are necessary to achieve good results. (Layer size = 1152, batch size = 4, density = 10%, #timesteps = 256.)
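
The Pseudocode row above refers to the paper's Appendix A, Algorithm 1 (bank-aware weight layout), which is not reproduced in this report. As a rough illustration of the general idea only, the C++ sketch below greedily reorders one row's nonzero weights so that each warp-sized group of loads touches as many distinct shared-memory banks as possible. The struct and function names, the 32-bank and 32-lane assumptions, and the greedy strategy are illustrative assumptions, not the paper's actual algorithm.

// Illustrative sketch (not the paper's Algorithm 1): reorder the nonzeros of one
// weight row so that each group of 32 nonzeros, the group a warp would consume in
// one step of a persistent kernel, reads activations from as many distinct
// shared-memory banks as possible. Assumes 32 four-byte-wide banks and that the
// activation h[column] is stored at shared-memory word `column`.
#include <cstddef>
#include <cstdint>
#include <vector>

struct NonZero {
    float    value;   // weight value
    uint32_t column;  // column index, i.e. which activation this weight multiplies
};

constexpr int kWarpSize = 32;
constexpr int kNumBanks = 32;

std::vector<NonZero> LayoutRow(std::vector<NonZero> nz) {
    std::vector<NonZero> out;
    out.reserve(nz.size());
    while (!nz.empty()) {
        bool bank_used[kNumBanks] = {false};
        // Fill one warp-sized group, preferring nonzeros whose activation falls
        // in a bank this group has not touched yet; fall back to the first
        // remaining nonzero when every candidate would conflict.
        for (int lane = 0; lane < kWarpSize && !nz.empty(); ++lane) {
            std::size_t pick = 0;
            for (std::size_t i = 0; i < nz.size(); ++i) {
                if (!bank_used[nz[i].column % kNumBanks]) { pick = i; break; }
            }
            bank_used[nz[pick].column % kNumBanks] = true;
            out.push_back(nz[pick]);
            nz.erase(nz.begin() + static_cast<std::ptrdiff_t>(pick));
        }
    }
    return out;
}

On Volta-class GPUs, consecutive 4-byte shared-memory words map to consecutive banks, so spreading the column indices of a warp's concurrent reads across distinct banks avoids serialized accesses; this is the kind of effect a bank-aware layout pass targets.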