Sparse Persistent RNNs: Squeezing Large Recurrent Networks On-Chip
Authors: Feiwen Zhu, Jeff Pool, Michael Andersch, Jeremy Appleyard, Fung Xie
ICLR 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We achieve speedups of over 6× over the next best algorithm for a hidden layer of size 2304, batch size of 4, and a density of 30%. Further, our technique allows for models of over 5× the size to fit on a GPU for a speedup of 2×, enabling larger networks to help advance the state-of-the-art. We perform case studies on NMT and speech recognition tasks in the appendix, accelerating their recurrent layers by up to 3×. |
| Researcher Affiliation | Industry | Feiwen Zhu, Jeff Pool, Michael Andersch, Jeremy Appleyard & Fung Xie, NVIDIA {mzhu,jpool,mandersch,jappleyard,ftse}@nvidia.com |
| Pseudocode | Yes | APPENDIX A: ALGORITHM FOR BANK-AWARE WEIGHT LAYOUT... Algorithm 1: Optimize a row of nonzero weights to minimize bank conflicts (a hedged sketch of such a layout pass appears below the table) |
| Open Source Code | No | The paper describes its methods and algorithms but does not include an explicit statement about making its source code publicly available, nor does it provide a repository link for the methodology described. |
| Open Datasets | Yes | We use OpenNMT (Klein et al., 2017) to perform translation from English to German using the WMT15 data set as our training data and the newstest2013 data set for validation. |
| Dataset Splits | Yes | We use OpenNMT (Klein et al., 2017) to perform translation from English to German using the WMT15 data set as our training data and the newstest2013 data set for validation. |
| Hardware Specification | Yes | Our sparse persistent code is compiled in CUDA 9.0, and all tests are run on a NVIDIA Tesla V100. |
| Software Dependencies | Yes | Our sparse persistent code is compiled in CUDA 9.0, and all tests are run on a NVIDIA Tesla V100. |
| Experiment Setup | Yes | Table 1: A naïve implementation has limited performance; our optimizations are necessary to achieve good results. (Layer size = 1152, batch size = 4, density = 10%, #timesteps = 256.) |
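The paper's Appendix A pseudocode (Algorithm 1) is not reproduced here, so the following is only a minimal greedy sketch of what a bank-aware weight layout pass could look like, assuming 32 shared-memory banks and a least-loaded-bank heuristic; the function name, the tie-breaking rule, and the greedy strategy are assumptions of this sketch, not the authors' exact algorithm.

```python
# Hypothetical sketch of a bank-aware layout pass for one row of nonzero
# weights. Written from the paper's high-level description only; the 32-bank
# assumption and the greedy least-loaded heuristic are this sketch's choices.

NUM_BANKS = 32  # shared-memory banks on recent NVIDIA GPUs (assumption)

def layout_row(nonzero_cols):
    """Reorder one row's nonzero column indices so that successive weights
    prefer shared-memory banks that have been used least so far.

    nonzero_cols: list of column indices holding nonzero weights in this row.
    Returns the reordered list of column indices.
    """
    remaining = list(nonzero_cols)
    ordered = []
    bank_load = [0] * NUM_BANKS  # how many placed weights map to each bank

    while remaining:
        # Greedily pick the remaining column whose bank is least loaded,
        # breaking ties by original column index to keep the pass stable.
        best = min(remaining, key=lambda c: (bank_load[c % NUM_BANKS], c))
        remaining.remove(best)
        bank_load[best % NUM_BANKS] += 1
        ordered.append(best)

    return ordered

if __name__ == "__main__":
    # Toy example: columns 0, 32, and 64 all map to bank 0; the pass
    # interleaves them with columns from other banks.
    print(layout_row([0, 32, 64, 1, 33, 2]))
```

The intent, as the quoted Algorithm 1 title suggests, is that spreading a row's nonzero weights across distinct banks reduces shared-memory bank conflicts when the persistent kernel's threads read their activations; the specific cost model and ordering used by the authors may differ from this sketch.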