Persistent RNNs: Stashing Recurrent Weights On-Chip
Authors: Greg Diamos, Shubho Sengupta, Bryan Catanzaro, Mike Chrzanowski, Adam Coates, Erich Elsen, Jesse Engel, Awni Hannun, Sanjeev Satheesh
ICML 2016 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our initial implementation sustains 2.8 TFLOP/s at a minibatch size of 4 on an NVIDIA Titan X GPU. This provides a 16x reduction in activation memory footprint, enables model training with 12x more parameters on the same hardware, allows us to strongly scale RNN training to 128 GPUs, and allows us to efficiently explore end-to-end speech recognition models with over 100 layers. (A back-of-the-envelope sketch of the minibatch-4 arithmetic appears after the table.) |
| Researcher Affiliation | Industry | Gregory Diamos GREGDIAMOS@BAIDU.COM Shubho Sengupta SSENGUPTA@BAIDU.COM Bryan Catanzaro BCATANZARO@BAIDU.COM Mike Chrzanowski MIKECHRZANOWSKI@BAIDU.COM Adam Coates ADAMCOATES@BAIDU.COM Erich Elsen ERICHELSEN@BAIDU.COM Jesse Engel JENGEL@BAIDU.COM Awni Hannun AWNIHANNUN@BAIDU.COM Sanjeev Satheesh SANJEEVSATHEESH@BAIDU.COM Baidu Silicon Valley AI Lab, 1195 Bordeaux Drive, Sunnyvale, CA 94089, UNITED STATES |
| Pseudocode | No | No pseudocode or clearly labeled algorithm blocks were found in the paper. |
| Open Source Code | Yes | An open-source implementation of the Persistent RNN GPU kernels has been released: Diamos et al., Persistent RNNs, https://github.com/baidu-research/persistent-rnn. Accessed: 2016-05-23. |
| Open Datasets | No | The paper mentions using 'a dataset of 500 hours of audio' and an 'English speaker held out development set' which is an 'internal dataset containing 2048 utterances'. No specific public dataset names, links, DOIs, or citations to a public source are provided for these datasets. |
| Dataset Splits | No | The paper mentions using a 'held out development set' and reports evaluation on it, but it does not specify explicit train/validation/test split percentages, total sample counts for each split, or reference a predefined split with a citation that would allow reproduction of the data partitioning. |
| Hardware Specification | Yes | Our initial implementation sustains 2.8 TFLOP/s at a minibatch size of 4 on an NVIDIA Titan X GPU. Our cluster is composed of nodes with 8 GPUs and 2 CPUs. GPUs are connected locally via PCIe v3 using two 4-wide full bisection bandwidth PCIe switches, which are interconnected using the QPI bus between CPUs. Nodes are interconnected by Infiniband 12x QDR links to a full bisection bandwidth router. |
| Software Dependencies | No | The paper mentions software like 'Nervana Systems GEMM kernels', 'NVIDIA and Nervana Systems BLAS libraries', 'CUDA and Open CL development frameworks', and 'MPI', but does not provide specific version numbers for these software dependencies. |
| Experiment Setup | Yes | All models are trained for 20 epochs on the English dataset. We use stochastic gradient descent with Nesterov momentum (Sutskever et al., 2013) along with a minibatch size chosen from the range [64, 512] utterances. If the norm of the gradient exceeds the threshold of 400, it is rescaled to 400 (Pascanu et al., 2012). The learning rate is chosen from the range [1×10⁻⁵, 6×10⁻⁴] to yield the fastest convergence and annealed by a constant factor of 1.2 after each epoch. We use a momentum of 0.99 for all models. (A sketch of this recipe appears after the table.) |
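
The Research Type row quotes the paper's headline number of 2.8 TFLOP/s sustained at a minibatch of 4 on a Titan X. The minimal sketch below, assuming approximate Titan X (Maxwell) figures of roughly 336 GB/s DRAM bandwidth and about 6.1 TFLOP/s peak single precision plus an arbitrary layer width, illustrates why a conventional GEMM-based recurrent step at such a small minibatch is bandwidth-bound; these hardware numbers and the layer width are assumptions, not values taken from the paper.

```python
# A hedged back-of-the-envelope estimate (not from the paper): arithmetic
# intensity of one GEMM-based recurrent step at minibatch 4 when the weight
# matrix must be streamed from DRAM every timestep.

hidden = 1152          # assumed recurrent layer width (the ratio below does not depend on it)
minibatch = 4          # minibatch per GPU, as quoted in the table above
bytes_per_float = 4

# One recurrent step computes roughly W @ h for a hidden x hidden weight
# matrix: 2 * hidden^2 * minibatch FLOPs, while re-reading hidden^2 floats
# of weights from DRAM if they are not cached on-chip.
flops = 2 * hidden * hidden * minibatch
weight_bytes = hidden * hidden * bytes_per_float
arithmetic_intensity = flops / weight_bytes   # FLOPs per byte of weight traffic

peak_bandwidth = 336e9   # assumed Titan X DRAM bandwidth, bytes/s (approx.)
bandwidth_bound_flops = arithmetic_intensity * peak_bandwidth

print(f"arithmetic intensity ~ {arithmetic_intensity:.1f} FLOPs/byte")
print(f"bandwidth-bound ceiling ~ {bandwidth_bound_flops / 1e12:.2f} TFLOP/s")
# ~0.67 TFLOP/s ceiling vs. ~6.1 TFLOP/s assumed peak compute: weight traffic,
# not math, limits a streamed GEMM at this minibatch.
```

Under these assumed numbers the streaming ceiling sits well below the 2.8 TFLOP/s the paper reports, which is the gap that stashing the recurrent weights on-chip, rather than re-reading them from DRAM each timestep, is meant to close.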
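The Experiment Setup row can be read as a concrete training recipe. The sketch below is a minimal, hedged rendering of that recipe in Python with PyTorch (the paper does not name a framework); the model, data, loss, and the specific learning-rate value are synthetic placeholders, while the optimizer settings, clipping threshold, annealing factor, and epoch count follow the quoted text.

```python
# A minimal sketch, assuming PyTorch, of the quoted recipe: SGD with Nesterov
# momentum 0.99, gradient-norm clipping at 400, a learning rate from
# [1e-5, 6e-4] annealed by 1.2 after each of 20 epochs. The model, data, and
# loss are toy stand-ins, not the paper's speech pipeline.
import torch

torch.manual_seed(0)
model = torch.nn.RNN(input_size=32, hidden_size=64, batch_first=True)  # toy stand-in
head = torch.nn.Linear(64, 10)                                         # toy stand-in
params = list(model.parameters()) + list(head.parameters())

lr = 3e-4  # an example value from the quoted range [1e-5, 6e-4]
optimizer = torch.optim.SGD(params, lr=lr, momentum=0.99, nesterov=True)
loss_fn = torch.nn.CrossEntropyLoss()

for epoch in range(20):                         # "trained for 20 epochs"
    for _ in range(8):                          # a few synthetic minibatches
        x = torch.randn(64, 50, 32)             # batch of 64 "utterances", 50 frames each
        y = torch.randint(0, 10, (64,))
        optimizer.zero_grad()
        out, _ = model(x)
        loss = loss_fn(head(out[:, -1]), y)
        loss.backward()
        # If the gradient norm exceeds 400, rescale it to 400 (Pascanu et al., 2012).
        torch.nn.utils.clip_grad_norm_(params, max_norm=400.0)
        optimizer.step()
    # Anneal the learning rate by a constant factor of 1.2 after each epoch.
    lr /= 1.2
    for group in optimizer.param_groups:
        group["lr"] = lr
```

The persistent-kernel contribution itself sits below this level of abstraction, inside the GPU implementation of the recurrent layers, so it does not appear in the training recipe.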