Persistent RNNs: Stashing Recurrent Weights On-Chip
Authors: Greg Diamos, Shubho Sengupta, Bryan Catanzaro, Mike Chrzanowski, Adam Coates, Erich Elsen, Jesse Engel, Awni Hannun, Sanjeev Satheesh
ICML 2016 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our initial implementation sustains 2.8 TFLOP/s at a minibatch size of 4 on an NVIDIA Titan X GPU. This provides a 16x reduction in activation memory footprint, enables model training with 12x more parameters on the same hardware, allows us to strongly scale RNN training to 128 GPUs, and allows us to efficiently explore end-to-end speech recognition models with over 100 layers. (A back-of-the-envelope sketch of the minibatch-4 arithmetic appears after the table.) |
| Researcher Affiliation | Industry | Gregory Diamos GREGDIAMOS@BAIDU.COM Shubho Sengupta SSENGUPTA@BAIDU.COM Bryan Catanzaro BCATANZARO@BAIDU.COM Mike Chrzanowski MIKECHRZANOWSKI@BAIDU.COM Adam Coates ADAMCOATES@BAIDU.COM Erich Elsen ERICHELSEN@BAIDU.COM Jesse Engel JENGEL@BAIDU.COM Awni Hannun AWNIHANNUN@BAIDU.COM Sanjeev Satheesh SANJEEVSATHEESH@BAIDU.COM Baidu Silicon Valley AI Lab, 1195 Bordeaux Drive, Sunnyvale, CA 94089, UNITED STATES |
| Pseudocode | No | No pseudocode or clearly labeled algorithm blocks were found in the paper. |
| Open Source Code | Yes | An open-source implementation of the Persistent RNN GPU kernels has been released: Diamos et al., Persistent RNNs, https://github.com/baidu-research/persistent-rnn. Accessed: 2016-05-23. |
| Open Datasets | No | The paper mentions using 'a dataset of 500 hours of audio' and an 'English speaker held out development set' which is an 'internal dataset containing 2048 utterances'. No specific public dataset names, links, DOIs, or citations to a public source are provided for these datasets. |
| Dataset Splits | No | The paper mentions using a 'held out development set' and reports evaluation on it, but it does not specify explicit train/validation/test split percentages, total sample counts for each split, or reference a predefined split with a citation that would allow reproduction of the data partitioning. |
| Hardware Specification | Yes | Our initial implementation sustains 2.8 TFLOP/s at a minibatch size of 4 on an NVIDIA Titan X GPU. Our cluster is composed of nodes with 8 GPUs and 2 CPUs. GPUs are connected locally via PCIe v3 using two 4-wide full bisection bandwidth PCIe switches, which are interconnected using the QPI bus between CPUs. Nodes are interconnected by Infiniband 12x QDR links to a full bisection bandwidth router. |
| Software Dependencies | No | The paper mentions software like 'Nervana Systems GEMM kernels', 'NVIDIA and Nervana Systems BLAS libraries', 'CUDA and Open CL development frameworks', and 'MPI', but does not provide specific version numbers for these software dependencies. |
| Experiment Setup | Yes | All models are trained for 20 epochs on the English dataset. We use stochastic gradient descent with Nesterov momentum (Sutskever et al., 2013) along with a minibatch size chosen from the range [64, 512] utterances. If the norm of the gradient exceeds the threshold of 400, it is rescaled to 400 (Pascanu et al., 2012). The learning rate is chosen from the range [1×10⁻⁵, 6×10⁻⁴] to yield the fastest convergence and annealed by a constant factor of 1.2 after each epoch. We use a momentum of 0.99 for all models. (A sketch of this recipe appears after the table.) |
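
The Research Type row quotes the paper's headline number of 2.8 TFLOP/s sustained at a minibatch of 4 on a Titan X. The minimal sketch below, assuming approximate Titan X (Maxwell) figures of roughly 336 GB/s DRAM bandwidth and about 6.1 TFLOP/s peak single precision plus an arbitrary layer width, illustrates why a conventional GEMM-based recurrent step at such a small minibatch is bandwidth-bound; these hardware numbers and the layer width are assumptions, not values taken from the paper.

```python
# A hedged back-of-the-envelope estimate (not from the paper): arithmetic
# intensity of one GEMM-based recurrent step at minibatch 4 when the weight
# matrix must be streamed from DRAM every timestep.

hidden = 1152          # assumed recurrent layer width (the ratio below does not depend on it)
minibatch = 4          # minibatch per GPU, as quoted in the table above
bytes_per_float = 4

# One recurrent step computes roughly W @ h for a hidden x hidden weight
# matrix: 2 * hidden^2 * minibatch FLOPs, while re-reading hidden^2 floats
# of weights from DRAM if they are not cached on-chip.
flops = 2 * hidden * hidden * minibatch
weight_bytes = hidden * hidden * bytes_per_float
arithmetic_intensity = flops / weight_bytes   # FLOPs per byte of weight traffic

peak_bandwidth = 336e9   # assumed Titan X DRAM bandwidth, bytes/s (approx.)
bandwidth_bound_flops = arithmetic_intensity * peak_bandwidth

print(f"arithmetic intensity ~ {arithmetic_intensity:.1f} FLOPs/byte")
print(f"bandwidth-bound ceiling ~ {bandwidth_bound_flops / 1e12:.2f} TFLOP/s")
# ~0.67 TFLOP/s ceiling vs. ~6.1 TFLOP/s assumed peak compute: weight traffic,
# not math, limits a streamed GEMM at this minibatch.
```

Under these assumed numbers the streaming ceiling sits well below the 2.8 TFLOP/s the paper reports, which is the gap that stashing the recurrent weights on-chip, rather than re-reading them from DRAM each timestep, is meant to close.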
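The Experiment Setup row can be read as a concrete training recipe. The sketch below is a minimal, hedged rendering of that recipe in Python with PyTorch (the paper does not name a framework); the model, data, loss, and the specific learning-rate value are synthetic placeholders, while the optimizer settings, clipping threshold, annealing factor, and epoch count follow the quoted text.

```python
# A minimal sketch, assuming PyTorch, of the quoted recipe: SGD with Nesterov
# momentum 0.99, gradient-norm clipping at 400, a learning rate from
# [1e-5, 6e-4] annealed by 1.2 after each of 20 epochs. The model, data, and
# loss are toy stand-ins, not the paper's speech pipeline.
import torch

torch.manual_seed(0)
model = torch.nn.RNN(input_size=32, hidden_size=64, batch_first=True)  # toy stand-in
head = torch.nn.Linear(64, 10)                                         # toy stand-in
params = list(model.parameters()) + list(head.parameters())

lr = 3e-4  # an example value from the quoted range [1e-5, 6e-4]
optimizer = torch.optim.SGD(params, lr=lr, momentum=0.99, nesterov=True)
loss_fn = torch.nn.CrossEntropyLoss()

for epoch in range(20):                         # "trained for 20 epochs"
    for _ in range(8):                          # a few synthetic minibatches
        x = torch.randn(64, 50, 32)             # batch of 64 "utterances", 50 frames each
        y = torch.randint(0, 10, (64,))
        optimizer.zero_grad()
        out, _ = model(x)
        loss = loss_fn(head(out[:, -1]), y)
        loss.backward()
        # If the gradient norm exceeds 400, rescale it to 400 (Pascanu et al., 2012).
        torch.nn.utils.clip_grad_norm_(params, max_norm=400.0)
        optimizer.step()
    # Anneal the learning rate by a constant factor of 1.2 after each epoch.
    lr /= 1.2
    for group in optimizer.param_groups:
        group["lr"] = lr
```

The persistent-kernel contribution itself sits below this level of abstraction, inside the GPU implementation of the recurrent layers, so it does not appear in the training recipe.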