DSD: Dense-Sparse-Dense Training for Deep Neural Networks
Authors: Song Han, Jeff Pool, Sharan Narang, Huizi Mao, Enhao Gong, Shijian Tang, Erich Elsen, Peter Vajda, Manohar Paluri, John Tran, Bryan Catanzaro, William J. Dally
ICLR 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that DSD training can improve the performance for a wide range of CNNs, RNNs and LSTMs on the tasks of image classification, caption generation and speech recognition. On ImageNet, DSD improved the Top-1 accuracy of GoogLeNet by 1.1%, VGG-16 by 4.3%, ResNet-18 by 1.2% and ResNet-50 by 1.1%, respectively. On the WSJ'93 dataset, DSD improved DeepSpeech and DeepSpeech2 WER by 2.0% and 1.1%. On the Flickr-8K dataset, DSD improved the NeuralTalk BLEU score by over 1.7. |
| Researcher Affiliation | Collaboration | Song Han, Huizi Mao, Enhao Gong, Shijian Tang, William J. Dally (Stanford University, {songhan,huizi,enhaog,sjtang,dally}@stanford.edu); Jeff Pool, John Tran, Bryan Catanzaro (NVIDIA, {jpool,johntran,bcatanzaro}@nvidia.com); Sharan Narang, Erich Elsen (Baidu Research, sharan@baidu.com); Peter Vajda, Manohar Paluri (Facebook, {vajdap,mano}@fb.com) |
| Pseudocode | Yes | Algorithm 1: Workflow of DSD training (a hedged code sketch of this workflow appears after this table) |
| Open Source Code | Yes | DSD models are available to download at https://songhan.github.io/DSD. |
| Open Datasets | Yes | On ImageNet, DSD improved... On the WSJ'93 dataset, DSD improved... On the Flickr-8K dataset, DSD improved... CIFAR-10 is a smaller image recognition benchmark with 50,000 32x32 color images for training and 10,000 for testing. ... (Facebook, 2016). |
| Dataset Splits | Yes | The training dataset used for this model is the Wall Street Journal (WSJ), which contains 81 hours of speech. The validation set consists of 1 hour of speech. The test sets are from WSJ'92 and WSJ'93 and contain 1 hour of speech combined. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU models, CPU types, memory specifications) used to run the experiments. |
| Software Dependencies | No | The paper mentions software like 'Caffe Model Zoo' and 'Torch' for obtaining baseline models, but it does not specify any version numbers for these or other software dependencies. |
| Experiment Setup | Yes | For the convolutional networks, we do not prune the first layer during the sparse phase... The sparsity is the same for all the other layers, including convolutional and fully-connected layers. We do not change any other training hyper-parameters, and the initial learning rate at each stage is decayed the same as conventional training. The epochs are decided by when the loss converges. (This setup is mirrored in the usage sketch after this table.) |
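
Algorithm 1 in the paper alternates three phases: conventional dense training, magnitude pruning followed by sparse retraining, and a final dense retraining in which the pruned weights are restored (re-entering at zero). The following is a minimal sketch of that workflow, assuming PyTorch; the `train_one_epoch` helper, epoch counts, learning-rate drop, and sparsity ratio are illustrative assumptions rather than the authors' released code.

```python
import torch
import torch.nn as nn


def make_masks(model, sparsity, skip_layers):
    """Build per-layer binary masks that keep the largest-magnitude weights
    of convolutional and fully-connected layers."""
    masks = {}
    for mod_name, mod in model.named_modules():
        if not isinstance(mod, (nn.Conv2d, nn.Linear)):
            continue
        name = f"{mod_name}.weight"
        if name in skip_layers:
            continue
        w = mod.weight.detach().abs()
        k = max(1, int(sparsity * w.numel()))      # number of smallest weights to prune
        threshold = w.flatten().kthvalue(k).values
        masks[name] = (w > threshold).float()
    return masks


def apply_masks(model, masks):
    """Zero out pruned weights so the sparse phase keeps its connectivity."""
    with torch.no_grad():
        for name, param in model.named_parameters():
            if name in masks:
                param.mul_(masks[name])


def dsd_train(model, train_one_epoch, epochs=(30, 30, 30), sparsity=0.3,
              skip_layers=frozenset()):
    """Dense -> sparse -> dense training loop (hedged sketch, not the paper's code)."""
    opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

    # Phase 1: conventional dense training.
    for _ in range(epochs[0]):
        train_one_epoch(model, opt)

    # Phase 2: prune the smallest-magnitude weights, then retrain while
    # re-imposing the sparsity pattern (the paper keeps the mask fixed
    # throughout this phase).
    masks = make_masks(model, sparsity, skip_layers)
    apply_masks(model, masks)
    for _ in range(epochs[1]):
        train_one_epoch(model, opt)
        apply_masks(model, masks)

    # Phase 3: restore the pruned weights (they re-enter at zero) and retrain
    # densely, typically with a reduced learning rate.
    for group in opt.param_groups:
        group["lr"] *= 0.1
    for _ in range(epochs[2]):
        train_one_epoch(model, opt)
    return model
```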
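
The Experiment Setup row maps directly onto the sketch's parameters: the first convolutional layer is excluded from pruning and a single sparsity ratio is shared by all other convolutional and fully-connected layers. A hypothetical call (the `build_vgg16` helper, the layer name, and the 0.3 ratio are assumptions for illustration):

```python
model = build_vgg16()   # assumed helper returning a dense baseline network
dsd_train(
    model,
    train_one_epoch,                      # assumed data/loss loop from the sketch above
    sparsity=0.3,                         # example ratio; the paper tunes this per network
    skip_layers={"features.0.weight"},    # first conv layer stays dense, per the setup row
)
```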