DSD: Dense-Sparse-Dense Training for Deep Neural Networks
Authors: Song Han, Jeff Pool, Sharan Narang, Huizi Mao, Enhao Gong, Shijian Tang, Erich Elsen, Peter Vajda, Manohar Paluri, John Tran, Bryan Catanzaro, William J. Dally
ICLR 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that DSD training can improve the performance for a wide range of CNNs, RNNs and LSTMs on the tasks of image classification, caption generation and speech recognition. On ImageNet, DSD improved the Top-1 accuracy of GoogLeNet by 1.1%, VGG-16 by 4.3%, ResNet-18 by 1.2% and ResNet-50 by 1.1%, respectively. On the WSJ'93 dataset, DSD improved DeepSpeech and DeepSpeech2 WER by 2.0% and 1.1%. On the Flickr-8K dataset, DSD improved the NeuralTalk BLEU score by over 1.7. |
| Researcher Affiliation | Collaboration | Song Han, Huizi Mao, Enhao Gong, Shijian Tang, William J. Dally (Stanford University, {songhan,huizi,enhaog,sjtang,dally}@stanford.edu); Jeff Pool, John Tran, Bryan Catanzaro (NVIDIA, {jpool,johntran,bcatanzaro}@nvidia.com); Sharan Narang, Erich Elsen (Baidu Research, sharan@baidu.com); Peter Vajda, Manohar Paluri (Facebook, {vajdap,mano}@fb.com) |
| Pseudocode | Yes | Algorithm 1: Workflow of DSD training (a hedged code sketch of this workflow appears after this table) |
| Open Source Code | Yes | DSD models are available to download at https://songhan.github.io/DSD. |
| Open Datasets | Yes | On ImageNet, DSD improved... On the WSJ'93 dataset, DSD improved... On the Flickr-8K dataset, DSD improved... CIFAR-10 is a smaller image recognition benchmark with 50,000 32x32 color images for training and 10,000 for testing. ... (Facebook, 2016). |
| Dataset Splits | Yes | The training dataset used for this model is the Wall Street Journal (WSJ), which contains 81 hours of speech. The validation set consists of 1 hour of speech. The test sets are from WSJ'92 and WSJ'93 and contain 1 hour of speech combined. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU models, CPU types, memory specifications) used to run the experiments. |
| Software Dependencies | No | The paper mentions software like 'Caffe Model Zoo' and 'Torch' for obtaining baseline models, but it does not specify any version numbers for these or other software dependencies. |
| Experiment Setup | Yes | For the convolutional networks, we do not prune the first layer during the sparse phase... The sparsity is the same for all the other layers, including convolutional and fully-connected layers. We do not change any other training hyper-parameters, and the initial learning rate at each stage is decayed the same as conventional training. The epochs are decided by when the loss converges. (This setup is mirrored in the usage sketch after this table.) |
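
Algorithm 1 in the paper alternates three phases: conventional dense training, magnitude pruning followed by sparse retraining, and a final dense retraining in which the pruned weights are restored (re-entering at zero). The following is a minimal sketch of that workflow, assuming PyTorch; the `train_one_epoch` helper, epoch counts, learning-rate drop, and sparsity ratio are illustrative assumptions rather than the authors' released code.

```python
import torch
import torch.nn as nn


def make_masks(model, sparsity, skip_layers):
    """Build per-layer binary masks that keep the largest-magnitude weights
    of convolutional and fully-connected layers."""
    masks = {}
    for mod_name, mod in model.named_modules():
        if not isinstance(mod, (nn.Conv2d, nn.Linear)):
            continue
        name = f"{mod_name}.weight"
        if name in skip_layers:
            continue
        w = mod.weight.detach().abs()
        k = max(1, int(sparsity * w.numel()))      # number of smallest weights to prune
        threshold = w.flatten().kthvalue(k).values
        masks[name] = (w > threshold).float()
    return masks


def apply_masks(model, masks):
    """Zero out pruned weights so the sparse phase keeps its connectivity."""
    with torch.no_grad():
        for name, param in model.named_parameters():
            if name in masks:
                param.mul_(masks[name])


def dsd_train(model, train_one_epoch, epochs=(30, 30, 30), sparsity=0.3,
              skip_layers=frozenset()):
    """Dense -> sparse -> dense training loop (hedged sketch, not the paper's code)."""
    opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

    # Phase 1: conventional dense training.
    for _ in range(epochs[0]):
        train_one_epoch(model, opt)

    # Phase 2: prune the smallest-magnitude weights, then retrain while
    # re-imposing the sparsity pattern (the paper keeps the mask fixed
    # throughout this phase).
    masks = make_masks(model, sparsity, skip_layers)
    apply_masks(model, masks)
    for _ in range(epochs[1]):
        train_one_epoch(model, opt)
        apply_masks(model, masks)

    # Phase 3: restore the pruned weights (they re-enter at zero) and retrain
    # densely, typically with a reduced learning rate.
    for group in opt.param_groups:
        group["lr"] *= 0.1
    for _ in range(epochs[2]):
        train_one_epoch(model, opt)
    return model
```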
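
The Experiment Setup row maps directly onto the sketch's parameters: the first convolutional layer is excluded from pruning and a single sparsity ratio is shared by all other convolutional and fully-connected layers. A hypothetical call (the `build_vgg16` helper, the layer name, and the 0.3 ratio are assumptions for illustration):

```python
model = build_vgg16()   # assumed helper returning a dense baseline network
dsd_train(
    model,
    train_one_epoch,                      # assumed data/loss loop from the sketch above
    sparsity=0.3,                         # example ratio; the paper tunes this per network
    skip_layers={"features.0.weight"},    # first conv layer stays dense, per the setup row
)
```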