Training Compressed Fully-Connected Networks with a Density-Diversity Penalty
Authors: Shengjie Wang, Haoran Cai, Jeff Bilmes, William Noble
ICLR 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On two separate tasks, computer vision and speech recognition, we demonstrate that the proposed density-diversity penalty significantly reduces the diversity and increases the sparsity of the models, while keeping the performance almost unchanged. (From the abstract; further detailed in Sections 4, 4.1, and 4.2, which cover the MNIST and TIMIT experiments and report performance metrics) |
| Researcher Affiliation | Academia | Shengjie Wang, Department of CSE, University of Washington (wangsj@cs.washington.edu); Haoran Cai, Department of Statistics, University of Washington (haoran@uw.edu); Jeff Bilmes, Department of EE, CSE, University of Washington (bilmes@uw.edu); William Noble, Department of GS, CSE, University of Washington (william-noble@u.washington.edu) |
| Pseudocode | Yes | Algorithm 1: Sorting Trick for Efficiently Calculating the Gradient of Density-Diversity Penalty on Weight Matrix Wj: DP(Wj) (a sketch of this sorting trick appears after the table) |
| Open Source Code | No | The paper contains no statement or link indicating that source code for the described methodology has been released. It mentions modifying the mxnet package but does not provide access to those modifications: 'For our implementation, we start with the mxnet (Chen et al., 2015a) package, which we modified by changing the weight updating code to include our density-diversity penalty.' |
| Open Datasets | Yes | We apply the density-diversity penalty (with p = 2 for now) to the fully-connected layers of the models on both the MNIST (computer vision) and TIMIT (speech recognition) datasets, and get significantly sparser and less diverse layer weights. (From Section 4) And: The MNIST dataset consists of hand-written digits, containing 60000 training data points and 10000 test data points. (From Section 4.1) And: The TIMIT dataset is for a speech recognition task. The dataset consists of a 462 speaker training set, a 50 speaker validation set, and a 24 speaker test set. (From Section 4.2) |
| Dataset Splits | Yes | The MNIST dataset consists of hand-written digits, containing 60000 training data points and 10000 test data points. We further sequester 10000 data points from the training data to be used as the validation set for parameter tuning. (From Section 4.1) And: The TIMIT dataset is for a speech recognition task. The dataset consists of a 462 speaker training set, a 50 speaker validation set, and a 24 speaker test set. (From Section 4.2) |
| Hardware Specification | No | No specific hardware details (e.g., CPU/GPU models, memory) used for running the experiments were provided in the paper. |
| Software Dependencies | No | The paper mentions 'mxnet (Chen et al., 2015a)' and 'Theano (Theano Development Team, 2016)' as software used, but does not provide specific version numbers for these or any other ancillary software components. For example: 'For our implementation, we start with the mxnet (Chen et al., 2015a) package...' and 'neural network toolkits such as Theano (Theano Development Team, 2016) and mxnet (Chen et al., 2015a).' |
| Experiment Setup | Yes | For optimization, we use SGD with momentum. (From Section 4.1) And: We choose ReLU as the activation function and AdaGrad (Duchi et al., 2011) for optimization. (From Section 4.2) Additionally: we randomly initialize every weight matrix with 10% sparsity (i.e., 90% of weight matrix entries are non-zero). (From Section 3.2) And: for every mini-batch, we only apply the density-diversity penalty with a certain small probability (e.g. 1% to 5%). (From Section 3.1) And: we truncate the weight matrix entries to have a limited number of decimal digits (e.g. 6). (From Section 3.1) And: In practice, we train each phase for 5 to 10 epochs. (From Section 3.4) Minimal sketches of these heuristics appear after the table. |
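
The pseudocode row above refers to the paper's Algorithm 1, a sorting trick for computing the gradient of the diversity term, i.e. the sum of pairwise absolute differences between all weight entries, in O(N log N) rather than O(N²). The sketch below illustrates the idea only; the function name, the NumPy implementation, and the handling of ties are our assumptions, not the authors' code.

```python
import numpy as np

def diversity_subgradient(W):
    """Subgradient of sum_{i<k} |w_i - w_k| over all entries of W.

    Sorting the flattened weights reduces the cost from O(N^2) pairwise
    comparisons to O(N log N): an entry of ascending rank r (1-indexed)
    is larger than r-1 entries and smaller than N-r entries, so its
    subgradient is (r-1) - (N-r) = 2r - N - 1.  Ties are not treated
    specially in this sketch.
    """
    flat = W.ravel()
    N = flat.size
    order = np.argsort(flat, kind="stable")   # indices of entries in ascending order
    ranks = np.empty(N, dtype=np.int64)
    ranks[order] = np.arange(1, N + 1)        # 1-indexed rank of each entry
    grad = (2 * ranks - N - 1).astype(W.dtype)
    return grad.reshape(W.shape)
```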
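
The experiment-setup row quotes several training-time heuristics: sparse random initialization, applying the density-diversity penalty on only a small fraction of mini-batches, and truncating weights to a limited number of decimal digits. The sketch below strings these together in plain NumPy under our own assumptions: the function names, learning rate, penalty strength, seed, and the use of rounding as truncation are ours, and plain SGD stands in for the paper's SGD-with-momentum (MNIST) and AdaGrad (TIMIT) optimizers.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed is arbitrary

def sparse_init(shape, sparsity=0.10, scale=0.01):
    """Random initialization with roughly 10% of entries zeroed (90% non-zero)."""
    W = rng.normal(0.0, scale, size=shape)
    W[rng.random(shape) < sparsity] = 0.0
    return W

def sgd_step(W, grad_loss, penalty_grad_fn, lr=0.1,
             penalty_prob=0.05, penalty_strength=1e-5, digits=6):
    """One plain-SGD update with the two quoted tricks layered on top:
    the density-diversity penalty gradient is added only with a small
    probability (e.g. 1% to 5% of mini-batches), and the updated weights
    are truncated to a limited number of decimal digits (e.g. 6)."""
    g = grad_loss
    if rng.random() < penalty_prob:
        g = g + penalty_strength * penalty_grad_fn(W)
    return np.round(W - lr * g, decimals=digits)
```

In this sketch, `penalty_grad_fn` would be something like `diversity_subgradient` above, possibly combined with a gradient for the density (p-norm) term; the exact composition of the penalty follows the paper, not this page.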