Training Compressed Fully-Connected Networks with a Density-Diversity Penalty
Authors: Shengjie Wang, Haoran Cai, Jeff Bilmes, William Noble
ICLR 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On two separate tasks, computer vision and speech recognition, we demonstrate that the proposed density-diversity penalty significantly reduces the diversity and increases the sparsity of the models, while keeping the performance almost unchanged. (From the abstract; further detailed in Sections 4, 4.1, and 4.2, which cover the MNIST and TIMIT experiments and report performance metrics) |
| Researcher Affiliation | Academia | Shengjie Wang, Department of CSE, University of Washington (wangsj@cs.washington.edu); Haoran Cai, Department of Statistics, University of Washington (haoran@uw.edu); Jeff Bilmes, Department of EE, CSE, University of Washington (bilmes@uw.edu); William Noble, Department of GS, CSE, University of Washington (william-noble@u.washington.edu) |
| Pseudocode | Yes | Algorithm 1: Sorting Trick for Efficiently Calculating the Gradient of Density-Diversity Penalty on Weight Matrix Wj: DP(Wj) (a sketch of this sorting trick appears after the table) |
| Open Source Code | No | The paper contains no statement or link indicating that source code for the described methodology has been released. It mentions modifying the mxnet package but does not provide access to those modifications: 'For our implementation, we start with the mxnet (Chen et al., 2015a) package, which we modified by changing the weight updating code to include our density-diversity penalty.' |
| Open Datasets | Yes | We apply the density-diversity penalty (with p = 2 for now) to the fully-connected layers of the models on both the MNIST (computer vision) and TIMIT (speech recognition) datasets, and get significantly sparser and less diverse layer weights. (From Section 4) And: The MNIST dataset consists of hand-written digits, containing 60000 training data points and 10000 test data points. (From Section 4.1) And: The TIMIT dataset is for a speech recognition task. The dataset consists of a 462 speaker training set, a 50 speaker validation set, and a 24 speaker test set. (From Section 4.2) |
| Dataset Splits | Yes | The MNIST dataset consists of hand-written digits, containing 60000 training data points and 10000 test data points. We further sequester 10000 data points from the training data to be used as the validation set for parameter tuning. (From Section 4.1) And: The TIMIT dataset is for a speech recognition task. The dataset consists of a 462 speaker training set, a 50 speaker validation set, and a 24 speaker test set. (From Section 4.2) |
| Hardware Specification | No | No specific hardware details (e.g., CPU/GPU models, memory) used for running the experiments were provided in the paper. |
| Software Dependencies | No | The paper mentions 'mxnet (Chen et al., 2015a)' and 'Theano (Theano Development Team, 2016)' as software used, but does not provide specific version numbers for these or any other ancillary software components. For example: 'For our implementation, we start with the mxnet (Chen et al., 2015a) package...' and 'neural network toolkits such as Theano (Theano Development Team, 2016) and mxnet (Chen et al., 2015a).' |
| Experiment Setup | Yes | For optimization, we use SGD with momentum. (From Section 4.1) And: We choose ReLU as the activation function and AdaGrad (Duchi et al., 2011) for optimization. (From Section 4.2) Additionally: we randomly initialize every weight matrix with 10% sparsity (i.e., 90% of weight matrix entries are non-zero). (From Section 3.2) And: for every mini-batch, we only apply the density-diversity penalty with a certain small probability (e.g. 1% to 5%). (From Section 3.1) And: we truncate the weight matrix entries to have a limited number of decimal digits (e.g. 6). (From Section 3.1) And: In practice, we train each phase for 5 to 10 epochs. (From Section 3.4) Minimal sketches of these heuristics appear after the table. |
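
The pseudocode row above refers to the paper's Algorithm 1, a sorting trick for computing the gradient of the diversity term, i.e. the sum of pairwise absolute differences between all weight entries, in O(N log N) rather than O(N²). The sketch below illustrates the idea only; the function name, the NumPy implementation, and the handling of ties are our assumptions, not the authors' code.

```python
import numpy as np

def diversity_subgradient(W):
    """Subgradient of sum_{i<k} |w_i - w_k| over all entries of W.

    Sorting the flattened weights reduces the cost from O(N^2) pairwise
    comparisons to O(N log N): an entry of ascending rank r (1-indexed)
    is larger than r-1 entries and smaller than N-r entries, so its
    subgradient is (r-1) - (N-r) = 2r - N - 1.  Ties are not treated
    specially in this sketch.
    """
    flat = W.ravel()
    N = flat.size
    order = np.argsort(flat, kind="stable")   # indices of entries in ascending order
    ranks = np.empty(N, dtype=np.int64)
    ranks[order] = np.arange(1, N + 1)        # 1-indexed rank of each entry
    grad = (2 * ranks - N - 1).astype(W.dtype)
    return grad.reshape(W.shape)
```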
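
The experiment-setup row quotes several training-time heuristics: sparse random initialization, applying the density-diversity penalty on only a small fraction of mini-batches, and truncating weights to a limited number of decimal digits. The sketch below strings these together in plain NumPy under our own assumptions: the function names, learning rate, penalty strength, seed, and the use of rounding as truncation are ours, and plain SGD stands in for the paper's SGD-with-momentum (MNIST) and AdaGrad (TIMIT) optimizers.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed is arbitrary

def sparse_init(shape, sparsity=0.10, scale=0.01):
    """Random initialization with roughly 10% of entries zeroed (90% non-zero)."""
    W = rng.normal(0.0, scale, size=shape)
    W[rng.random(shape) < sparsity] = 0.0
    return W

def sgd_step(W, grad_loss, penalty_grad_fn, lr=0.1,
             penalty_prob=0.05, penalty_strength=1e-5, digits=6):
    """One plain-SGD update with the two quoted tricks layered on top:
    the density-diversity penalty gradient is added only with a small
    probability (e.g. 1% to 5% of mini-batches), and the updated weights
    are truncated to a limited number of decimal digits (e.g. 6)."""
    g = grad_loss
    if rng.random() < penalty_prob:
        g = g + penalty_strength * penalty_grad_fn(W)
    return np.round(W - lr * g, decimals=digits)
```

In this sketch, `penalty_grad_fn` would be something like `diversity_subgradient` above, possibly combined with a gradient for the density (p-norm) term; the exact composition of the penalty follows the paper, not this page.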