LEARNING TO SHARE: SIMULTANEOUS PARAMETER TYING AND SPARSIFICATION IN DEEP LEARNING

Authors: Dejiao Zhang, Haozhu Wang, Mario Figueiredo, Laura Balzano

ICLR 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate this approach on several benchmark datasets, showing that it can dramatically compress the network with slight or even no loss on generalization accuracy.
Researcher Affiliation | Academia | Dejiao Zhang, University of Michigan, Ann Arbor, USA, dejiao@umich.edu; Haozhu Wang, University of Michigan, Ann Arbor, USA, hzwang@umich.edu; Mário A. T. Figueiredo, Instituto de Telecomunicações and Instituto Superior Técnico, University of Lisbon, Portugal, mario.figueiredo@lx.it.pt; Laura Balzano, University of Michigan, Ann Arbor, USA, girasole@umich.edu
Pseudocode | Yes | The training method is summarized in Algorithm 1. Algorithm 2: Prox-GrOWL (Bogdan et al., 2015) for solving prox_{η,Ω_λ}(z). Algorithm 3: Affinity Propagation (Frey & Dueck, 2007). A sketch of the GrOWL proximal step is given below the table.
Open Source Code | No | The paper does not provide a direct link to the source code for the methodology described, nor does it explicitly state that the code is being released or available in supplementary materials.
Open Datasets | Yes | We assess the performance of the proposed method on two benchmark datasets: MNIST and CIFAR-10. The MNIST dataset contains centered images of handwritten digits (0-9), of size 28 × 28 (784) pixels.
Dataset Splits | No | The paper mentions '10000 training and 1000 testing examples' for synthetic data, and uses MNIST and CIFAR-10, but it does not specify explicit train/validation/test splits, percentages, or validation set sizes for any of the datasets used in the experiments.
Hardware Specification | No | The paper does not specify any hardware details (e.g., GPU/CPU models, memory) used for running the experiments.
Software Dependencies | No | We implement all models using TensorFlow (Abadi et al., 2016). In this paper, we use the built-in affinity propagation method of the scikit-learn package (Buitinck et al., 2013). The paper mentions software such as TensorFlow and scikit-learn but does not provide specific version numbers for these dependencies. A clustering/tying sketch that uses scikit-learn's affinity propagation is given below the table.
Experiment Setup | Yes | For the MNIST experiment: the network is trained for 300 epochs and then retrained for an additional 100 epochs, both with momentum. The initial learning rate is set to 0.001, for both training and retraining, and is reduced by a factor of 0.96 every 10 epochs. We set p = 0.5, and Λ1, Λ2 are selected by grid search. For the CIFAR-10 experiment: we first train the network under different regularizers for 150 epochs, then retrain it for another 50 epochs, using the learning rate decay scheme described by He et al. (2016): the initial rates for the training and retraining phases are set to 0.01 and 0.001, respectively; the learning rate is multiplied by 0.1 every 60 epochs of the training phase, and every 20 epochs of the retraining phase. For GrOWL (+ℓ2), we set p = 0.1n (see Eq. (9)) for all layers, where n denotes the number of rows of the (reshaped) weight matrices of each layer. Both learning-rate schedules are sketched as code below the table.
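
Below is a minimal NumPy sketch of the GrOWL proximal step that Algorithm 2 (Prox-GrOWL) computes: the row 2-norms of a (reshaped) weight matrix are sorted, shrunk by the ordered weights, projected onto the non-increasing non-negative cone via pool-adjacent-violators, and each row is rescaled accordingly. The function names prox_growl and _pava_nonincreasing and the step-size argument eta are illustrative; this is the standard OWL/SLOPE prox recipe applied to row norms, not the authors' code.

```python
import numpy as np


def _pava_nonincreasing(y):
    """Pool-adjacent-violators: project y onto the set of non-increasing sequences."""
    vals, wts = [], []
    for v in y.astype(float):
        w = 1.0
        # Merge blocks while the non-increasing order is violated.
        while vals and vals[-1] < v:
            v_prev, w_prev = vals.pop(), wts.pop()
            v = (w * v + w_prev * v_prev) / (w + w_prev)
            w += w_prev
        vals.append(v)
        wts.append(w)
    return np.repeat(vals, np.asarray(wts, dtype=int))


def prox_growl(W, lam, eta=1.0):
    """One GrOWL proximal step on the rows of W.

    lam is a non-negative, non-increasing vector with one entry per row of W
    (lam[0] >= lam[1] >= ... >= 0); eta is the step size.
    """
    norms = np.linalg.norm(W, axis=1)             # group = row of the weight matrix
    order = np.argsort(-norms)                    # sort row norms in decreasing order
    shifted = norms[order] - eta * np.asarray(lam, dtype=float)
    proxed = np.maximum(_pava_nonincreasing(shifted), 0.0)
    new_norms = np.empty_like(norms)
    new_norms[order] = proxed                     # undo the sort
    scale = np.divide(new_norms, norms, out=np.zeros_like(norms), where=norms > 0)
    return W * scale[:, None]                     # rows with zero proximal norm are pruned
```

Rows whose proximal norm reaches zero are sparsified away, while rows pooled to a common norm by the projection become candidates for the tying step described next.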
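
The tying step can be illustrated with scikit-learn's built-in AffinityPropagation (Frey & Dueck, 2007). The sketch below is not the authors' pipeline: the random weight matrix, the similarity (scikit-learn's default negative squared Euclidean distance), the damping value, and the choice to tie each cluster to its mean row are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

# Hypothetical trained weight matrix: one row per input neuron/feature.
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 32))

# Drop rows that the GrOWL regularizer has zeroed out; only surviving rows are clustered.
nonzero = np.linalg.norm(W, axis=1) > 1e-8
rows = W[nonzero]

# scikit-learn's built-in affinity propagation; its default similarity is the
# negative squared Euclidean distance between rows.
ap = AffinityPropagation(damping=0.9, random_state=0).fit(rows)

# Tie parameters within each cluster by replacing every member row with the
# cluster mean (an illustrative choice; the exemplar row would be another option).
tied = rows.copy()
for k in np.unique(ap.labels_):
    members = ap.labels_ == k
    tied[members] = rows[members].mean(axis=0)

W_tied = W.copy()
W_tied[nonzero] = tied
print(f"{rows.shape[0]} surviving rows tied into {np.unique(ap.labels_).size} clusters")
```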
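
The two learning-rate schedules quoted in the Experiment Setup row can be written as plain Python functions. This reads "reduced by a factor of 0.96" as multiplication by 0.96 per 10-epoch step; the function names and the epoch-indexed interface are illustrative.

```python
def mnist_lr(epoch):
    # MNIST: start at 0.001 and multiply by 0.96 every 10 epochs (same schedule
    # for the 300-epoch training and the 100-epoch retraining phases).
    return 1e-3 * 0.96 ** (epoch // 10)


def cifar10_lr(epoch, retraining=False):
    # CIFAR-10 (He et al., 2016 style step decay): training starts at 0.01 and is
    # multiplied by 0.1 every 60 epochs; retraining starts at 0.001 and is
    # multiplied by 0.1 every 20 epochs.
    base, period = (1e-3, 20) if retraining else (1e-2, 60)
    return base * 0.1 ** (epoch // period)
```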