Soft Weight-Sharing for Neural Network Compression

Authors: Karen Ullrich, Edward Meeds, Max Welling

ICLR 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We test our compression procedure on two neural network models used in previous work we compare against in our experiments: (a) LeNet-300-100, an MNIST model... (b) LeNet-5-Caffe, a modified version of the LeNet-5 MNIST model... (c) ResNets... (Section 6, EXPERIMENTS.) A minimal definition of LeNet-300-100 is sketched below the table.
Researcher Affiliation | Academia | Karen Ullrich, University of Amsterdam, karen.ullrich@uva.nl; Edward Meeds, University of Amsterdam, tmeeds@gmail.com; Max Welling, University of Amsterdam and Canadian Institute for Advanced Research (CIFAR), welling.max@gmail.com
Pseudocode | Yes | A summary can be found in Algorithm 1 ("Soft weight-sharing for compression, our proposed algorithm for neural network model compression"). It is divided into two main steps: network re-training and post-processing. A minimal sketch of both steps is given below the table.
Open Source Code | Yes | ACKNOWLEDGEMENTS: We would like to thank Louis Smit, Christos Louizos, Thomas Kipf, Rianne van den Berg and Peter O'Connor for helpful discussions on the paper and the public code: https://github.com/KarenUllrich/Tutorial-SoftWeightSharingForNNCompression
Open Datasets | Yes | (a) LeNet-300-100, an MNIST model described in LeCun et al. (1998). ... (b) LeNet-5-Caffe, a modified version of the LeNet-5 MNIST model in LeCun et al. (1998). ... for CIFAR-10 and CIFAR-100 respectively.
Dataset Splits | No | The paper mentions using standard datasets like MNIST and CIFAR-10 but does not explicitly provide details on how the training, validation, and test sets were split (e.g., percentages, sample counts, or specific split files).
Hardware Specification | No | The paper does not explicitly describe the hardware used for its experiments, such as specific GPU models, CPU models, or cloud computing instance types.
Software Dependencies | No | The paper mentions 'Adam (Kingma & Ba, 2014)' and the 'Caffe MNIST tutorial page' but does not specify version numbers for any software dependencies.
Experiment Setup | Yes | Note that, similar to (Nowlan & Hinton, 1992), we weigh the log-prior contribution to the gradient by a factor of τ = 0.005. ... The remaining learning rates are set to 5 × 10^-4. ... Our Gaussian MM prior is initialized with 2^4 + 1 = 17 components. We initialize the learning rate for the weights and means, log-variances and log-mixing proportions separately. ... For one component, we fix µ_{j=0} = 0 and π_{j=0} = 0.999. ... We distribute the means of the 16 non-fixed components evenly over the range of the pre-trained weights. The variances will be initialized such that each Gaussian has significant probability mass in its region. ... The trainable mixing proportions are initialized evenly, π_j = (1 − π_{j=0})/J. An initialization sketch follows below the table.
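
The Research Type row names LeNet-300-100, the standard fully-connected MNIST model with two hidden layers of 300 and 100 units. The PyTorch sketch below is a conventional definition of that architecture, not the authors' training code.

```python
# A minimal PyTorch sketch of LeNet-300-100 (784-300-100-10) on MNIST;
# this is the standard definition of the architecture, not the paper's
# exact implementation.
import torch.nn as nn

class LeNet300100(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),               # 28x28 MNIST image -> 784-dim vector
            nn.Linear(784, 300), nn.ReLU(),
            nn.Linear(300, 100), nn.ReLU(),
            nn.Linear(100, 10),         # logits for the 10 digit classes
        )

    def forward(self, x):
        return self.net(x)
```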
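The Pseudocode row points to Algorithm 1: re-training under a mixture-of-Gaussians prior over the weights, then a post-processing step that quantizes each weight to its most responsible component. The sketch below is only an illustration of those two steps under simplifying assumptions: `tau` and the helper names are ours, the paper's hyper-priors on the mixture parameters are omitted, and the paper optimizes weights, means, log-variances and log-mixing proportions with separate learning rates.

```python
# Illustrative sketch of Algorithm 1's two steps (re-training + quantization),
# not the authors' implementation; hyper-priors on the mixture are omitted.
import math
import torch
import torch.nn.functional as F

def log_mixture_prior(w, pi, mu, log_sigma2):
    """log p(w) under a J-component Gaussian mixture, summed over all weights."""
    w = w.reshape(-1, 1)                                   # (num_weights, 1)
    var = log_sigma2.exp()                                 # (J,)
    log_comp = (torch.log(pi)
                - 0.5 * torch.log(2 * math.pi * var)
                - 0.5 * (w - mu) ** 2 / var)               # (num_weights, J)
    return torch.logsumexp(log_comp, dim=1).sum()

def retraining_loss(model, x, y, pi, mu, log_sigma2, tau=0.005):
    """Task loss minus the down-weighted log-prior over all network weights."""
    w = torch.cat([p.reshape(-1) for p in model.parameters()])
    return F.cross_entropy(model(x), y) - tau * log_mixture_prior(w, pi, mu, log_sigma2)

@torch.no_grad()
def quantize(model, pi, mu, log_sigma2):
    """Post-processing: set each weight to the mean of its most responsible component."""
    var = log_sigma2.exp()
    for p in model.parameters():
        w = p.reshape(-1, 1)
        log_resp = (torch.log(pi)
                    - 0.5 * torch.log(2 * math.pi * var)
                    - 0.5 * (w - mu) ** 2 / var)
        p.copy_(mu[log_resp.argmax(dim=1)].reshape(p.shape))
```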
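The Experiment Setup row quotes how the Gaussian mixture prior is initialized (2^4 + 1 = 17 components, a fixed zero component with π = 0.999, the remaining means spread over the pre-trained weight range). The NumPy sketch below follows that quote; the variance heuristic is our assumption, since the paper only states that each Gaussian should keep "significant probability mass in its region".

```python
# Sketch of the prior initialization quoted above.  The standard deviation
# of each component is set to the spacing between neighbouring means, which
# is an assumption, not the paper's exact rule.
import numpy as np

def init_mixture_prior(pretrained_weights, num_components=2**4 + 1, pi_zero=0.999):
    w = np.asarray(pretrained_weights).ravel()
    num_free = num_components - 1                      # 16 trainable components

    # Means: component 0 fixed at zero, the rest evenly over [min(w), max(w)].
    free_means = np.linspace(w.min(), w.max(), num_free)
    means = np.concatenate(([0.0], free_means))

    # Variances: std equal to the spacing between adjacent free means (assumption).
    spacing = (w.max() - w.min()) / (num_free - 1)
    variances = np.full(num_components, spacing ** 2)

    # Mixing proportions: the fixed zero component gets pi_zero,
    # the trainable ones share the remainder evenly.
    mixing = np.full(num_components, (1.0 - pi_zero) / num_free)
    mixing[0] = pi_zero
    return means, variances, mixing
```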