Weightless: Lossy weight encoding for deep neural network compression

Authors: Brandon Reagen, Udit Gupta, Bob Adolf, Michael Mitzenmacher, Alexander Rush, Gu-Yeon Wei, David Brooks

ICML 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate Weightless on three networks commonly used to study compression: LeNet-300-100, LeNet-5 (LeCun et al., 1998), and VGG-16 (Simonyan & Zisserman, 2015). The LeNet networks use MNIST (LeCun & Cortes, 1998) and VGG-16 uses ImageNet (Russakovsky et al., 2015). Table 1. Experimental Setup.
Researcher Affiliation | Collaboration | 1Harvard University, Cambridge, MA 2Facebook, Menlo Park, CA. Correspondence to: Brandon Reagen <reagen@fas.harvard.edu>.
Pseudocode | Yes | Algorithm 1 Weightless compression method
Open Source Code | No | The paper mentions using Keras and implementing the Bloomier filter in-house but does not provide any statement or link indicating that the source code for their Weightless method is publicly available.
Open Datasets | Yes | The LeNet networks use MNIST (LeCun & Cortes, 1998) and VGG-16 uses ImageNet (Russakovsky et al., 2015).
Dataset Splits | No | The paper mentions 'model validation error' and refers to retraining, but it does not provide specific details on the dataset splits (e.g., percentages, sample counts, or explicit methodology for training, validation, and test sets).
Hardware Specification | Yes | On an Intel i7-6700K CPU, reconstructing (decoding) the largest layers of each model takes 0.52, 1.3, and 22.8 seconds for MNIST-300-100, LeNet-5, and VGG-16 respectively; on the ARM A53 mobile-class CPU used in smartphones since 2014 (Qualcomm, 2018), the same layers take 7.1, 18, and 296 seconds to reconstruct. (An illustrative sketch of the Bloomier-filter lookup behind this decoding step follows the table.)
Software Dependencies | No | The paper mentions 'Keras (Chollet, 2017)' and 'Mersenne Twister pseudorandom number generator' but does not provide specific version numbers for Keras or other software dependencies.
Experiment Setup | Yes | Table 1 shows the models and simplification parameters used in our experiments. We apply Weightless to the largest layers in each model. ... Weights are pruned using either a magnitude threshold or dynamic network surgery (see Section 3.2). Once pruned, weights are clustered with k-means. We found that careful choice of initial seeds helped to minimize the number of clusters needed. We use density-based initialization on a per-layer basis, where initial cluster values are assigned based on the input weight distribution. Tuning the filter size: The use of Bloomier filters introduces an additional hyperparameter t that sets the filter's encoding strength... Experimentally, for the models considered, we find that t typically falls in the range of 6 to 9. (Illustrative sketches of the pruning, clustering, and filter-lookup steps follow the table.)
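
The Experiment Setup row describes two preprocessing steps, magnitude-threshold pruning and per-layer k-means clustering with density-based seeding, before a layer's weights are handed to the Bloomier filter encoder. The sketch below is one plausible reading of those steps; the function names, the quantile-based seeding, and the scikit-learn dependency are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def prune_by_magnitude(weights, threshold):
    """Zero out weights whose magnitude falls below the threshold."""
    mask = np.abs(weights) >= threshold
    return weights * mask, mask

def cluster_surviving_weights(weights, mask, n_clusters):
    """Quantize surviving weights with k-means; centroids are seeded from
    quantiles of the surviving-weight distribution, a simple density-based
    initialization (the paper's exact seeding scheme is not spelled out)."""
    survivors = weights[mask].reshape(-1, 1)
    quantiles = np.linspace(0.0, 1.0, n_clusters)
    init = np.quantile(survivors, quantiles).reshape(-1, 1)
    km = KMeans(n_clusters=n_clusters, init=init, n_init=1).fit(survivors)
    quantized = weights.copy()
    quantized[mask] = km.cluster_centers_[km.labels_].ravel()
    return quantized, km.labels_, km.cluster_centers_.ravel()
```

Under this reading, each pruned-and-clustered layer reduces to a set of (weight index, cluster label) pairs, which is the form the Bloomier filter then encodes.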
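
The Hardware Specification row quotes per-layer reconstruction times, and the setup row notes the hyperparameter t that sets the filter's encoding strength; both refer to querying a Bloomier filter for every weight position. The following is a generic sketch of such a lookup (XOR of k probed t-bit cells with a per-key mask), in which any decoded value outside the cluster range is read as a weight that was pruned to zero. The hashing scheme, function names, and parameters are assumptions for illustration, not the paper's in-house implementation.

```python
import hashlib

def _hashes(key, k, m, t, seed=0):
    """Derive k cell indices and a t-bit mask for `key` from a single digest."""
    digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
    idxs = [int.from_bytes(digest[4 * i:4 * i + 4], "big") % m for i in range(k)]
    mask = int.from_bytes(digest[-4:], "big") % (1 << t)
    return idxs, mask

def query(table, key, k, t, num_clusters):
    """Reconstruct one weight: XOR the k probed cells with the key's mask;
    a result >= num_clusters is interpreted as a pruned (zero) weight."""
    m = len(table)
    idxs, mask = _hashes(key, k, m, t)
    value = mask
    for i in idxs:
        value ^= table[i]
    return value if value < num_clusters else None  # None -> weight is zero

# Decoding a layer amounts to one query per weight position, e.g. (hypothetical sizes):
# layer = [query(table, i, k=4, t=8, num_clusters=16) for i in range(n_weights)]
```

A larger t makes spurious in-range values rarer (stronger encoding) at the cost of wider table cells, which is consistent with the paper's report that t typically falls between 6 and 9 for the models considered.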