Scalable Model Compression by Entropy Penalized Reparameterization

Authors: Deniz Oktay, Johannes Ballé, Saurabh Singh, Abhinav Shrivastava

ICLR 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We evaluate the method on the MNIST, CIFAR-10 and ImageNet classification benchmarks using six distinct model architectures. Our results show that state-of-the-art model compression can be achieved in a scalable and general way without requiring complex procedures such as multi-stage training."
Researcher Affiliation | Collaboration | Deniz Oktay (Princeton University, Princeton, NJ, USA; doktay@cs.princeton.edu); Johannes Ballé (Google Research, Mountain View, CA, USA; jballe@google.com); Saurabh Singh (Google Research, Mountain View, CA, USA; saurabhsingh@google.com); Abhinav Shrivastava (University of Maryland, College Park, College Park, MD, USA; abhinav@cs.umd.edu)
Pseudocode | No | The paper describes the method in prose and through diagrams (Figure 2, Figure 3), but does not contain a formal pseudocode or algorithm block.
Open Source Code | Yes | "In addition, our code is publicly available" (footnote 4: "Refer to examples in https://github.com/tensorflow/compression").
Open Datasets | Yes | "We evaluate the method on the MNIST, CIFAR-10 and ImageNet classification benchmarks... LeNet-300-100 (LeCun et al., 1998) and LeNet-5-Caffe on MNIST (LeCun and Cortes, 2010), as well as VGG-16 (Simonyan and Zisserman, 2015) and ResNet-20 (He et al., 2016b; Zagoruyko and Komodakis, 2016) with width multiplier 4 (ResNet-20-4) on CIFAR-10 (Zagoruyko and Komodakis, 2016). For our ImageNet experiments, we evaluate our method on the ResNet-18 and ResNet-50 (He et al., 2016a) networks."
Dataset Splits | No | The paper uses well-known datasets (MNIST, CIFAR-10, ImageNet) and describes training and evaluation procedures, but it does not explicitly specify how the datasets were split into training, validation, and test sets (e.g., percentages, sample counts, or explicit use of a validation set).
Hardware Specification | No | The paper does not provide specific details about the hardware used for the experiments, such as GPU models, CPU types, or cloud computing instance specifications.
Software Dependencies | No | The paper mentions software components such as 'tensorflow/compression', 'Caffe', and 'Torch', but does not provide version numbers for these or any other software dependencies.
Experiment Setup | Yes | "We found it useful to use two separate optimizers: one to optimize the variables of the probability models q_i, and one to optimize the reparameterizations Φ and variables of the parameter decoders Ψ. While the latter is chosen to be the same optimizer typically used for the task/architecture, the former is always Adam (Kingma and Ba, 2015) with a learning rate of 10^-4. ... We train the networks using Adam with a constant learning rate of 0.001 for 200,000 iterations. ... For both VGG-16 and ResNet-20-4, we use momentum of 0.9 with an initial learning rate of 0.1, and decay by 0.2 at iterations 256,000, 384,000, and 448,000 for a total of 512,000 iterations."
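
The quoted setup amounts to maintaining two optimizers over disjoint variable groups and stepping both on each iteration. Below is a minimal TensorFlow 2 sketch of that pattern. It is not the authors' implementation (their code lives in the tensorflow/compression repository): the variable groups (prob_model_vars standing in for the probability models q_i, reparam_vars for Φ and the decoder variables Ψ) and the toy loss are hypothetical placeholders, while the learning-rate schedule mirrors the quoted VGG-16 / ResNet-20-4 settings.

    import tensorflow as tf

    # Hypothetical variable groups: probability-model parameters q_i vs.
    # reparameterizations Phi and parameter-decoder variables Psi.
    prob_model_vars = [tf.Variable(tf.zeros([10]), name="q")]
    reparam_vars = [tf.Variable(tf.random.normal([10]), name="phi")]

    # Optimizer for the probability models: always Adam at 1e-4 (per the quote).
    prob_opt = tf.keras.optimizers.Adam(learning_rate=1e-4)

    # Optimizer for the reparameterizations: whatever the task normally uses,
    # here SGD with momentum 0.9 and the quoted piecewise schedule
    # (initial rate 0.1, multiplied by 0.2 at 256k, 384k, and 448k iterations).
    lr_schedule = tf.keras.optimizers.schedules.PiecewiseConstantDecay(
        boundaries=[256_000, 384_000, 448_000],
        values=[0.1, 0.1 * 0.2, 0.1 * 0.2**2, 0.1 * 0.2**3])
    task_opt = tf.keras.optimizers.SGD(learning_rate=lr_schedule, momentum=0.9)

    @tf.function
    def train_step():
        with tf.GradientTape(persistent=True) as tape:
            # Placeholder standing in for the task loss plus entropy penalty.
            loss = (tf.reduce_sum(tf.square(reparam_vars[0]))
                    + tf.reduce_sum(tf.square(prob_model_vars[0])))
        prob_grads = tape.gradient(loss, prob_model_vars)
        reparam_grads = tape.gradient(loss, reparam_vars)
        del tape
        prob_opt.apply_gradients(zip(prob_grads, prob_model_vars))
        task_opt.apply_gradients(zip(reparam_grads, reparam_vars))
        return loss

Keeping the Adam rate for the probability models fixed at 1e-4 while the task optimizer follows its usual schedule reflects the split the quote describes: the probability models are trained the same way regardless of task or architecture.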