Powerpropagation: A sparsity inducing weight reparameterisation

Authors: Jonathan Schwarz, Siddhant M. Jayakumar, Razvan Pascanu, Peter E. Latham, Yee Whye Teh

NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We now provide an experimental comparison of Powerpropagation to a variety of other techniques, both in the sparsity and continual learning settings. Throughout this section we will be guided by three key questions: (i) Can we provide experimental evidence for inherent sparsity? (ii) If so, can Powerprop. be successfully combined with existing sparsity techniques? (iii) Do improvements brought by Powerprop. translate to measurable advances in Continual Learning? Also, Figure 2 shows this comparison for image classification on the popular CIFAR-10 [67] and ImageNet [68] datasets using a smaller version of AlexNet [3] and ResNet50 [4] respectively. (A sketch of the Powerpropagation reparameterisation follows after this table.)
Researcher Affiliation | Collaboration | Jonathan Schwarz (DeepMind & Gatsby Unit, UCL; schwarzjn@google.com); Siddhant M. Jayakumar (DeepMind & University College London); Razvan Pascanu (DeepMind); Peter E. Latham (Gatsby Unit, UCL); Yee Whye Teh (DeepMind)
Pseudocode | Yes | Algorithm 1: Efficient PackNet (EPN) + Powerpropagation. (A sketch of the PackNet-style masking that EPN builds on follows after this table.)
Open Source Code | Yes | We provide code to reproduce the MNIST results (a) in the accompanying notebook. https://github.com/deepmind/deepmind-research/tree/master/powerpropagation
Open Datasets | Yes | Figure 1a shows the effect of increasing sparsity on the layerwise magnitude-pruning setting for LeNet [40] on MNIST [41]. Also, Figure 2 shows this comparison for image classification on the popular CIFAR-10 [67] and ImageNet [68] datasets.
Dataset Splits | Yes | terminating the search once the sparse model's performance falls short of a minimum accepted target performance γP (computed on a held-out validation set). Also, P_s ← E(X^T, y^T, φ ⊙ M_t) // Validation performance of sparse model (from Algorithm 1, Line 9). (See the sparsity-search sketch after this table.)
Hardware Specification | No | The paper discusses computational costs and efficiency (reducing the computational footprint of models), but does not specify the exact hardware (e.g., GPU/CPU models, types of accelerators) used for running the experiments.
Software Dependencies | No | The paper mentions using Adam [33] as an optimizer, but does not provide specific version numbers for programming languages, libraries, frameworks (like TensorFlow or PyTorch), or other software components used in the experiments.
Experiment Setup | Yes | Finally, it is worth noting that the choice of α does influence the optimal learning rate schedule and best results were obtained after changes to the default schedule. Also, training for 1M steps with Adam [33] (relying on the formulation in Section 2) on each task while allowing 100k retrain steps for PackNet. Also, Algorithm 1 specifies Target performance γ ∈ [0, 1]; Sparsity rates S = [s1, ..., sn]. (A configuration sketch collecting these settings follows after this table.)
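
As context for the "Research Type" row, the following is a minimal NumPy sketch of the Powerpropagation reparameterisation from Section 2 of the paper (w = φ·|φ|^(α−1)). The layer shape, the α value and the helper names are illustrative placeholders, not the paper's configuration.

```python
import numpy as np

def powerprop_weights(phi, alpha):
    """Powerpropagation: effective weights w = phi * |phi|^(alpha - 1).

    alpha = 1 recovers the standard parameterisation; alpha > 1 scales the
    gradient w.r.t. phi by alpha * |phi|^(alpha - 1), so small parameters
    receive small updates and tend to stay near zero -- the "inherent
    sparsity" the experiments probe.
    """
    return phi * np.abs(phi) ** (alpha - 1.0)

def grad_wrt_phi(grad_w, phi, alpha):
    # Chain rule through the reparameterisation:
    # dL/dphi = dL/dw * alpha * |phi|^(alpha - 1).
    return grad_w * alpha * np.abs(phi) ** (alpha - 1.0)

# Illustrative usage on a single dense layer (shape and alpha are placeholders).
rng = np.random.default_rng(0)
phi = rng.normal(scale=0.1, size=(784, 10))
w = powerprop_weights(phi, alpha=2.0)
```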
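Algorithm 1 itself is not reproduced in this summary; for the "Pseudocode" row, here is a hedged sketch of the PackNet-style per-task masking that Efficient PackNet builds on. The function name and the cumulative-mask convention are assumptions for illustration, not the paper's exact implementation.

```python
import numpy as np

def active_weights_for_task(weights, task_masks, task_id):
    """PackNet-style inference for task `task_id`: only weights claimed by
    tasks 0..task_id are used; weights owned by earlier tasks stay frozen
    during later training, which is what prevents forgetting."""
    active = np.zeros_like(weights, dtype=bool)
    for mask in task_masks[: task_id + 1]:
        active |= mask
    return weights * active
```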
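The "Dataset Splits" row quotes the validation-based stopping rule from Algorithm 1. The sketch below shows one plausible reading of that loop, with γP read as a fraction γ of a reference performance P. The `evaluate` callback (returning held-out validation accuracy) and the pruning helper are hypothetical stand-ins; the real algorithm additionally manages per-task masks.

```python
import numpy as np

def magnitude_mask(weights, sparsity):
    """Keep the largest-magnitude (1 - sparsity) fraction of weights."""
    k = int(sparsity * weights.size)
    if k == 0:
        return np.ones_like(weights, dtype=bool)
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    return np.abs(weights) > threshold

def search_sparsity(weights, evaluate, sparsity_rates, gamma, reference_perf):
    """Return the most aggressive mask whose held-out performance P_s stays
    at or above the target gamma * reference_perf (gamma in [0, 1])."""
    best = np.ones_like(weights, dtype=bool)
    for s in sorted(sparsity_rates):          # S = [s1, ..., sn]
        candidate = magnitude_mask(weights, s)
        p_s = evaluate(weights * candidate)   # cf. P_s <- E(X^T, y^T, phi ⊙ M_t)
        if p_s < gamma * reference_perf:      # falls short of the target: stop
            break
        best = candidate
    return best
```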
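Finally, the settings quoted in the "Experiment Setup" row can be collected into a configuration sketch. Only the optimizer and the step counts come from the paper; the remaining values (α, γ, the sparsity rates, the learning-rate schedule) are placeholders, since the paper only notes that the default schedule had to be adapted to the choice of α.

```python
# Hedged experiment-setup sketch; fields marked "placeholder" are illustrative.
config = {
    "optimizer": "adam",                      # Adam [33], as stated in the paper
    "train_steps_per_task": 1_000_000,        # "training for 1M steps ... on each task"
    "packnet_retrain_steps": 100_000,         # "allowing 100k retrain steps for PackNet"
    "alpha": 2.0,                             # Powerpropagation exponent (placeholder)
    "target_performance_gamma": 0.95,         # gamma in [0, 1] (placeholder)
    "sparsity_rates": [0.5, 0.8, 0.9, 0.95],  # S = [s1, ..., sn] (placeholder)
    "learning_rate_schedule": None,           # must be re-tuned per alpha (paper notes this)
}
```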