Initialization and Regularization of Factorized Neural Layers

Authors: Mikhail Khodak, Neil A. Tenenholtz, Lester Mackey, Nicolò Fusi

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Empirically, we highlight the benefits of spectral initialization and Frobenius decay across a variety of settings. In model compression, we show that they enable low-rank methods to significantly outperform both unstructured sparsity and tensor methods on the task of training low-memory residual networks." (See the sketch after this table.)
Researcher Affiliation | Collaboration | Mikhail Khodak, Carnegie Mellon University, khodak@cmu.edu; Neil Tenenholtz, Lester Mackey, Nicolò Fusi, Microsoft Research, {netenenh,lmackey,fusi}@microsoft.com
Pseudocode | No | The paper describes methods in prose and mathematical formulations but does not include any pseudocode or algorithm blocks.
Open Source Code | Yes | "Code to reproduce our results is available here: https://github.com/microsoft/fnl_paper."
Open Datasets | Yes | "In Table 1 we see that the low-rank approach, with SI & FD, dominates at the higher memory settings of ResNet across all three datasets considered, often outperforming even approaches that train an uncompressed model first. It is also close to the best compressed training approach in the lowest memory setting for CIFAR-100 (Krizhevsky, 2009) and Tiny-ImageNet (Deng et al., 2009)."
Dataset Splits | No | The paper mentions training and test sets but does not explicitly describe the methodology for creating validation splits, such as percentages or sample counts.
Hardware Specification | No | The paper does not provide specific details regarding the hardware used for running the experiments, such as GPU or CPU models.
Software Dependencies | No | The paper mentions using PyTorch and refers to several GitHub repositories for code, but it does not specify exact version numbers for any software dependencies.
Experiment Setup | Yes | "All models are trained for 200 epochs with the same optimizer settings as for the unfactorized models; the weight-decay coefficient is left unchanged when replacing it by FD, and we use a warmup epoch with a 10-times smaller learning rate for ResNet56 for stability." (See the training-schedule sketch after this table.)
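The Research Type row above refers to spectral initialization (SI) and Frobenius decay (FD) for factorized, low-rank layers. Below is a minimal PyTorch sketch of both ideas for a linear layer; the FactorizedLinear name, the Kaiming initialization of the full-rank weight, and the rank argument are our own illustrative choices, not the authors' reference implementation (that lives in the linked repository).

```python
import torch
import torch.nn as nn

class FactorizedLinear(nn.Module):
    """Low-rank linear layer W ~ U @ V, sketching spectral initialization (SI)
    and a Frobenius-decay (FD) penalty. Illustrative only; see the authors' repo
    (https://github.com/microsoft/fnl_paper) for the actual implementation."""

    def __init__(self, in_features: int, out_features: int, rank: int):
        super().__init__()
        # Initialize a full-rank weight the usual way, then take its truncated SVD.
        full = torch.empty(out_features, in_features)
        nn.init.kaiming_normal_(full)  # assumed standard init of the unfactorized layer
        U, S, Vh = torch.linalg.svd(full, full_matrices=False)
        sqrt_S = S[:rank].sqrt()
        # Spectral initialization: keep the top-rank singular directions and
        # split the singular values evenly across the two factors.
        self.U = nn.Parameter(U[:, :rank] * sqrt_S)             # (out_features, rank)
        self.V = nn.Parameter(Vh[:rank, :] * sqrt_S[:, None])   # (rank, in_features)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ (self.U @ self.V).t()

    def frobenius_decay(self) -> torch.Tensor:
        # Penalize ||U V||_F^2 of the reconstructed product rather than
        # ||U||_F^2 + ||V||_F^2 of the individual factors.
        return (self.U @ self.V).pow(2).sum()
```

In use, one would add the scaled frobenius_decay() term to the training loss and turn off standard weight decay on the factor parameters, so that FD replaces the usual penalty on them, consistent with the Experiment Setup row's note that the weight-decay coefficient is otherwise left unchanged.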
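The Experiment Setup row quotes a 200-epoch schedule with an optional warmup epoch at a 10-times smaller learning rate for ResNet56. A minimal sketch of such a schedule follows; the base learning rate, momentum, and helper name are assumptions for illustration, not values from the paper.

```python
import torch

def build_optimizer_and_schedule(model, base_lr=0.1, warmup=False):
    """Sketch of the quoted setup: train with the unfactorized model's optimizer
    settings, optionally preceded by one warmup epoch at base_lr / 10 (used for
    ResNet56). base_lr and momentum here are assumed, not taken from the paper."""
    optimizer = torch.optim.SGD(model.parameters(), lr=base_lr, momentum=0.9)

    def lr_lambda(epoch: int) -> float:
        # 10x smaller learning rate during the first (warmup) epoch only.
        return 0.1 if (warmup and epoch == 0) else 1.0

    # Call scheduler.step() once at the end of each epoch over the 200-epoch run.
    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```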