Frivolous Units: Wider Networks Are Not Really That Wide

Authors: Stephen Casper, Xavier Boix, Vanessa D'Amario, Ling Guo, Martin Schrimpf, Kasper Vinken, Gabriel Kreiman

AAAI 2021, pp. 6921-6929

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | A remarkable characteristic of overparameterized deep neural networks (DNNs) is that their accuracy does not degrade when the network width is increased. Recent evidence suggests that developing compressible representations allows the complexity of large networks to be adjusted for the learning task at hand. However, these representations are poorly understood. A promising strand of research inspired by biology involves studying representations at the unit level, as it offers a more granular interpretation of the neural mechanisms. In order to better understand what facilitates increases in width without decreases in accuracy, we ask: Are there mechanisms at the unit level by which networks control their effective complexity? If so, how do these depend on the architecture, dataset, and hyperparameters? We identify two distinct types of frivolous units that proliferate when the network's width increases: prunable units, which can be dropped out of the network without significant change to the output, and redundant units, whose activities can be expressed as a linear combination of others. These units imply complexity constraints, as the function the network computes could be expressed without them. We also identify how the development of these units can be influenced by architecture and a number of training factors. Together, these results help to explain why the accuracy of DNNs does not degrade when width is increased and highlight the importance of frivolous units toward understanding implicit regularization in DNNs. (The two unit types are illustrated in the first sketch after the table.)
Researcher Affiliation | Academia | Stephen Casper,1,2 Xavier Boix,1,2,3 Vanessa D'Amario,3 Ling Guo,4 Martin Schrimpf,2,3 Kasper Vinken,1,2 Gabriel Kreiman1,2. 1Boston Children's Hospital, Harvard Medical School, USA; 2Center for Brains, Minds, and Machines (CBMM); 3Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, USA; 4Neuroscience Graduate Program, University of California, San Francisco, USA
Pseudocode | No | The paper states, "In the Appendix, we provide algorithmic details for removing the redundant units and refactoring the outgoing weights of the non-redundant ones." However, no pseudocode or algorithm block is present in the provided main body of the paper. (A hedged sketch of the underlying linear-algebra idea appears after the table.)
Open Source Code | No | The paper does not provide any explicit statements about releasing source code or links to a code repository for the methodology described.
Open Datasets | Yes | For larger scale experiments, we used the ImageNet (Russakovsky et al. 2015) and CIFAR-10 (Krizhevsky, Hinton et al. 2009) datasets.
Dataset Splits | No | The paper mentions "ImageNet Validation" in Figure 1, but does not provide specific details about the validation dataset split (e.g., percentages, sample counts, or explicit splitting methodology).
Hardware Specification | Yes | Due to hardware limitations (we used a dgx1 with 8x NVIDIA V100 GPUs 32GB)
Software Dependencies | No | The paper does not specify version numbers for any software dependencies or libraries used in the experiments.
Experiment Setup | Yes | Table 1: Network training and performance details: BN refers to batch normalization, DA refers to data augmentation, DO refers to dropout, and WD refers to L2 weight decay. Best refers to learning rate/batch size combinations that achieved the highest accuracy. Stars (*) indicate factors for which we tested multiple hyperparameters/variants. ... For all networks, increasing model size resulted in equal or improved performance as shown in Fig. 1. ... test three common methods of weight initialization. Fig. 4c presents results for AlexNets trained with Glorot (Glorot and Bengio 2010), He (He et al. 2015), and LeCun (LeCun et al. 2012) initializations, which each initialize weights using Gaussian distributions with variances depending on the layer widths. (The three initialization schemes are sketched after the table.)
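
The two notions of frivolous units quoted in the Research Type row can be made concrete with a little linear algebra. The sketch below is illustrative only: the function names, the assumption of a linear readout, and the choice to measure mean absolute change are ours, not the paper's exact criteria, which operate on trained convolutional networks.

```python
# Minimal sketch of the two kinds of frivolous units described in the abstract.
# Assumes we already have a matrix of unit activations for one layer and a
# linear readout; all names, shapes, and measures here are illustrative.
import numpy as np


def prunability_gap(layer_acts, readout_weights, unit):
    """Mean change in the layer's linear readout when `unit` is zeroed out,
    a crude proxy for dropping the unit from the network."""
    full = layer_acts @ readout_weights            # (n_samples, n_outputs)
    ablated_acts = layer_acts.copy()
    ablated_acts[:, unit] = 0.0                    # "drop" the unit
    ablated = ablated_acts @ readout_weights
    return np.abs(full - ablated).mean()


def redundancy_error(layer_acts, unit):
    """Residual error when expressing `unit`'s activity as a linear
    combination of the other units in the same layer."""
    others = np.delete(layer_acts, unit, axis=1)   # (n_samples, n_units - 1)
    target = layer_acts[:, unit]
    coefs, *_ = np.linalg.lstsq(others, target, rcond=None)
    return np.abs(target - others @ coefs).mean()
```

Units with a near-zero prunability gap correspond to prunable units, and units with a near-zero redundancy error correspond to redundant units; what counts as "near zero" depends on a tolerance the excerpted text does not specify.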
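The Pseudocode row notes that the procedure for removing redundant units and refactoring the outgoing weights of the non-redundant ones is deferred to the paper's Appendix. The sketch below shows only the underlying idea under the assumption of a linear next layer: if a unit's activity is a linear combination of the others, its downstream contribution can be folded into their outgoing weights. It is not the authors' algorithm.

```python
# Hedged sketch, not the paper's algorithm: remove one redundant unit and fold
# its downstream contribution into the remaining units' outgoing weights,
# assuming the layer feeds a purely linear map w_out.
import numpy as np


def remove_redundant_unit(acts, w_out, r):
    """acts  : (n_samples, n_units) activations of the layer
    w_out : (n_units, n_outputs) outgoing weights of the layer
    r     : index of the redundant unit to remove"""
    keep = [i for i in range(acts.shape[1]) if i != r]
    # Express unit r's activity as a linear combination of the kept units.
    coefs, *_ = np.linalg.lstsq(acts[:, keep], acts[:, r], rcond=None)
    # acts[:, r] @ w_out[r] ~= acts[:, keep] @ np.outer(coefs, w_out[r]),
    # so add that term to the kept units' outgoing weights.
    w_out_new = w_out[keep] + np.outer(coefs, w_out[r])
    return acts[:, keep], w_out_new
```

After removal, acts[:, keep] @ w_out_new reproduces the original acts @ w_out exactly when the removed unit is perfectly redundant, and up to the least-squares residual otherwise.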
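The Experiment Setup row names three Gaussian weight-initialization schemes whose variances depend on the layer widths. The sketch below spells out the standard fan-in/fan-out variance formulas from the cited papers; the helper function and its arguments are illustrative rather than taken from the paper.

```python
# Illustrative helper for the three initializations named above; the variance
# formulas are the standard ones from the cited papers, the rest is ours.
import numpy as np


def init_weights(fan_in, fan_out, scheme="he", rng=None):
    """Sample a (fan_in, fan_out) weight matrix from a zero-mean Gaussian
    whose variance depends on the layer widths."""
    if rng is None:
        rng = np.random.default_rng()
    if scheme == "lecun":        # LeCun et al. 2012: Var = 1 / fan_in
        var = 1.0 / fan_in
    elif scheme == "glorot":     # Glorot and Bengio 2010: Var = 2 / (fan_in + fan_out)
        var = 2.0 / (fan_in + fan_out)
    elif scheme == "he":         # He et al. 2015: Var = 2 / fan_in
        var = 2.0 / fan_in
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return rng.normal(0.0, np.sqrt(var), size=(fan_in, fan_out))
```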