Are wider nets better given the same number of parameters?

Authors: Anna Golubeva, Guy Gur-Ari, Behnam Neyshabur

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical studies demonstrate that the performance of neural networks improves with increasing number of parameters. We compare different ways of increasing model width while keeping the number of parameters constant. We show that for models initialized with a random, static sparsity pattern in the weight tensors, network width is the determining factor for good performance, while the number of weights is secondary, as long as the model achieves high training accuracy. In this section, we first explain our experimental methodology and then investigate the effectiveness of different approaches to increase width while keeping the number of parameters fixed. Figure 1: Test accuracy of ResNet-18 as a function of width.
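As an aside on the method described in the quoted passage (increasing width while keeping the number of weights fixed via a random, static sparsity pattern), the following is a minimal PyTorch sketch; the class name, the masking mechanism, and the layer sizes are illustrative assumptions, not code from the authors' repository.

import torch
import torch.nn as nn

# Hypothetical sketch: a widened linear layer whose weight tensor is masked by
# a fixed random sparsity pattern, so that its number of non-zero weights
# matches a narrower dense baseline.
class StaticSparseLinear(nn.Linear):
    def __init__(self, in_features, out_features, num_nonzero):
        super().__init__(in_features, out_features, bias=True)
        # Pick num_nonzero weight entries uniformly at random; the mask is
        # fixed at initialization and never changes during training.
        total = in_features * out_features
        keep = torch.randperm(total)[:num_nonzero]
        mask = torch.zeros(total)
        mask[keep] = 1.0
        self.register_buffer("mask", mask.view(out_features, in_features))
        with torch.no_grad():
            self.weight.mul_(self.mask)  # zero out the pruned entries

    def forward(self, x):
        # Multiplying by the mask in the forward pass keeps pruned entries at
        # zero; their gradients are zeroed by the same multiplication.
        return nn.functional.linear(x, self.weight * self.mask, self.bias)

# A dense 100->100 layer has 10,000 weights; a 4x wider sparse layer (400
# units) with the same weight budget keeps 10,000 of its 40,000 entries.
dense = nn.Linear(100, 100)
wide_sparse = StaticSparseLinear(100, 400, num_nonzero=100 * 100)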
Researcher Affiliation | Collaboration | Anna Golubeva, Perimeter Institute for Theoretical Physics, Waterloo, Canada (agolubeva@pitp.ca); Behnam Neyshabur, Blueshift, Alphabet, Mountain View, CA (neyshabur@google.com); Guy Gur-Ari, Blueshift, Alphabet, Mountain View, CA (guyga@google.com)
Pseudocode | Yes | Appendix B (SPARSITY DISTRIBUTION CODE): The following code implements our algorithm for distributing sparsity over model layers. Figure 4 illustrates the procedure. def get_ntf(num_to_freeze_tot, num_W, tensor_dims, lnames_sorted):
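Only the signature of get_ntf ("get number to freeze") is quoted above. Below is a hedged reconstruction of what such a routine could look like, inferred purely from the argument names; the actual allocation policy in the paper's Appendix B may differ.

import numpy as np

def get_ntf(num_to_freeze_tot, num_W, tensor_dims, lnames_sorted):
    """Split a total freezing budget across layers (hypothetical sketch).

    Assumed meanings of the arguments:
      num_to_freeze_tot: total number of weights to freeze in the model
      num_W:             total number of weights in the model
      tensor_dims:       dict {layer_name: weight-tensor shape}
      lnames_sorted:     layer names sorted by layer size, ascending
    Returns a dict {layer_name: number of weights to freeze in that layer}.
    """
    sizes = {name: int(np.prod(tensor_dims[name])) for name in lnames_sorted}
    ntf = {}
    remaining_budget = num_to_freeze_tot
    remaining_weights = num_W
    for name in lnames_sorted:
        # Freeze each layer in proportion to its share of the remaining
        # weights; visiting small layers first lets larger layers absorb
        # rounding leftovers, and no layer is asked to freeze more weights
        # than it has.
        share = int(round(remaining_budget * sizes[name] / remaining_weights))
        share = min(share, sizes[name])
        ntf[name] = share
        remaining_budget -= share
        remaining_weights -= sizes[name]
    return ntf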
Open Source Code | Yes | Code is available at https://github.com/google-research/wide-sparse-nets
Open Datasets | Yes | We train families of ResNet-18 models on ImageNet, CIFAR-10, CIFAR-100 and SVHN, covering a range of widths and model sizes. We study a fully-connected network with one hidden layer trained on MNIST.
Dataset Splits | No | No explicit statement of train/validation/test split percentages or sample counts, and no clear reference to predefined splits for all datasets, was found.
Hardware Specification | No | No specific hardware details (e.g., exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running experiments were found.
Software Dependencies | No | No version numbers are given for any software. In all experiments, we use a standard PyTorch implementation of the ResNet-18 model.
Experiment Setup | Yes | All models are trained using SGD with momentum=0.9, Cross-Entropy loss, and initial learning rate 0.1. The learning rate value and schedule were tuned for the smallest baseline model. We do not apply early stopping, and we report the best achieved test accuracy. For ImageNet, we use weight decay 1e-4, cosine learning rate schedule, and train for 150 epochs. ... For other datasets, we use weight decay 5e-4, train for 300 epochs, and the initial learning rate 0.1 is decayed at epochs 50, 120 and 200 with gamma=0.1.
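The quoted hyperparameters translate into a standard PyTorch configuration roughly as sketched below; the model constructor, dataset loading, and training loop are placeholders, and this is not code from the authors' repository.

import torch
import torchvision

# Setup matching the quoted CIFAR/SVHN hyperparameters: SGD with momentum 0.9,
# initial learning rate 0.1, weight decay 5e-4, 300 epochs, and step decay by
# 0.1 at epochs 50, 120 and 200. The model is a placeholder.
model = torchvision.models.resnet18(num_classes=10)
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[50, 120, 200], gamma=0.1)

# For ImageNet the paper instead reports weight decay 1e-4, a cosine learning
# rate schedule, and 150 epochs, e.g.:
#   optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
#                               momentum=0.9, weight_decay=1e-4)
#   scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=150)

for epoch in range(300):
    # train_one_epoch(model, optimizer, criterion, train_loader)  # placeholder
    scheduler.step()  # advance the step-decay schedule once per epoch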