MetaInit: Initializing learning by learning to initialize

Authors: Yann N. Dauphin, Samuel Schoenholz

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct experiments on plain and residual networks and show that the algorithm can automatically recover from a class of bad initializations. MetaInit allows us to train networks and achieve performance competitive with the state-of-the-art without batch normalization or residual connections. In particular, we find that this approach outperforms normalization for networks without skip connections on CIFAR-10 and can scale to Resnet-50 models on Imagenet.
Researcher Affiliation | Industry | Yann N. Dauphin, Google AI, ynd@google.com; Samuel S. Schoenholz, Google AI, schsam@google.com
Pseudocode | Yes | Figure 2: Basic PyTorch code for the MetaInit algorithm. import torch def gradient_quotient(loss, params, eps=1e-5): [...] def metainit(model, criterion, x_size, y_size, lr=0.1, momentum=0.9, steps=500, eps=1e-5): [...]
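The two functions elided above compute the gradient quotient and run the norm-only descent loop described in the paper. Below is a minimal illustrative sketch, not the authors' Figure 2 code: it assumes the gradient quotient GQ(L, θ) = mean_i |[(g − Hg)/(g + ε·sign(g))]_i − 1| and a sign-SGD-with-momentum update on each weight tensor's norm; helper names (`velocity`, `scale_grad`) and the random-data sampling details are assumptions.

```python
import torch

def gradient_quotient(loss, params, eps=1e-5):
    # g = dL/dtheta, keeping the graph so we can differentiate through it again
    grads = torch.autograd.grad(loss, params, create_graph=True)
    # Hg = d(||g||^2 / 2)/dtheta, i.e. a Hessian-vector product with v = g
    half_sq_norm = sum((g ** 2).sum() for g in grads) / 2
    hgs = torch.autograd.grad(half_sq_norm, params, create_graph=True)
    total, count = 0.0, 0
    for g, hg in zip(grads, hgs):
        # sign(g) with sign(0) = +1, so the denominator never vanishes
        sign = torch.where(g >= 0, torch.ones_like(g), -torch.ones_like(g))
        total = total + (((g - hg) / (g + eps * sign.detach())) - 1).abs().sum()
        count += g.numel()
    return total / count  # mean over all parameter entries

def metainit(model, criterion, x_size, y_size,
             lr=0.1, momentum=0.9, steps=500, eps=1e-5):
    # Tune only the norm (scale) of each weight tensor so that the gradient
    # quotient, evaluated on random Gaussian inputs and random labels, shrinks.
    model.eval()
    params = [p for p in model.parameters()
              if p.requires_grad and p.dim() >= 2]
    velocity = [0.0] * len(params)
    for step in range(steps):
        x = torch.randn(*x_size)                      # random input batch
        y = torch.randint(0, y_size, (x_size[0],))    # random labels
        gq = gradient_quotient(criterion(model(x), y),
                               list(model.parameters()), eps)
        grads = torch.autograd.grad(gq, params)
        for i, (p, g) in enumerate(zip(params, grads)):
            norm = p.data.norm()
            # signed gradient of GQ with respect to this tensor's norm
            scale_grad = torch.sign((p.data * g).sum() / norm).item()
            velocity[i] = momentum * velocity[i] - lr * scale_grad
            p.data.mul_((norm + velocity[i]) / norm)
```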
Open Source Code | No | The paper includes PyTorch code snippets in Figure 2, but it does not contain an explicit statement that the full source code for the methodology is available, nor does it provide a link to a code repository.
Open Datasets | Yes | We conduct experiments on plain and residual networks and show that the algorithm can automatically recover from a class of bad initializations. MetaInit allows us to train networks and achieve performance competitive with the state-of-the-art without batch normalization or residual connections. In particular, we find that this approach outperforms normalization for networks without skip connections on CIFAR-10 and can scale to Resnet-50 models on Imagenet.
Dataset Splits | No | The paper mentions standard datasets with predefined splits, such as CIFAR-10 and ImageNet, and refers to following setups from other papers (e.g., [35], [63]). However, it does not explicitly state train/validation/test split percentages or sample counts. For example, it only mentions 'cross-validation to select the momentum parameter' for ImageNet without detailing the validation split.
Hardware Specification | Yes | We find that successfully running the meta-algorithm for a Resnet-50 model on Imagenet takes 11 minutes on 8 Nvidia V100 GPUs.
Software Dependencies | No | The paper mentions 'PyTorch [45]', 'TensorFlow [1]', and 'JAX [16]' as frameworks that support higher-order automatic differentiation, but it does not specify the version numbers of the software used in the experiments.
Experiment Setup | Yes | We use Mixup [62] with α = 1 to regularize all models, combined with Dropout with rate 0.2 for residual networks. For plain networks without normalization, we use gradient norm clipping with the maximum norm set to 1 [11]. We use a cosine learning rate schedule [35] with a single cycle and follow the setup described in that paper. All methods use an initial learning rate of 0.1, except LSUV, which required lower learning rates of 0.01 and 0.001 for Wide Resnet 28-10 and Wide Resnet 202-4 respectively. LSUV also uses Delta Orthogonal initialization in convolutional layers for fairness, since it is an improvement over Orthogonal initialization. The batch size used for the meta-algorithm is 32. The number of meta-algorithm steps was reduced to 200 for Wide Resnet 202-4. Apart from this, we use the default meta-hyper-parameters.
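For concreteness, here is a minimal sketch of the training loop implied by the setup quoted above (SGD with initial learning rate 0.1, a single-cycle cosine schedule, gradient-norm clipping at 1, and Mixup with α = 1). The SGD momentum of 0.9, the in-line Mixup implementation, and the model/loader construction are assumptions for illustration, not details confirmed by the paper.

```python
import torch
import torch.nn as nn

def train_one_run(model, loader, epochs, lr=0.1, max_norm=1.0, mixup_alpha=1.0):
    criterion = nn.CrossEntropyLoss()
    # SGD momentum 0.9 is an assumption; the paper only states the initial LR
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    # single-cycle cosine learning-rate schedule over all epochs
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    beta = torch.distributions.Beta(mixup_alpha, mixup_alpha)
    for epoch in range(epochs):
        for x, y in loader:
            # Mixup with alpha = 1: convex combination of inputs and labels
            lam = beta.sample()
            perm = torch.randperm(x.size(0))
            out = model(lam * x + (1 - lam) * x[perm])
            loss = lam * criterion(out, y) + (1 - lam) * criterion(out, y[perm])
            optimizer.zero_grad()
            loss.backward()
            # gradient-norm clipping at 1, used for plain (unnormalized) networks
            nn.utils.clip_grad_norm_(model.parameters(), max_norm)
            optimizer.step()
        scheduler.step()
```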