AutoInit: Analytic Signal-Preserving Weight Initialization for Neural Networks

Authors: Garrett Bingham, Risto Miikkulainen

AAAI 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments demonstrate that AutoInit improves performance of convolutional, residual, and transformer networks across a range of activation function, dropout, weight decay, learning rate, and normalizer settings, and does so more reliably than data-dependent initialization methods.
Researcher Affiliation | Collaboration | Garrett Bingham (1, 2) and Risto Miikkulainen (1, 2); (1) The University of Texas at Austin, Austin, TX 78712; (2) Cognizant AI Labs, San Francisco, CA 94105; bingham@cs.utexas.edu, risto@cs.utexas.edu
Pseudocode | Yes | Algorithm 1: AutoInit
    Input: Network with layers L, directed edges E
    output_layers = {l ∈ L | (l, l′) ∉ E ∀ l′ ∈ L}
    for output_layer in output_layers do
        initialize(output_layer)
    def initialize(layer):
        layers_in = {l ∈ L | (l, layer) ∈ E}
        i = 1
        for layer_in in layers_in do
            µ_in_i, ν_in_i = initialize(layer_in)
            i = i + 1
        µ_in = (µ_in_1, µ_in_2, ..., µ_in_N)
        ν_in = (ν_in_1, ν_in_2, ..., ν_in_N)
        if layer has weights θ then
            initialize θ s.t. g_layer,θ(µ_in, ν_in) = (0, 1)
            µ_out, ν_out = 0, 1
        else
            µ_out, ν_out = g_layer(µ_in, ν_in)
        return µ_out, ν_out
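Read as a recursive traversal from the output layers back toward the inputs, Algorithm 1 can be sketched in Python as below. This is a minimal illustration under assumed data structures (the dictionary fields "weights", "fan_in", "g" and the "weight_scale" rule are hypothetical placeholders, not the actual autoinit package internals).

    import math

    # Minimal sketch of Algorithm 1 (illustrative only; not the autoinit package internals).
    # Each layer is a dict: {"weights": array or None, "fan_in": int, "g": callable}.
    # "g" maps a list of incoming (mean, variance) pairs to an outgoing (mean, variance) pair.

    def autoinit_sketch(layers, edges):
        """layers: dict name -> layer dict; edges: set of (src, dst) name pairs."""
        has_outgoing = {src for src, _ in edges}
        output_layers = [name for name in layers if name not in has_outgoing]
        cache = {}  # memoize per-layer output statistics so shared layers are visited once

        def initialize(name):
            if name in cache:
                return cache[name]
            layer = layers[name]
            incoming = [src for src, dst in edges if dst == name]
            stats_in = [initialize(src) for src in incoming]
            if layer.get("weights") is not None:
                # Choose the weight scale so the layer's output has zero mean and
                # unit variance given the incoming statistics (illustrative rule only).
                var_in = sum(v for _, v in stats_in) if stats_in else 1.0
                layer["weight_scale"] = 1.0 / math.sqrt(layer["fan_in"] * var_in)
                stats_out = (0.0, 1.0)
            else:
                # Weightless layers (activations, pooling, ...) propagate statistics analytically.
                stats_out = layer["g"](stats_in)
            cache[name] = stats_out
            return stats_out

        for name in output_layers:
            initialize(name)
        return layers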
Open Source Code | Yes | The AutoInit package provides a wrapper around TensorFlow models and is available at https://github.com/cognizant-ai-labs/autoinit.
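For context, the intended usage is a thin wrapper around an existing Keras model. The snippet below is a hedged sketch based only on the package description; the import path and the initialize_model method name are assumptions, so the repository README should be treated as authoritative.

    import tensorflow as tf
    from autoinit import AutoInit  # import path assumed from the package description

    # Build an ordinary Keras model first; AutoInit is applied to the finished model.
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(32, 32, 3)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(10),
    ])

    # Re-initialize the weights so signal mean and variance are preserved layer by layer.
    model = AutoInit().initialize_model(model)  # method name assumed; see the repository README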
Open Datasets | Yes | The network is trained on the CIFAR-10 dataset (Krizhevsky, Hinton et al. 2009) using the standard setup (Appendix B). The model is trained on Imagenette, a subset of 10 classes from the ImageNet dataset (Howard 2019; Deng et al. 2009). ResNet-50 was trained from scratch on ImageNet... Tasks: Using CoDeepNEAT, networks are evolved for their performance in vision (MNIST), language (Wikipedia Toxicity), tabular (PMLB Adult), multi-task (Omniglot), and transfer learning (Oxford 102 Flower) tasks (Appendix C).
Dataset Splits | Yes | To avoid overfitting to the test set when evaluating such a large number of activation functions, the accuracy with a balanced validation set of 5000 images is reported instead.
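A minimal sketch of how such a class-balanced 5,000-image validation set could be carved out of the CIFAR-10 training data is shown below. The excerpt does not say how the 500 images per class are chosen, so taking the first 500 per class here is an assumption.

    import numpy as np
    import tensorflow as tf

    # Load CIFAR-10 and hold out a class-balanced validation set of 5,000 images
    # (500 per class), leaving 45,000 images for training.
    (x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
    y_train = y_train.flatten()

    val_idx = np.concatenate([np.where(y_train == c)[0][:500] for c in range(10)])
    train_idx = np.setdiff1d(np.arange(len(y_train)), val_idx)

    x_val, y_val = x_train[val_idx], y_train[val_idx]
    x_train, y_train = x_train[train_idx], y_train[train_idx]

    # Sanity check: 5,000 validation images, 500 from each of the 10 classes.
    assert len(y_val) == 5000 and all(np.sum(y_val == c) == 500 for c in range(10))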
Hardware Specification | No | The paper mentions 'Appendix F: Computing Infrastructure' for details, but no specific hardware models (such as GPU/CPU names or types) are detailed within the main text.
Software Dependencies | No | The abstract mentions 'TensorFlow models', but no specific version number is provided for TensorFlow or any other software dependency.
Experiment Setup | Yes | Hyperparameter Variation: In separate experiments, the activation function, dropout rate, weight decay, and learning rate multiplier were changed. In particular, the baseline comparison is the Glorot Uniform strategy (also called Xavier initialization; Glorot and Bengio 2010), where weights are sampled from U(−√(6/(fan_in + fan_out)), √(6/(fan_in + fan_out))). In particular, the initialization is He Normal (He et al. 2015), where weights are sampled from N(0, √(2/fan_in)). All schedules included a linear warm-up phase followed by a decay to zero using cosine annealing (Loshchilov and Hutter 2016). The default initialization samples convolutional layers from N(0, √(2/fan_out)) and fully-connected layers from U(−√(6/(fan_in + fan_out)), √(6/(fan_in + fan_out))). With the default initialization, weights were sampled from U(−√(6/(fan_in + fan_out)), √(6/(fan_in + fan_out))) (Glorot and Bengio 2010). With AutoInit, the weights were sampled from N(0, 1/√(fan_in · µ_f)) to account for an arbitrary activation function f; the dropout adjustment (Section 5) was not used. Finally, AutoInit++ takes advantage of f as described above, but is otherwise identical to AutoInit.
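As a concrete reference for the sampling rules quoted above, the following NumPy sketch draws weights under each scheme. The mu_f value passed to the AutoInit-style rule is purely illustrative, and reading µ_f as an activation-dependent scaling constant inside the square root is an assumption based on the excerpt.

    import numpy as np

    rng = np.random.default_rng(0)

    def glorot_uniform(shape, fan_in, fan_out):
        # Glorot/Xavier uniform: U(-sqrt(6/(fan_in+fan_out)), +sqrt(6/(fan_in+fan_out)))
        limit = np.sqrt(6.0 / (fan_in + fan_out))
        return rng.uniform(-limit, limit, size=shape)

    def he_normal(shape, fan_in):
        # He normal: zero mean, standard deviation sqrt(2/fan_in)
        return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=shape)

    def autoinit_activation_aware(shape, fan_in, mu_f):
        # AutoInit-style rule for an arbitrary activation f: zero mean,
        # standard deviation 1/sqrt(fan_in * mu_f); the interpretation of mu_f
        # is assumed from the excerpt, not taken from the package source.
        return rng.normal(0.0, 1.0 / np.sqrt(fan_in * mu_f), size=shape)

    # Example: the kernel of a 256 -> 128 dense layer under each scheme.
    w_glorot = glorot_uniform((256, 128), fan_in=256, fan_out=128)
    w_he = he_normal((256, 128), fan_in=256)
    w_auto = autoinit_activation_aware((256, 128), fan_in=256, mu_f=0.5)  # mu_f value is illustrative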