AutoInit: Analytic Signal-Preserving Weight Initialization for Neural Networks
Authors: Garrett Bingham, Risto Miikkulainen
AAAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments demonstrate that AutoInit improves performance of convolutional, residual, and transformer networks across a range of activation functions, dropout, weight decay, learning rate, and normalizer settings, and does so more reliably than data-dependent initialization methods. |
| Researcher Affiliation | Collaboration | Garrett Bingham (1,2) and Risto Miikkulainen (1,2); (1) The University of Texas at Austin, Austin, TX 78712; (2) Cognizant AI Labs, San Francisco, CA 94105; bingham@cs.utexas.edu, risto@cs.utexas.edu |
| Pseudocode | Yes | Algorithm 1: AutoInit. Input: network with layers L and directed edges E. output_layers = {l ∈ L \| (l, l′) ∉ E ∀ l′ ∈ L}; for each output_layer in output_layers, call initialize(output_layer). def initialize(layer): layers_in = {l ∈ L \| (l, layer) ∈ E}; for i = 1, …, N over layers_in, (µ_in_i, ν_in_i) = initialize(layer_in_i); µ_in = (µ_in_1, µ_in_2, …, µ_in_N), ν_in = (ν_in_1, ν_in_2, …, ν_in_N); if layer has weights θ, initialize θ s.t. g_layer,θ(µ_in, ν_in) = (0, 1) and set (µ_out, ν_out) = (0, 1); else (µ_out, ν_out) = g_layer(µ_in, ν_in); return (µ_out, ν_out). (A Python sketch of this recursion appears after the table.) |
| Open Source Code | Yes | The AutoInit package provides a wrapper around TensorFlow models and is available at https://github.com/cognizant-ai-labs/autoinit. |
| Open Datasets | Yes | The network is trained on the CIFAR-10 dataset (Krizhevsky, Hinton et al. 2009) using the standard setup (Appendix B). The model is trained on Imagenette, a subset of 10 classes from the ImageNet dataset (Howard 2019; Deng et al. 2009). ResNet-50 was trained from scratch on ImageNet... Tasks: Using CoDeepNEAT, networks are evolved for their performance in vision (MNIST), language (Wikipedia Toxicity), tabular (PMLB Adult), multi-task (Omniglot), and transfer learning (Oxford 102 Flower) tasks (Appendix C). |
| Dataset Splits | Yes | To avoid overfitting to the test set when evaluating such a large number of activation functions, the accuracy with a balanced validation set of 5000 images is reported instead. (See the validation-split sketch after the table.) |
| Hardware Specification | No | The paper mentions 'Appendix F: Computing Infrastructure' for details, but no specific hardware models (like GPU/CPU names or types) are detailed within the main text. |
| Software Dependencies | No | The abstract mentions 'TensorFlow models', but no specific version number is provided for TensorFlow or any other software dependency. |
| Experiment Setup | Yes | Hyperparameter Variation: In separate experiments, the activation function, dropout rate, weight decay, and learning rate multiplier were changed. The baseline comparison is the Glorot Uniform strategy (also called Xavier initialization; Glorot and Bengio 2010), where weights are sampled from U[−√(6/(fan_in+fan_out)), √(6/(fan_in+fan_out))]. In another experiment the initialization is He Normal (He et al. 2015), where weights are sampled from N(0, √(2/fan_in)). All schedules included a linear warm-up phase followed by a decay to zero using cosine annealing (Loshchilov and Hutter 2016). The default initialization initializes convolutional layers from N(0, √(2/fan_out)) and fully-connected layers from U[−√(6/(fan_in+fan_out)), √(6/(fan_in+fan_out))]. With the default initialization, weights were sampled from U[−√(6/(fan_in+fan_out)), √(6/(fan_in+fan_out))] (Glorot and Bengio 2010). With AutoInit, the weights were sampled from N(0, 1/√(fan_in·µ_f)) to account for an arbitrary activation function f; the dropout adjustment (Section 5) was not used. Finally, AutoInit++ takes advantage of f as described above, but is otherwise identical to AutoInit. (See the initializer sketch after the table.) |
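
To make the recursion in Algorithm 1 concrete, here is a minimal Python sketch of the output-to-input traversal. The `Layer` class with `inputs`, `has_weights`, `signal_map`, and `set_scale` members is hypothetical, invented for this sketch; it is not the API of the released autoinit package, and the assumption that model inputs carry mean-0, variance-1 signal is the sketch's, not a quote from the paper.

```python
# Minimal sketch of Algorithm 1 (AutoInit); not the released autoinit API.
# Assumes a hypothetical Layer class with:
#   layer.inputs      -> list of upstream Layer objects (empty for model inputs)
#   layer.has_weights -> True if the layer has trainable weights theta
#   layer.signal_map  -> g_layer: maps incoming (mean, var) stats to outgoing ones
#   layer.set_scale   -> rescales theta so the outgoing signal has mean 0, var 1

def autoinit(output_layers):
    """Initialize a network by traversing it from its output layers."""
    cache = {}  # memoize (mean, var) per layer so shared layers are visited once

    def initialize(layer):
        if layer in cache:
            return cache[layer]

        if not layer.inputs:
            # Model inputs: assume the incoming data is standardized to
            # mean 0, variance 1 (an assumption of this sketch).
            mu_in, nu_in = (0.0,), (1.0,)
        else:
            # Recursively obtain the (mean, variance) estimates of all inputs.
            in_stats = [initialize(l_in) for l_in in layer.inputs]
            mu_in = tuple(s[0] for s in in_stats)
            nu_in = tuple(s[1] for s in in_stats)

        if layer.has_weights:
            # Choose the weight scale so the layer's output signal is
            # analytically mean 0 and variance 1 given the incoming stats.
            layer.set_scale(mu_in, nu_in)
            mu_out, nu_out = 0.0, 1.0
        else:
            # Weightless layers (activations, pooling, dropout, ...) just
            # transform the signal statistics via their mapping g.
            mu_out, nu_out = layer.signal_map(mu_in, nu_in)

        cache[layer] = (mu_out, nu_out)
        return mu_out, nu_out

    for out in output_layers:
        initialize(out)
```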
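
The balanced 5000-image validation split mentioned in the Dataset Splits row can be carved out of CIFAR-10's standard 50,000-image training set along the lines below. The 500-images-per-class figure is an assumption consistent with a balanced 10-class split of 5000 images; the paper's exact selection procedure may differ.

```python
# Sketch: hold out a balanced 5,000-image validation set from CIFAR-10.
# 500 images per class is an assumption consistent with the reported
# balanced 5,000-image split; the paper's exact selection rule may differ.
import numpy as np
import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
y_train = y_train.flatten()

# Take the first 500 examples of each of the 10 classes for validation.
val_idx = np.concatenate(
    [np.where(y_train == c)[0][:500] for c in range(10)]
)
train_idx = np.setdiff1d(np.arange(len(y_train)), val_idx)

x_val, y_val = x_train[val_idx], y_train[val_idx]
x_train, y_train = x_train[train_idx], y_train[train_idx]

print(x_train.shape, x_val.shape)  # (45000, 32, 32, 3) (5000, 32, 32, 3)
```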
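
The sampling distributions quoted in the Experiment Setup row can be written out as weight scales. In this sketch µ_f is taken to be E[f(z)²] for z ~ N(0, 1) and is estimated by Monte Carlo; that reading of µ_f, and the numerical estimate, are assumptions of the sketch rather than the paper's analytic moment computation.

```python
# Sketch of the weight-scale rules quoted above (standard deviations or
# uniform bounds of the sampling distributions). mu_f is taken here to be
# E[f(z)^2] under z ~ N(0, 1), estimated by Monte Carlo -- an assumption of
# this sketch, standing in for AutoInit's analytic moment computation.
import numpy as np

def glorot_uniform_bound(fan_in, fan_out):
    # U[-b, b] with b = sqrt(6 / (fan_in + fan_out))
    return np.sqrt(6.0 / (fan_in + fan_out))

def he_normal_std(fan_in):
    # N(0, sqrt(2 / fan_in))
    return np.sqrt(2.0 / fan_in)

def autoinit_std(fan_in, f, n_samples=1_000_000, seed=0):
    # N(0, 1 / sqrt(fan_in * mu_f)), with mu_f estimated numerically.
    z = np.random.default_rng(seed).standard_normal(n_samples)
    mu_f = np.mean(f(z) ** 2)
    return 1.0 / np.sqrt(fan_in * mu_f)

relu = lambda z: np.maximum(z, 0.0)
print(glorot_uniform_bound(256, 256))   # Glorot/Xavier uniform bound
print(he_normal_std(256))               # He Normal std
print(autoinit_std(256, relu))          # ~He Normal for ReLU (mu_f ~= 0.5)
```

Under this reading of µ_f, the activation-aware scale recovers the He Normal scale for ReLU (µ_f ≈ 0.5), which is consistent with the relationship between the two formulas quoted in the table.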