Deconstructing the Goldilocks Zone of Neural Network Initialization

Authors: Artem M Vysogorets, Anna Dawid, Julia Kempe

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | This paper conducts an extensive study of the Goldilocks zone of homogeneous neural networks, both analytically and empirically. Our code is available at https://github.com/avysogorets/goldilocks-zone. In Section 3, we demonstrate that the Goldilocks zone is not characterized by the initialization norm alone, refining prior beliefs of Fort & Scherlis (2019). Instead, we derive a more fundamental condition resulting in excess of positive curvature and find that it disappears due to saturated softmax on one end and vanishing logit gradients on the other. In Section 4, we closely study the interior of the Goldilocks zone and analytically associate excess of positive curvature with low model confidence, low initial loss, and low cross-entropy gradient norm. In Section 5, we report the evolution and performance of scaled homogeneous networks when optimized by gradient descent both inside and outside the Goldilocks zone. Our investigation shows that excess of positive curvature is an imperfect estimator of the initialization trainability and exhibits a range of interesting effects for initializations on the edge.
Researcher Affiliation | Academia | 1 Center for Data Science, New York University, 60 Fifth Ave, New York, NY 10011; 2 Center for Computational Quantum Physics, Flatiron Institute, 162 Fifth Ave, New York, NY 10010, USA; 3 Courant Institute, New York University, 251 Mercer St, New York, NY 10012.
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Our code is available at https://github.com/avysogorets/goldilocks-zone.
Open Datasets | Yes | In particular, we use a fully-connected LeNet-300-100 with two hidden layers on Fashion-MNIST (Xiao et al., 2017) and a convolutional LeNet-5 with 4 hidden layers on CIFAR-10 (LeCun et al., 1998; Krizhevsky, 2009), all implemented in PyTorch (Ansel et al., 2024).
Dataset Splits | No | The paper mentions using the Fashion-MNIST and CIFAR-10 datasets but does not explicitly provide details about training, validation, or test splits. It refers to a 'single balanced batch' for some figures, but does not describe overall dataset splits.
Hardware Specification | No | The Acknowledgements section mentions 'NYU IT High Performance Computing resources', but no specific hardware details such as GPU/CPU models, processors, or memory specifications are provided.
Software Dependencies | No | The paper states that the models are 'all implemented in PyTorch (Ansel et al., 2024)' but does not specify a version number for PyTorch or any other software dependencies.
Experiment Setup | Yes | Experimental setup. We optimize α-scaled homogeneous networks using vanilla full-batch gradient descent for 20,000 epochs. In particular, we use a fully-connected LeNet-300-100 with two hidden layers on Fashion-MNIST (Xiao et al., 2017) and a convolutional LeNet-5 with 4 hidden layers on CIFAR-10 (LeCun et al., 1998; Krizhevsky, 2009), all implemented in PyTorch (Ansel et al., 2024). We set softmax temperature to T = 1 to link the Goldilocks zone to the initialization norm. Recall from Section 3 that logit gradients of the α-scaled network fα are α^(L-1) times the respective logit gradients of the unscaled model f, so we multiply a preset learning rate η0 by α^(2-L) to ensure that updates are initially commensurate to the weights of fα.
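
Illustrative sketch 1 (not from the paper). The Research Type row repeatedly quotes "excess of positive curvature" without showing how such a quantity is measured. One common proxy, in the spirit of the Tr(H)/||H|| metric of Fort & Scherlis (2019), can be estimated with Hutchinson-style Hessian-vector products; the probe count, batch choice, and normalization below are assumptions, and the paper's own estimator may differ.

    import torch

    def curvature_ratio(model, loss_fn, inputs, targets, n_probes=10):
        # Rough Tr(H) / ||H||_F estimate at the current parameters (assumed metric).
        params = [p for p in model.parameters() if p.requires_grad]
        loss = loss_fn(model(inputs), targets)
        grads = torch.autograd.grad(loss, params, create_graph=True)
        trace_est, sq_norm_est = 0.0, 0.0
        for _ in range(n_probes):
            # Rademacher probe vectors v with entries +/-1
            vs = [torch.randint_like(p, 2) * 2.0 - 1.0 for p in params]
            # Hessian-vector product Hv via a second backward pass
            hv = torch.autograd.grad(grads, params, grad_outputs=vs, retain_graph=True)
            trace_est += sum((h * v).sum().item() for h, v in zip(hv, vs))  # E[v^T H v] = Tr(H)
            sq_norm_est += sum((h * h).sum().item() for h in hv)            # E[||Hv||^2] = ||H||_F^2
        return (trace_est / n_probes) / ((sq_norm_est / n_probes) ** 0.5)

Large positive values of this ratio correspond to the excess of positive curvature that the quoted abstract associates with the Goldilocks zone.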
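
Illustrative sketch 2 (not from the paper). The Open Datasets row names LeNet-300-100 on Fashion-MNIST and LeNet-5 on CIFAR-10, implemented in PyTorch. A minimal instantiation of the fully-connected model and both datasets is shown below; the standard 784-300-100-10 widths, the bias-free layers (a common choice for homogeneous networks), and the transforms are assumptions, and the convolutional LeNet-5 is omitted for brevity.

    import torch.nn as nn
    from torchvision import datasets, transforms

    # Standard LeNet-300-100 widths (784-300-100-10); bias-free to keep the network homogeneous
    lenet_300_100 = nn.Sequential(
        nn.Flatten(),
        nn.Linear(28 * 28, 300, bias=False), nn.ReLU(),
        nn.Linear(300, 100, bias=False), nn.ReLU(),
        nn.Linear(100, 10, bias=False),
    )

    fashion_mnist = datasets.FashionMNIST("data", train=True, download=True,
                                          transform=transforms.ToTensor())
    cifar10 = datasets.CIFAR10("data", train=True, download=True,
                               transform=transforms.ToTensor())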
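
Illustrative sketch 3 (not from the paper). The Experiment Setup row describes scaling a depth-L homogeneous network by α and multiplying a preset learning rate η0 by α^(2-L) before running vanilla full-batch gradient descent. A minimal sketch of that procedure, assuming a generic PyTorch classifier and the full training set held in memory as tensors (model, inputs, targets, and eta0 are placeholders):

    import torch

    def scale_and_train(model, inputs, targets, alpha, depth_L, eta0=0.1, epochs=20000):
        # α-scale every weight tensor of the homogeneous network
        with torch.no_grad():
            for p in model.parameters():
                p.mul_(alpha)
        # Logit gradients of the scaled model pick up a factor α^(L-1), so the step
        # size is rescaled by α^(2-L) to keep the first updates commensurate with
        # the α-scaled weights (update ~ η0·α^(2-L)·α^(L-1) = η0·α).
        lr = eta0 * alpha ** (2 - depth_L)
        optimizer = torch.optim.SGD(model.parameters(), lr=lr)  # vanilla full-batch GD
        loss_fn = torch.nn.CrossEntropyLoss()
        for _ in range(epochs):
            optimizer.zero_grad()
            loss_fn(model(inputs), targets).backward()
            optimizer.step()
        return model

The exponent follows directly from the row itself: logit gradients scale as α^(L-1), so a step of size η0·α^(2-L) yields an initial update of order α, matching the scale of the α-scaled weights.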