Theoretical Characterisation of the Gauss-Newton Conditioning in Neural Networks

Authors: Jim Zhao, Sidak Pal Singh, Aurelien Lucchi

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work, we take a first step towards theoretically characterizing the conditioning of the GN matrix in neural networks. We establish tight bounds on the condition number of the GN in deep linear networks of arbitrary depth and width, which we also extend to two-layer ReLU networks. We expand the analysis to further architectural components, such as residual connections and convolutional layers. Finally, we empirically validate the bounds and uncover valuable insights into the influence of the analyzed architectural components.
Researcher Affiliation | Academia | Jim Zhao (University of Basel, Switzerland, jim.zhao@unibas.ch); Sidak Pal Singh (ETH Zürich, Switzerland, sidak.singh@inf.ethz.ch); Aurelien Lucchi (University of Basel, Switzerland, aurelien.lucchi@unibas.ch)
Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code | Yes | Code to run the experiments is provided in the supplementary material, together with instructions on how to reproduce the results.
Open Datasets | Yes | The empirical results in Figure 2a show that the derived bound seems to be tight and predictive of the trend of the condition number of GN at initialization. If the width of the hidden layer is held constant, the condition number grows with a quadratic trend. However, the condition number can be controlled if the width is scaled proportionally with the depth. This gives another explanation of why in practice the width of the network layers is scaled proportionally with the depth to enable faster network training. (Figure 2a caption): “whitened MNIST”. (Figure 5 caption): “whitened MNIST (left) and whitened Cifar-10 (right)”. (Section I.3): “if not otherwise specified, MNIST LeCun et al. [1998] and Cifar-10 Krizhevsky et al. [2009] will refer to whitened data”.
Dataset Splits | No | The paper does not provide specific training/validation/test dataset splits. The NeurIPS checklist states: “Since our work only evaluated the training setting, a data split was not necessary.”
Hardware Specification | Yes | The network was trained on a single NVIDIA GeForce RTX 3090 GPU and took around 5 minutes per run (Section D). The ViT was trained on a single NVIDIA GeForce RTX 4090 GPU (Section I.1). The ResNet20, ResNet32 and the Feed-forward network were trained on a single NVIDIA GeForce RTX 3090 GPU (Section I.1).
Software Dependencies | No | The paper mentions PyTorch but does not specify a version number (e.g., “PyTorch [Paszke et al., 2019]”). It also refers to external codebases for model implementations without specifying version numbers for dependencies.
Experiment Setup | Yes | The network was trained with SGD with a mini-batch size of 256 and a constant learning rate of 0.2 (Section D). The ViT was trained with AdamW with a learning rate of 1e-2 and weight decay of 1e-2. The ResNet20, ResNet32 and VGGnet were trained with SGD with momentum = 0.9, weight decay of 10^-4, and a learning rate of 0.1 with a step decay to 0.01 after 91 epochs for ResNet20 and ResNet32 and after 100 epochs for the VGGnet. The Feed-forward network was trained with SGD with a constant learning rate of 0.01. All networks were trained with a mini-batch size of 64... (Section I.1)
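
As a companion to the summary row above, the following is a minimal sketch, not the authors' released code, of how the condition number of the Gauss-Newton matrix could be estimated for a deep linear network at initialization. The network sizes, the synthetic stand-in for whitened data, and the PyTorch-based implementation are all assumptions made for illustration.

```python
# Hedged sketch: estimate the condition number of the Gauss-Newton matrix
# G = J^T J for a deep linear network at initialization. All sizes and the
# synthetic "whitened" inputs are illustrative assumptions, not the paper's
# exact experimental configuration.
import torch

torch.manual_seed(0)
n, d_in, d_out, width, depth = 64, 10, 1, 32, 4  # assumed problem sizes

# Deep linear network f(x) = W_L ... W_1 x, stored as a tuple of weights.
dims = [d_in] + [width] * (depth - 1) + [d_out]
weights = tuple(torch.randn(dims[i + 1], dims[i]) / dims[i] ** 0.5
                for i in range(depth))

# Synthetic zero-mean inputs as a stand-in for whitened MNIST / CIFAR-10.
X = torch.randn(n, d_in)

def flat_outputs(*ws):
    """Network outputs for all n samples, flattened into one vector."""
    out = X
    for w in ws:
        out = out @ w.T
    return out.reshape(-1)

# Jacobian of the flattened outputs with respect to every weight matrix.
jac = torch.autograd.functional.jacobian(flat_outputs, weights)
J = torch.cat([j.reshape(n * d_out, -1) for j in jac], dim=1)

# For the squared loss, the Gauss-Newton matrix is G = J^T J (up to scaling),
# so its nonzero eigenvalues are the squared singular values of J.
svals = torch.linalg.svdvals(J)
nonzero = svals[svals > 1e-7 * svals.max()]
print("condition number of GN:", ((nonzero.max() / nonzero.min()) ** 2).item())
```

Computed at initialization on whitened inputs, this is the kind of quantity the evidence quoted in the Open Datasets row refers to: with the hidden width held constant it grows with depth, while scaling width proportionally with depth keeps it under control.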
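The experiment-setup row can likewise be read as a concrete training configuration. The sketch below is a hypothetical reconstruction of the feed-forward baseline quoted from Section D (plain SGD, mini-batch size 256, constant learning rate 0.2); the architecture, the data loading, and the epoch count are placeholders rather than the authors' setup.

```python
# Hedged sketch of the training configuration quoted from Section D:
# plain SGD, mini-batch size 256, constant learning rate 0.2.
# Architecture, synthetic data, and epoch count are illustrative placeholders.
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
model = torch.nn.Sequential(            # placeholder feed-forward network
    torch.nn.Linear(784, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.2)  # constant LR, no decay
criterion = torch.nn.CrossEntropyLoss()

# Placeholder data; the actual experiments use whitened MNIST / CIFAR-10.
X = torch.randn(2048, 784)
y = torch.randint(0, 10, (2048,))
loader = DataLoader(TensorDataset(X, y), batch_size=256, shuffle=True)

for epoch in range(3):                  # epoch count is illustrative only
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        optimizer.step()
```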