Exact, Tractable Gauss-Newton Optimization in Deep Reversible Architectures Reveal Poor Generalization
Authors: Davide Buffelli, Jamie McGowan, Wangkun Xu, Alexandru Cioba, Da-shan Shiu, Guillaume Hennequin, Alberto Bernacchia
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Here, we show for the first time that exact Gauss-Newton (GN) updates take on a tractable form in a class of deep reversible architectures that are sufficiently expressive to be meaningfully applied to common benchmark datasets. We exploit this novel setting to study the training and generalization properties of the GN optimizer. We find that exact GN generalizes poorly. In the mini-batch training setting, this manifests as rapidly saturating progress even on the training loss, with parameter updates found to overfit each mini-batch without producing the features that would support generalization to other mini-batches. We show that our experiments run in the lazy regime, in which the neural tangent kernel (NTK) changes very little during the course of training. This behaviour is associated with having no significant changes in neural representations, explaining the lack of generalization. For our experiments we train RevMLPs (equations (14) and (15)) with 2 (6) blocks for MNIST [LeCun et al., 2010] (CIFAR-10; Krizhevsky, 2009)... (an empirical-NTK sketch follows this table) |
| Researcher Affiliation | Collaboration | Davide Buffelli (MediaTek Research), Jamie McGowan (MediaTek Research), Wangkun Xu (Imperial College London), Alexandru Cioba (MediaTek Research), Da-shan Shiu (MediaTek Research), Guillaume Hennequin (MediaTek Research & University of Cambridge), Alberto Bernacchia (MediaTek Research) |
| Pseudocode | No | Explanation: The paper provides mathematical equations (e.g., equations (14), (15), (16), (17)) for the RevMLPs and GN updates, but it does not present them within a formally labeled 'Pseudocode' or 'Algorithm' block. |
| Open Source Code | Yes | Full hyperparameters and additional details are reported in Appendix L, and code is provided with the submission. |
| Open Datasets | Yes | For our experiments we train RevMLPs (equations (14) and (15)) with 2 (6) blocks for MNIST [LeCun et al., 2010] (CIFAR-10; Krizhevsky, 2009)... |
| Dataset Splits | Yes | Next, we consider the full MNIST and CIFAR-10 datasets in the mini-batch setting. We follow standard train and test splits for the datasets, with a mini-batch size of n = 1024. We tune the strength of the weight decay using the validation set. |
| Hardware Specification | Yes | All experiments are performed on a single NVIDIA RTX A6000 GPU. |
| Software Dependencies | Yes | Our code is based on the PyTorch framework [Paszke et al., 2019]. In more detail, we use version 2.0 for Linux with CUDA 12.1. |
| Experiment Setup | Yes | For our experiments we train RevMLPs (equations (14) and (15)) with 2 (6) blocks for MNIST [LeCun et al., 2010] (CIFAR-10; Krizhevsky, 2009), ReLU non-linearities at all half-coupled layers, and an inverted bottleneck of size 8000, resulting in models with 12M (MNIST) and 147M (CIFAR-10) parameters. At each training iteration, we compute the pseudoinverses in equations (16), (17) using an SVD. For numerical stability we truncate the SVD to a 1% tolerance relative to the largest singular value and an absolute tolerance of 10^-5, whichever gives the smallest rank; our main findings are qualitatively robust to these tolerance levels. We tuned the learning rate for each optimizer by selecting the largest one that did not cause the loss to diverge. Weight initialization: we use standard Xavier [Glorot and Bengio, 2010] initialization for the weights, while we initialize the biases to zero. Data augmentations: for the MNIST dataset we do not use any data augmentations. For the CIFAR-10 dataset we follow the standard practice of applying random crops and resizes. (Hedged sketches of the SVD truncation and of the initialization/augmentation steps follow this table.) |
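
The "Research Type" excerpt above refers to the lazy regime, in which the empirical neural tangent kernel (NTK) changes very little during training. For reference, the standard `torch.func` recipe (PyTorch ≥ 2.0, matching the stated software dependency) for computing an empirical NTK on a small model is sketched below. This is not the paper's code: the toy network, batch sizes, and names (`empirical_ntk`, `f_single`) are illustrative assumptions, and full Jacobian contraction like this is only practical for models far smaller than the paper's 12M/147M-parameter RevMLPs.

```python
import torch
import torch.nn as nn
from torch.func import functional_call, jacrev, vmap

def empirical_ntk(model: nn.Module, x1: torch.Tensor, x2: torch.Tensor) -> torch.Tensor:
    """Empirical NTK K[i, j] = J(x1_i) J(x2_j)^T, contracted over all parameters.

    Tracking how little this kernel changes over training is one way to
    characterise the lazy regime described in the paper.
    """
    params = {k: v.detach() for k, v in model.named_parameters()}

    def f_single(p, x):
        # Evaluate the network on a single example.
        return functional_call(model, p, (x.unsqueeze(0),)).squeeze(0)

    # Per-example Jacobians with respect to every parameter tensor.
    jac1 = vmap(jacrev(f_single), (None, 0))(params, x1)
    jac2 = vmap(jacrev(f_single), (None, 0))(params, x2)
    jac1 = [j.flatten(2) for j in jac1.values()]  # each: [N1, out, P_k]
    jac2 = [j.flatten(2) for j in jac2.values()]  # each: [N2, out, P_k]

    # Sum the per-parameter contractions J1 @ J2^T -> [N1, N2, out, out].
    return sum(torch.einsum('Naf,Mbf->NMab', j1, j2) for j1, j2 in zip(jac1, jac2))

# Usage on a toy MLP (illustrative sizes, not the paper's architecture):
net = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
xa, xb = torch.randn(8, 32), torch.randn(8, 32)
K = empirical_ntk(net, xa, xb)
print(K.shape)  # torch.Size([8, 8, 10, 10])
```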
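
The "Experiment Setup" row describes computing the pseudoinverses in equations (16), (17) via an SVD truncated at a 1% tolerance relative to the largest singular value or an absolute tolerance of 10^-5, whichever yields the smaller rank. A minimal PyTorch sketch of that tolerance rule is below; the function name `truncated_pinv` and the example matrix shapes are assumptions, not taken from the released code.

```python
import torch

def truncated_pinv(A: torch.Tensor,
                   rel_tol: float = 0.01,
                   abs_tol: float = 1e-5) -> torch.Tensor:
    """Pseudoinverse via a rank-truncated SVD.

    Singular values below max(rel_tol * sigma_max, abs_tol) are dropped,
    i.e. whichever threshold gives the smaller rank wins, mirroring the
    tolerance rule quoted in the 'Experiment Setup' row.
    """
    U, S, Vh = torch.linalg.svd(A, full_matrices=False)  # S is sorted descending
    cutoff = max(rel_tol * S[0].item(), abs_tol)
    S_inv = torch.where(S > cutoff, 1.0 / S, torch.zeros_like(S))
    # A^+ = V diag(S^+) U^T
    return Vh.mT @ torch.diag(S_inv) @ U.mT

# Usage on an arbitrary (rank-deficient) matrix:
A = torch.randn(256, 64) @ torch.randn(64, 512)   # rank <= 64
A_pinv = truncated_pinv(A)
print(A_pinv.shape)  # torch.Size([512, 256])
```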
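
Likewise, the stated weight initialization (Xavier for weights, zero biases) and the CIFAR-10 augmentations ("random crops and resizes") could look roughly as follows in PyTorch/torchvision. The uniform Xavier variant and the crop scale range are assumptions, since the paper does not specify them.

```python
import torch.nn as nn
import torchvision.transforms as T

def init_weights(module: nn.Module) -> None:
    """Xavier init for linear weights, zero biases (uniform variant assumed)."""
    if isinstance(module, nn.Linear):
        nn.init.xavier_uniform_(module.weight)
        if module.bias is not None:
            nn.init.zeros_(module.bias)

# CIFAR-10: "random crops and resizes"; the scale range is an assumption.
cifar10_train_transform = T.Compose([
    T.RandomResizedCrop(32, scale=(0.8, 1.0)),
    T.ToTensor(),
])

# MNIST: no augmentation beyond tensor conversion.
mnist_train_transform = T.ToTensor()

# Example usage on a toy stack of linear layers (not the paper's RevMLP):
toy = nn.Sequential(nn.Linear(784, 8000), nn.ReLU(), nn.Linear(8000, 10))
toy.apply(init_weights)
```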