On skip connections and normalisation layers in deep optimisation
Authors: Lachlan MacDonald, Jack Valmadre, Hemanth Saratchandran, Simon Lucey
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We then demonstrate the utility of this framework in two respects. First, we give the only proof of which we are aware that a class of deep neural networks can be trained using gradient descent to global optima even when such optima only exist at infinity, as is the case for the cross-entropy cost. Second, we identify a novel causal mechanism by which skip connections accelerate training, which we verify predictively with ResNets on MNIST, CIFAR10, CIFAR100 and ImageNet. |
| Researcher Affiliation | Academia | Lachlan E. MacDonald, Mathematical Institute for Data Science, Johns Hopkins University (lemacdonald@protonmail.com); Jack Valmadre, Australian Institute for Machine Learning, University of Adelaide; Hemanth Saratchandran, Australian Institute for Machine Learning, University of Adelaide; Simon Lucey, Australian Institute for Machine Learning, University of Adelaide |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | All code is available at https://github.com/lemacdonald/skip-connections-normalisation/. |
| Open Datasets | Yes | Second, we identify a novel causal mechanism by which skip connections accelerate training, which we verify predictively with ResNets on MNIST, CIFAR10, CIFAR100 and ImageNet. |
| Dataset Splits | No | The paper mentions using a 'training set' and 'validation accuracies' for standard datasets (MNIST, CIFAR, ImageNet) but does not provide specific split percentages, sample counts, or explicit citations to predefined splits. For example, it does not state '80/10/10 split' or cite a paper for the exact split used. |
| Hardware Specification | Yes | Both the exploratory and final experiments for this paper were conducted using a desktop machine with two Nvidia RTX A6000 GPUs, for a total running time of approximately 150 hours. |
| Software Dependencies | No | The paper mentions using 'PyTorch' but does not specify a version number for PyTorch or any other software library, which is required for reproducibility. |
| Experiment Setup | Yes | On CIFAR10/100, our models were trained using SGD with a batch size of 128 and random crop/horizontal flip data augmentation. We ran 10 trials over each of the learning rates 0.2, 0.1, 0.05 and 0.02. On ImageNet, the models were trained using the default PyTorch ImageNet example, using SGD with weight decay of 1e-4 and momentum of 0.9, batch size of 256, and random crop/horizontal flip data augmentation. |
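
Below is a minimal PyTorch sketch of the CIFAR10/100 setup quoted in the Experiment Setup row: SGD with batch size 128, random crop / horizontal flip augmentation, and a sweep over the learning rates 0.2, 0.1, 0.05 and 0.02. The model choice (torchvision's `resnet18`), the epoch count, and the data path are illustrative assumptions, not the authors' configuration; their actual models and scripts are in the linked repository.

```python
# Hedged sketch of the reported CIFAR10 training setup (not the authors' exact script):
# SGD, batch size 128, random crop / horizontal flip augmentation,
# learning rates swept over {0.2, 0.1, 0.05, 0.02}.
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as T

transform = T.Compose([
    T.RandomCrop(32, padding=4),      # random crop augmentation
    T.RandomHorizontalFlip(),         # horizontal flip augmentation
    T.ToTensor(),
])
train_set = torchvision.datasets.CIFAR10(
    root="data", train=True, download=True, transform=transform
)
train_loader = torch.utils.data.DataLoader(
    train_set, batch_size=128, shuffle=True, num_workers=2
)

device = "cuda" if torch.cuda.is_available() else "cpu"

for lr in (0.2, 0.1, 0.05, 0.02):     # learning-rate sweep reported in the paper
    model = torchvision.models.resnet18(num_classes=10).to(device)  # placeholder architecture
    optimiser = torch.optim.SGD(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()

    model.train()
    for epoch in range(1):            # epoch count not quoted above; 1 for illustration
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            optimiser.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimiser.step()
```

For the ImageNet runs, the quoted settings correspond to an optimiser along the lines of `torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9, weight_decay=1e-4)` with batch size 256, following the default PyTorch ImageNet example referenced in the paper.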