On skip connections and normalisation layers in deep optimisation

Authors: Lachlan MacDonald, Jack Valmadre, Hemanth Saratchandran, Simon Lucey

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We then demonstrate the utility of this framework in two respects. First, we give the only proof of which we are aware that a class of deep neural networks can be trained using gradient descent to global optima even when such optima only exist at infinity, as is the case for the cross-entropy cost. Second, we identify a novel causal mechanism by which skip connections accelerate training, which we verify predictively with ResNets on MNIST, CIFAR10, CIFAR100 and ImageNet. (See the residual-block sketch after the table.)
Researcher Affiliation | Academia | Lachlan E. MacDonald (Mathematical Institute for Data Science, Johns Hopkins University, lemacdonald@protonmail.com); Jack Valmadre (Australian Institute for Machine Learning, University of Adelaide); Hemanth Saratchandran (Australian Institute for Machine Learning, University of Adelaide); Simon Lucey (Australian Institute for Machine Learning, University of Adelaide)
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | All code is available at https://github.com/lemacdonald/skip-connections-normalisation/.
Open Datasets | Yes | Second, we identify a novel causal mechanism by which skip connections accelerate training, which we verify predictively with ResNets on MNIST, CIFAR10, CIFAR100 and ImageNet.
Dataset Splits | No | The paper mentions using a 'training set' and 'validation accuracies' for standard datasets (MNIST, CIFAR, ImageNet) but does not provide specific split percentages, sample counts, or explicit citations to predefined splits. For example, it does not state an '80/10/10 split' or cite a paper for the exact split used.
Hardware Specification | Yes | Both the exploratory and final experiments for this paper were conducted using a desktop machine with two Nvidia RTX A6000 GPUs, for a total running time of approximately 150 hours.
Software Dependencies | No | The paper mentions using 'PyTorch' but does not specify version numbers for PyTorch or any other software libraries, which are required for reproducibility.
Experiment Setup | Yes | On CIFAR10/100, our models were trained using SGD with a batch size of 128 and random crop/horizontal flip data augmentation. We ran 10 trials over each of the learning rates 0.2, 0.1, 0.05 and 0.02. On ImageNet, the models were trained using the default PyTorch ImageNet example, using SGD with weight decay of 1e-4 and momentum of 0.9, a batch size of 256, and random crop/horizontal flip data augmentation.
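
The quoted configuration maps onto a standard PyTorch training loop. The following is a minimal sketch of the CIFAR10 setup under stated assumptions: a torchvision ResNet-18 stands in for the authors' models, a single learning rate is taken from the reported sweep, the momentum and weight-decay values are borrowed from the quoted ImageNet settings, and the epoch count is a placeholder.

```python
# Hedged sketch of the reported CIFAR10 training setup.
# Assumptions (not from the paper): ResNet-18 architecture, 1 epoch,
# momentum/weight decay reused from the quoted ImageNet settings.
import torch
import torchvision
import torchvision.transforms as transforms

transform_train = transforms.Compose([
    transforms.RandomCrop(32, padding=4),   # random crop augmentation
    transforms.RandomHorizontalFlip(),      # horizontal flip augmentation
    transforms.ToTensor(),
])

train_set = torchvision.datasets.CIFAR10(
    root="./data", train=True, download=True, transform=transform_train)
train_loader = torch.utils.data.DataLoader(
    train_set, batch_size=128, shuffle=True, num_workers=2)

model = torchvision.models.resnet18(num_classes=10)   # assumed architecture
criterion = torch.nn.CrossEntropyLoss()

# The paper sweeps learning rates 0.2, 0.1, 0.05 and 0.02 (10 trials each);
# a single rate is shown here for brevity.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)

model.train()
for epoch in range(1):                      # epoch count is an assumption
    for inputs, targets in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()
```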
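
As a companion to the skip-connection mechanism referenced in the Research Type row, below is a minimal sketch of a pre-activation-style residual block combining a skip connection with a batch-normalisation layer. It is a generic illustration of the ingredients named in the paper's title, not the authors' exact ResNet implementation.

```python
# Generic residual block: normalisation + convolution inside the branch,
# identity skip connection around it. Illustrative only; not the paper's code.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(channels)   # normalisation layer
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)

    def forward(self, x):
        out = self.conv1(torch.relu(self.bn1(x)))
        out = self.conv2(torch.relu(self.bn2(out)))
        return x + out                        # skip connection

# Shape check: the branch preserves shape, which is what makes the identity skip possible.
block = ResidualBlock(16)
y = block(torch.randn(2, 16, 8, 8))
assert y.shape == (2, 16, 8, 8)
```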