On the Weight Dynamics of Deep Normalized Networks

Authors: Christian H.X. Ali Mehmeti-Göpel, Michael Wand

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 4. Experimental Validation. In this experimental section, we will first check the limitations of the assumption about constant base gradients and validate the predictivity of our model. Then, we will compare the predicted critical learning rate to an empirical value extracted from real training runs. Finally, we confirm that high ELR spreads correlate with network trainability in practice. ... We chose ResNet v1 (He et al., 2016a) with (Short) and without (NoShort) residual connections as examples of standard architectures. ... We therefore use 56 and 110 layer networks: ... For computer vision tasks, we work with standard image classification datasets of variable difficulty: CIFAR-10, CIFAR-100 (Krizhevsky, 2009) and ILSVRC 2012 (called ImageNet in the following) (Deng et al., 2009). ... In Figure 5 (top), we see that for the NoShort networks in regular training without warm-up (base), spreads (averaged over the training run) are very high and trainability is very low. Using skip connections (bottom), spreads are much lower and the network is able to train.
Researcher Affiliation | Academia | 1Department of Computer Science, Johannes-Gutenberg University, Mainz, Germany. Correspondence to: Christian H.X. Ali Mehmeti-Göpel <chalimeh@uni-mainz.de>.
Pseudocode | Yes | Algorithm 1 Random Walk. Let eℓ denote the number of elements of the weight vector Wℓ and ⟨·, ·⟩ the dot product.
Open Source Code | No | The paper does not provide explicit statements about releasing source code for the described methodology or links to a code repository.
Open Datasets | Yes | We chose a ResNet v1 as opposed to a v2 since in the former, the correct placement of normalization layers (ref. Section 3.1) is given without modifying the architecture. ... For computer vision tasks, we work with standard image classification datasets of variable difficulty: CIFAR-10, CIFAR-100 (Krizhevsky, 2009) and ILSVRC 2012 (called ImageNet in the following) (Deng et al., 2009).
Dataset Splits | No | The paper uses standard datasets (CIFAR-10/100 and ImageNet). Although these have conventional splits, the paper does not explicitly state the training/validation/test split percentages, sample counts, or partitioning methodology for its experiments, so precise reproduction of the data partitioning must rely on general knowledge of these datasets' common usage.
Hardware Specification | Yes | Various Nvidia GPUs were used, ranging from the GeForce GTX 1080 Ti and GeForce RTX 2080 Ti to the RTX 4090.
Software Dependencies | Yes | The experiments in the paper were made on computers running Arch Linux, Python 3.11.5, and PyTorch version 2.1.2+cu121.
Experiment Setup | Yes | We use the most basic training setting possible (vanilla SGD) and disable all possible factors that influence weight dynamics: momentum, weight decay, affine BatchNorm parameters and bias on linear layers (for a discussion, please refer to Appendix Section C). We further use different kinds of learning rate scheduling with and without warm-up; further details about the architectures and training process can be found in the Appendix. ... Table 2. Network architecture and training regime used for the CIFAR-10/100 task. ... Table 3. Network architecture and training regime used for the ImageNet task.
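The random-walk model referenced under Pseudocode (Algorithm 1) can be sketched numerically. This is a minimal illustration, assuming (as in common analyses of scale-invariant, normalized layers) that each SGD step is orthogonal to the current weight vector and has roughly constant norm, so the squared weight norm grows additively; the function and parameter names are hypothetical, not the paper's.

```python
import numpy as np

def simulate_weight_norm(n_steps=1000, eta=0.1, grad_norm=1.0, w0_norm=1.0):
    """Random-walk model of a normalized layer's weight norm under vanilla SGD.

    Assumes each gradient step is orthogonal to the current weight vector
    and has approximately constant norm, so:
        ||W_{t+1}||^2 = ||W_t||^2 + eta^2 * ||g_t||^2
    """
    sq_norm = w0_norm ** 2
    norms = [w0_norm]
    for _ in range(n_steps):
        sq_norm += (eta * grad_norm) ** 2  # orthogonal increment
        norms.append(np.sqrt(sq_norm))
    return np.array(norms)

norms = simulate_weight_norm()
# the effective learning rate eta / ||W||^2 then decays as the norm grows
elr_over_time = 0.1 / norms ** 2
```

Under these assumptions the weight norm grows like the square root of the step count, which is the mechanism behind the decaying effective learning rate.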
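The "ELR spread" quantity quoted in the Research Type row can likewise be illustrated. A minimal sketch, assuming a scale-invariant layer's effective learning rate is eta / ||W_ℓ||² and taking the spread as the max/min ratio across layers; the function name and the exact spread definition are assumptions of this sketch, not necessarily the paper's metric.

```python
import numpy as np

def elr_spread(layer_weight_norms, eta=0.1):
    """Per-layer effective learning rates and their spread (max/min ratio).

    Assumes the effective step size of a normalized (scale-invariant)
    layer is eta / ||W_l||^2.
    """
    norms = np.asarray(layer_weight_norms, dtype=float)
    elrs = eta / norms ** 2
    return elrs, float(elrs.max() / elrs.min())
```

For example, layer norms [1.0, 2.0, 4.0] give ELRs [0.1, 0.025, 0.00625] and a spread of 16.0; a large spread means some layers take far bigger effective steps than others.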
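The stripped-down training setting quoted in the Experiment Setup row can be expressed as a short PyTorch configuration fragment. This is a hypothetical sketch, not the paper's actual ResNet: layer sizes are illustrative, and only the disabled factors it names (momentum, weight decay, affine BatchNorm parameters, biases) are taken from the source.

```python
import torch
from torch import nn

# Illustrative model mirroring the paper's minimal setting:
# no bias on conv/linear layers, non-affine BatchNorm (no learnable
# gamma/beta), and vanilla SGD without momentum or weight decay.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1, bias=False),
    nn.BatchNorm2d(16, affine=False),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(16, 10, bias=False),
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.0, weight_decay=0.0)
```

With affine BatchNorm parameters and biases removed, the only trainable tensors are the weight matrices whose norm dynamics the paper studies.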