Chaotic Dynamics are Intrinsic to Neural Network Training with SGD

Authors: Luis Herrmann, Maximilian Granz, Tim Landgraf

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our four main contributions are as follows: 1. By modeling ANN training with SGD as a time-discrete dynamical system, we propose a modified SGD algorithm ensuring non-chaotic training dynamics in order to study the importance of chaos in ANN training. 2. We find empirical evidence suggesting that directions of negative curvature, and thus local chaos, cannot be removed without hurting the training performance of ANNs. 3. We show empirically that the network dynamics diverge exponentially at the beginning of training but transition asymptotically to polynomial behaviour as the model performance converges. 4. Elaborating on the previous aspect, we show that even as the model training converges, the distance between similarly initialized models continues to grow at a small pace, and that this behaviour can be modelled as the sum of a linear divergence and a random walk (a hedged divergence-tracking sketch is given after the table). Our experiments were run on single GPU nodes of a system featuring an AMD Ryzen Threadripper 1950X processor, 4x Nvidia GeForce RTX 2080 Ti GPUs (11 GB VRAM) and 64 GB RAM, as well as on a second system featuring an Intel(R) Core(TM) i5-8600K, an Nvidia GeForce GTX 1080 Ti (11 GB VRAM), an Nvidia Titan XP (12 GB VRAM) and 32 GB of RAM. Both systems run Debian GNU/Linux 11 (bullseye). As datasets, we use USPS (Hull, 1994) and Fashion-MNIST (Xiao et al., 2017) (with images subsampled to 16×16 pixels), since these datasets contain sufficiently low-dimensional, natural data to allow for the calculation of the Hessian and its eigenvalue decomposition at every training step, both for a 784-20-10 MLP and for a small 2D CNN. To modify the parameter updates of the model as proposed in Theorem 2.2, we use the PyTorch implementation of SGD and alter it slightly to implement a custom class CGD (Chaos-sensitive Gradient Descent) with the ability to filter parts of the eigenvalue spectrum (a hedged sketch of such a filtered update is given after the table).
Researcher Affiliation | Academia | Luis M. Herrmann, Center for Digital Health, AG Roland Eils, Berlin Institute of Health, Kapelle-Ufer 2, 10117 Berlin, luis.herrmann@charite.de; Maximilian Granz, FU Bio Robotics Lab, Freie Universität Berlin, Arnimallee 7, 14195 Berlin, maximilian.granz@fu-berlin.de; Tim Landgraf, FU Bio Robotics Lab, Freie Universität Berlin, Arnimallee 7, 14195 Berlin, tim.landgraf@fu-berlin.de
Pseudocode | No | The paper contains mathematical equations and theoretical derivations, but no explicitly labeled 'Pseudocode' or 'Algorithm' block.
Open Source Code | Yes | The code for all our experiments is written in Python using the PyTorch framework (Paszke et al., 2017), and is available on GitHub: https://github.com/luisherrmann/chaotic_neurips22
Open Datasets | Yes | As datasets, we use USPS (Hull, 1994) and Fashion-MNIST (Xiao et al., 2017) (with images subsampled to 16×16 pixels), since these datasets contain sufficiently low-dimensional, natural data to allow for the calculation of the Hessian and its eigenvalue decomposition at every training step, both for a 784-20-10 MLP and for a small 2D CNN. (A hedged torchvision loading sketch is given after the table.)
Dataset Splits | No | The paper mentions 'validation losses/accuracies' but does not specify the dataset splits (e.g., percentages or sample counts) used for training, validation, or testing.
Hardware Specification | Yes | Our experiments were run on single GPU nodes of a system featuring an AMD Ryzen Threadripper 1950X processor, 4x Nvidia GeForce RTX 2080 Ti GPUs (11 GB VRAM) and 64 GB RAM, as well as on a second system featuring an Intel(R) Core(TM) i5-8600K, an Nvidia GeForce GTX 1080 Ti (11 GB VRAM), an Nvidia Titan XP (12 GB VRAM) and 32 GB of RAM. Both systems run Debian GNU/Linux 11 (bullseye).
Software Dependencies | No | The paper mentions that the code is written in 'Python using the PyTorch framework (Paszke et al., 2017)' and uses 'O2Grad (Anonymous, 2022)'. While PyTorch and O2Grad are named, no version numbers for Python, PyTorch, or O2Grad are provided, which would be necessary for full reproducibility. The operating system is specified with a version: Debian GNU/Linux 11 (bullseye).
Experiment Setup | No | The paper mentions 'SGD with learning rate γ (without momentum)' and specific activation functions such as ReLU and sigmoid. However, it does not provide concrete values for the learning rate γ, the batch size, the number of epochs, or other optimizer settings that would be needed for full experimental reproducibility. (A hedged model/optimizer sketch is given after the table.)
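A minimal torchvision-based sketch of the data setup quoted in the Open Datasets row. USPS is natively 16×16, so only Fashion-MNIST is resized; the interpolation mode, normalization, and batch size below are assumptions, since the excerpt does not report them.

    from torch.utils.data import DataLoader
    from torchvision import datasets, transforms

    # Fashion-MNIST is subsampled from 28x28 to 16x16; USPS is already 16x16.
    to_16x16 = transforms.Compose([
        transforms.Resize((16, 16)),
        transforms.ToTensor(),
    ])

    usps_train = datasets.USPS("data/", train=True, download=True,
                               transform=transforms.ToTensor())
    fmnist_train = datasets.FashionMNIST("data/", train=True, download=True,
                                         transform=to_16x16)

    # Batch size is not reported in the excerpt; 64 is a placeholder.
    usps_loader = DataLoader(usps_train, batch_size=64, shuffle=True)
    fmnist_loader = DataLoader(fmnist_train, batch_size=64, shuffle=True)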
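A hedged sketch of the 784-20-10 MLP and the plain SGD optimizer mentioned in the Experiment Setup row. The learning rate is a placeholder, since the value of γ is not reported; the paper also runs sigmoid-activated variants and a small 2D CNN that are not shown here.

    import torch.nn as nn
    import torch.optim as optim

    # 784-20-10 MLP as described in the paper; the activation may be ReLU or sigmoid.
    mlp = nn.Sequential(
        nn.Flatten(),
        nn.Linear(784, 20),
        nn.ReLU(),
        nn.Linear(20, 10),
    )

    # Plain SGD with learning rate γ and no momentum; 0.01 is a placeholder value.
    optimizer = optim.SGD(mlp.parameters(), lr=0.01, momentum=0.0)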
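A minimal sketch of the curvature-filtered update described in the Research Type row, assuming the full Hessian is small enough to eigendecompose at every step (as for the paper's small models). This is not the authors' CGD implementation, which subclasses PyTorch's SGD and relies on O2Grad; the function name cgd_step and the exact filtering rule are illustrative assumptions.

    import torch
    from torch.nn.utils import parameters_to_vector, vector_to_parameters

    def cgd_step(model, loss_fn, data, target, lr=0.01, filter_negative=True):
        """One SGD-like step with negative-curvature directions projected out of the gradient."""
        params = [p for p in model.parameters() if p.requires_grad]
        loss = loss_fn(model(data), target)
        grads = torch.autograd.grad(loss, params, create_graph=True)
        flat_grad = torch.cat([g.reshape(-1) for g in grads])

        # Dense Hessian built row by row from Hessian-vector products; only
        # feasible for very small models like those used in the paper.
        n = flat_grad.numel()
        hessian = torch.zeros(n, n, device=flat_grad.device)
        for i in range(n):
            row = torch.autograd.grad(flat_grad[i], params, retain_graph=True)
            hessian[i] = torch.cat([r.reshape(-1) for r in row])

        eigvals, eigvecs = torch.linalg.eigh(hessian)
        grad_vec = flat_grad.detach()
        if filter_negative:
            # Keep only the gradient components lying in the span of
            # non-negative-curvature eigendirections (assumed filtering rule).
            basis = eigvecs[:, eigvals >= 0]
            grad_vec = basis @ (basis.T @ grad_vec)

        with torch.no_grad():
            new_params = parameters_to_vector(params) - lr * grad_vec
            vector_to_parameters(new_params, params)
        return loss.item()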
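A hedged sketch of how the divergence between two similarly initialized models (contribution 4 in the Research Type row) could be tracked. The perturbation scale, the Euclidean distance in parameter space, and the helper names perturb/param_distance are illustrative assumptions, not the paper's exact protocol.

    import copy
    import torch

    def perturb(model, eps=1e-8):
        """Return a deep copy of `model` with each parameter nudged by eps-scale Gaussian noise."""
        twin = copy.deepcopy(model)
        with torch.no_grad():
            for p in twin.parameters():
                p.add_(eps * torch.randn_like(p))
        return twin

    def param_distance(model_a, model_b):
        """Euclidean distance between the flattened parameter vectors of two models."""
        vec_a = torch.cat([p.detach().reshape(-1) for p in model_a.parameters()])
        vec_b = torch.cat([p.detach().reshape(-1) for p in model_b.parameters()])
        return torch.norm(vec_a - vec_b).item()

    # Training both copies on an identical batch sequence and logging
    # param_distance(model, twin) at every step yields the divergence curve that
    # the paper describes as exponential early in training and as a linear trend
    # plus a random walk once the loss has converged.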