Stochastic Collapse: How Gradient Noise Attracts SGD Dynamics Towards Simpler Subnetworks

Authors: Feng Chen, Daniel Kunin, Atsushi Yamamura, Surya Ganguli

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We observe empirically the existence of attractive invariant sets in trained deep neural networks, implying that SGD dynamics often collapses to simple subnetworks with either vanishing or redundant neurons. We further demonstrate how this simplifying process of stochastic collapse benefits generalization in a linear teacher-student framework.
Researcher Affiliation | Academia | Feng Chen, Daniel Kunin, Atsushi Yamamura, Surya Ganguli, Stanford University, {fengc,kunin,atsushi3,sganguli}@stanford.edu
Pseudocode | No | The paper does not contain any structured pseudocode or clearly labeled algorithm blocks.
Open Source Code | Yes | The code to reproduce the experiments in the main paper can be found at https://github.com/ccffccffcc/stochastic_collapse.
Open Datasets | Yes | We carried out all the deep learning experiments with VGG-16 [47] and ResNet-18 [48], training on the CIFAR-10 and CIFAR-100 datasets respectively [49].
Dataset Splits | No | The paper mentions training steps and evaluation but does not specify explicit train/validation/test dataset splits with percentages, sample counts, or references to predefined splits.
Hardware Specification | Yes | Our experiments were run on the Google Cloud Platform (with 4 or 8 NVIDIA A100 (40GB) GPUs). The initial code development occurred on a local cluster equipped with 10 NVIDIA TITAN X GPUs.
Software Dependencies | No | The paper mentions using SGD as an optimizer but does not provide specific version numbers for any software dependencies, libraries, or frameworks (e.g., Python, PyTorch, CUDA).
Experiment Setup | Yes | For all our training, we applied standard data augmentation and used SGD (with momentum = 0.9 and weight decay of 0.0005) as the optimizer. We trained VGG-16 for 10^5 steps on CIFAR-10 with a learning rate of 0.1 and ResNet-18 for 10^6 steps on CIFAR-100 with a learning rate of 0.02.
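
The quoted experiment setup maps onto a short training-configuration sketch, shown below as a minimal PyTorch illustration rather than the authors' actual code (their repository is linked above). The datasets, optimizer hyperparameters (SGD with momentum 0.9 and weight decay 0.0005), learning rates, and step counts are taken from the quote; the model constructor, the exact augmentation pipeline, and the batch size of 128 are assumptions made for illustration.

import torch
import torchvision
import torchvision.transforms as T

# "Standard data augmentation" is assumed to mean random crop + horizontal flip,
# the usual CIFAR recipe; normalization is omitted to keep the sketch short.
train_transform = T.Compose([
    T.RandomCrop(32, padding=4),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
])

# CIFAR-10 for the VGG-16 run; swap in CIFAR100 / ResNet-18 for the other setting.
train_set = torchvision.datasets.CIFAR10(
    root="./data", train=True, download=True, transform=train_transform
)
# Batch size is an assumption; the paper excerpt does not state it.
train_loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)

# Placeholder model: the paper uses VGG-16 / ResNet-18 variants adapted to CIFAR.
model = torchvision.models.vgg16(num_classes=10)

# Optimizer hyperparameters as stated in the quoted setup.
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.1,            # 0.02 for the ResNet-18 / CIFAR-100 runs
    momentum=0.9,
    weight_decay=5e-4,
)
criterion = torch.nn.CrossEntropyLoss()

# Train for a fixed number of SGD steps (10^5 for VGG-16, 10^6 for ResNet-18),
# matching the paper's step-based (rather than epoch-based) reporting.
num_steps, step = 10**5, 0
while step < num_steps:
    for inputs, targets in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()
        step += 1
        if step >= num_steps:
            break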